An Intermediate Representation for Optimizing Machine Learning Pipelines

Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
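As a rough illustration of what "operator pushdown and fusion" means in general, the following Python sketch compares an unoptimized chain of collection operators with a hand-fused equivalent. It is not Lara's API or IR. The `Row` type, the function names, and the sample data are all hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Tuple

Row = Tuple[float, float]  # hypothetical (feature, label) pair


def unfused(rows: Iterable[Row]) -> List[float]:
    # Three separate passes, each materializing an intermediate list.
    scaled = [(x * 2.0, y) for x, y in rows]
    shifted = [(x + 1.0, y) for x, y in scaled]
    return [x for x, y in shifted if y > 0.0]


def fused(rows: Iterable[Row]) -> List[float]:
    # The filter reads only y, which the two maps leave untouched, so it can
    # be pushed ahead of them; the maps are then fused into one expression.
    return [x * 2.0 + 1.0 for x, y in rows if y > 0.0]


rows = [(1.0, 1.0), (2.0, -1.0), (3.0, 1.0)]  # made-up sample data
assert unfused(rows) == fused(rows)
```

Both functions return the same values, but the fused version makes a single pass and skips filtered-out rows before doing any arithmetic. An optimizer with a view of the whole program could, in principle, derive the second form from the first automatically.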

Metadata
Author details: Andreas Kunft, Asterios Katsifodimos, Sebastian Schelter, Sebastian Bress, Tilmann Rabl, Volker Markl
DOI: https://doi.org/10.14778/3342263.3342633
ISSN: 2150-8097
Title of parent work (English): Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Place of publishing: New York
Publication type: Article
Language: English
Date of first publication: 2019/07/01
Publication year: 2019
Release date: 2021/01/11
Volume: 12
Issue: 11
Number of pages: 15
First page: 1553
Last page: 1567
Funding institution: EU project E2Data [780245]; German Federal Ministry of Education and Research (BMBF) [01IS18025A, 01IS18037A]; Moore-Sloan Data Science Environment at New York University
Organizational units: Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification: 0 Computer science, information and general works / 00 Computer science, knowledge and systems
Peer review: Refereed
Publishing method: Open Access / Green Open-Access