An Intermediate Representation for Optimizing Machine Learning Pipelines

Kunft, Andreas; Katsifodimos, Asterios; Schelter, Sebastian; Bress, Sebastian; Rabl, Tilmann; Markl, Volker

doi:10.14778/3342263.3342633

Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.

Metadata
Authors:	Andreas Kunft, Asterios Katsifodimos, Sebastian Schelter, Sebastian Bress, Tilmann Rabl, Volker Markl
DOI:	https://doi.org/10.14778/3342263.3342633
ISSN:	2150-8097
Title of parent work (English):	Proceedings of the VLDB Endowment
Publisher:	Association for Computing Machinery
Place of publication:	New York
Publication type:	Scientific article
Language:	English
Date of first publication:	1 July 2019
Year of publication:	2019
Release date:	11 January 2021
Volume:	12
Issue:	11
Number of pages:	15
First page:	1553
Last page:	1567
Funding institution:	EU project E2Data [780245]; German Federal Ministry of Education and Research (BMBF) [01IS18025A, 01IS18037A]; Moore-Sloan Data Science Environment at New York University
Organizational units:	Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification:	0 Computer science, information and general works / 00 Computer science, knowledge and systems
Peer review:	Refereed
Publication route:	Open Access / Green Open Access
