TY  - JOUR
A1  - Gévay, Gábor E.
A1  - Rabl, Tilmann
A1  - Breß, Sebastian
A1  - Madai-Tahy, Loránd
A1  - Quiané-Ruiz, Jorge-Arnulfo
A1  - Markl, Volker
T1  - Imperative or functional control flow handling
BT  - why not the best of both worlds?
JF  - SIGMOD record / Association for Computing Machinery, Special Interest Group on Management of Data
N2  - Modern data analysis tasks often involve control flow statements, such as the iterations in PageRank and K-means. To achieve scalability, developers usually implement these tasks in distributed dataflow systems, such as Spark and Flink. Designers of such systems have to choose between providing imperative or functional control flow constructs to users. Imperative constructs are easier to use, but functional constructs are easier to compile to an efficient dataflow job. We propose Mitos, a system where control flow is both easy to use and efficient. Mitos relies on an intermediate representation based on the static single assignment form. This allows us to abstract away from specific control flow constructs and treat any imperative control flow uniformly both when building the dataflow job and when coordinating the distributed execution.
Y1  - 2022
U6  - https://doi.org/10.1145/3542700.3542715
SN  - 0163-5808
VL  - 51
IS  - 1
SP  - 60
EP  - 67
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Gévay, Gábor E.
A1  - Rabl, Tilmann
A1  - Breß, Sebastian
A1  - Madai-Tahy, Loránd
A1  - Quiané-Ruiz, Jorge-Arnulfo
A1  - Markl, Volker
T1  - Imperative or Functional Control Flow Handling: Why not the Best of Both Worlds?
JF  - SIGMOD record
N2  - Modern data analysis tasks often involve control flow statements, such as the iterations in PageRank and K-means. To achieve scalability, developers usually implement these tasks in distributed dataflow systems, such as Spark and Flink. Designers of such systems have to choose between providing imperative or functional control flow constructs to users. Imperative constructs are easier to use, but functional constructs are easier to compile to an efficient dataflow job. We propose Mitos, a system where control flow is both easy to use and efficient. Mitos relies on an intermediate representation based on the static single assignment form. This allows us to abstract away from specific control flow constructs and treat any imperative control flow uniformly both when building the dataflow job and when coordinating the distributed execution.
Y1  - 2022
U6  - https://doi.org/10.1109/ICDE51399.2021.00127
SN  - 0163-5808
SN  - 1943-5835
VL  - 51
IS  - 1
SP  - 60
EP  - 67
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Kaitoua, Abdulrahman
A1  - Rabl, Tilmann
A1  - Markl, Volker
T1  - A distributed data exchange engine for polystores
JF  - Information technology : methods and applications of informatics and information technology
JF  - Information technology : Methoden und innovative Anwendungen der Informatik und Informationstechnik
N2  - There is an increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generating new information and creating services of higher quality. A central issue in working with heterogeneous sources is data migration: In order to share and process data in different engines, resource intensive and complex movements and transformations between computing engines, services, and stores are necessary.
Muses is a distributed, high-performance data migration engine that is able to interconnect distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines' instances in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall improvement of 30 % for pipelining jobs across multiple engines, even when we count the overhead of Muses in the execution time. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.
KW  - distributed systems
KW  - data migration
KW  - data transformation
KW  - big data
KW  - engine
KW  - data integration
Y1  - 2020
U6  - https://doi.org/10.1515/itit-2019-0037
SN  - 1611-2776
SN  - 2196-7032
VL  - 62
IS  - 3-4
SP  - 145
EP  - 156
PB  - De Gruyter
CY  - Berlin
ER  - 
TY  - JOUR
A1  - Kunft, Andreas
A1  - Katsifodimos, Asterios
A1  - Schelter, Sebastian
A1  - Bress, Sebastian
A1  - Rabl, Tilmann
A1  - Markl, Volker
T1  - An Intermediate Representation for Optimizing Machine Learning Pipelines
JF  - Proceedings of the VLDB Endowment
N2  - Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
Y1  - 2019
U6  - https://doi.org/10.14778/3342263.3342633
SN  - 2150-8097
VL  - 12
IS  - 11
SP  - 1553
EP  - 1567
PB  - Association for Computing Machinery
CY  - New York
ER  -