TY  - CPAPER
A1  - Kruse, Sebastian
A1  - Kaoudi, Zoi
A1  - Quiane-Ruiz, Jorge-Arnulfo
A1  - Chawla, Sanjay
A1  - Naumann, Felix
A1  - Contreras-Rojas, Bertty
T1  - Optimizing Cross-Platform Data Movement
T2  - 2019 IEEE 35th International Conference on Data Engineering (ICDE)
N2  - Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner. For the optimizer to be effective, it must consider data movement costs across different data processing platforms. In this paper, we present the graph-based data movement strategy used by RHEEM, our open-source cross-platform system. In particular, we (i) model the data movement problem as a new graph problem, which we prove to be NP-hard, and (ii) propose a novel graph exploration algorithm, which allows RHEEM to discover multiple hidden opportunities for cross-platform data processing.
Y1  - 2019
SN  - 978-1-5386-7474-1
SN  - 978-1-5386-7475-8
U6  - https://doi.org/10.1109/ICDE.2019.00162
SN  - 1084-4627
SN  - 1063-6382
SP  - 1642
EP  - 1645
PB  - IEEE
CY  - New York
ER  - 
TY  - JOUR
A1  - Draisbach, Uwe
A1  - Christen, Peter
A1  - Naumann, Felix
T1  - Transforming pairwise duplicates to entity clusters for high-quality duplicate detection
JF  - ACM Journal of Data and Information Quality
N2  - Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
KW  - Record linkage
KW  - data matching
KW  - entity resolution
KW  - deduplication
KW  - clustering
Y1  - 2019
U6  - https://doi.org/10.1145/3352591
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 1
SP  - 1
EP  - 30
PB  - Association for Computing Machinery
CY  - New York
ER  -