TY - JOUR A1 - Draisbach, Uwe A1 - Christen, Peter A1 - Naumann, Felix T1 - Transforming pairwise duplicates to entity clusters for high-quality duplicate detection JF - ACM Journal of Data and Information Quality N2 - Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result.
We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available. KW - Record linkage KW - data matching KW - entity resolution KW - deduplication KW - clustering Y1 - 2019 U6 - https://doi.org/10.1145/3352591 SN - 1936-1955 SN - 1936-1963 VL - 12 IS - 1 SP - 1 EP - 30 PB - Association for Computing Machinery CY - New York ER - TY - JOUR A1 - Schirmer, Philipp A1 - Papenbrock, Thorsten A1 - Koumarelas, Ioannis A1 - Naumann, Felix T1 - Efficient discovery of matching dependencies JF - ACM transactions on database systems : TODS N2 - Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB. KW - matching dependencies KW - functional dependencies KW - dependency discovery KW - data profiling KW - data matching KW - entity resolution KW - similarity measures Y1 - 2020 U6 - https://doi.org/10.1145/3392778 SN - 0362-5915 SN - 1557-4644 VL - 45 IS - 3 PB - Association for Computing Machinery CY - New York ER -