Efficient discovery of matching dependencies

Schirmer, Philipp; Papenbrock, Thorsten; Koumarelas, Ioannis; Naumann, Felix

doi:10.1145/3392778

Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They generalize functional dependencies (FDs) by matching similar rather than identical values. Because their discovery is very difficult, existing profiling algorithms either find only small subsets of all MDs or are limited to small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
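To make the notion of an MD with a data-derived threshold concrete, the following is a minimal sketch and not the HyMD algorithm itself. It assumes records are dictionaries of strings, uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, and checks every record pair by brute force; the names `similarity` and `max_rhs_threshold` are hypothetical. Given minimum similarities for the left-hand-side attributes, it returns the highest right-hand-side threshold that still holds on the data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; any other measure could be used."""
    return SequenceMatcher(None, a, b).ratio()


def max_rhs_threshold(records, lhs_thresholds, rhs_attr):
    """Return the largest RHS similarity threshold for which the MD holds.

    lhs_thresholds maps an attribute name to its minimum similarity.
    Every record pair that meets all LHS thresholds must also meet the
    RHS threshold, so the tightest valid RHS threshold is the minimum
    RHS similarity over those pairs. Returns None if no pair qualifies.
    """
    lowest = None
    for s, t in combinations(records, 2):  # brute force over all pairs
        lhs_ok = all(
            similarity(s[attr], t[attr]) >= lam
            for attr, lam in lhs_thresholds.items()
        )
        if lhs_ok:
            rhs_sim = similarity(s[rhs_attr], t[rhs_attr])
            lowest = rhs_sim if lowest is None else min(lowest, rhs_sim)
    return lowest
```

Comparing all record pairs is quadratic in the number of records, which is exactly the cost that motivates strategies such as lattice traversal and inference from record pairs mentioned in the abstract.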

Author details:	Philipp Schirmer, Thorsten Papenbrock, Ioannis Koumarelas, Felix Naumann
DOI:	https://doi.org/10.1145/3392778
ISSN:	0362-5915
ISSN:	1557-4644
Title of parent work (English):	ACM transactions on database systems : TODS
Publisher:	Association for Computing Machinery
Place of publishing:	New York
Publication type:	Article
Language:	English
Date of first publication:	2020/08/26
Publication year:	2020
Release date:	2023/01/12
Tag:	data matching; data profiling; dependency discovery; entity resolution; functional dependencies; matching dependencies; similarity measures
Volume:	45
Issue:	3
Article number:	13
Number of pages:	33
Organizational units:	An-Institute / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:	5 Natural sciences and mathematics / 51 Mathematics / 510 Mathematics
Peer review:	Refereed
