Transforming pairwise duplicates to entity clusters for high-quality duplicate detection

Draisbach, Uwe; Christen, Peter; Naumann, Felix

doi:10.1145/3352591

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result.

We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.…
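
The inconsistency described in the abstract can be illustrated with a minimal sketch. It is not one of the paper's algorithms (EMCC and Markov Clustering are not reproduced here). It only builds the transitive clusters from pairwise duplicates and checks each one. The function names, the record identifiers, and the reading of "consistent" as "every pair of records in the cluster was classified as duplicate" are assumptions made for illustration.

```python
from itertools import combinations


def transitive_clusters(records, duplicate_pairs):
    """Group records into connected components (transitive closure) via union-find."""
    parent = {r: r for r in records}

    def find(r):
        # Follow parent links to the root, shortening the path on the way.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


def is_consistent(cluster, duplicate_pairs):
    """Assumed definition: every pair of records in the cluster is a classified duplicate."""
    pairs = {frozenset(p) for p in duplicate_pairs}
    return all(frozenset(p) in pairs for p in combinations(cluster, 2))


# Hypothetical input: r1-r2 and r2-r3 were classified as duplicates, r1-r3 was not.
records = ["r1", "r2", "r3", "r4", "r5"]
duplicate_pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]

for cluster in transitive_clusters(records, duplicate_pairs):
    print(sorted(cluster), is_consistent(cluster, duplicate_pairs))
# ['r1', 'r2', 'r3'] False
# ['r4', 'r5'] True
```

The first cluster arises only through transitivity, so it contains a pair (r1, r3) that was never classified as duplicate. Clusters of this kind are what a subsequent clustering step would try to repair.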

Metadata
Authors:	Uwe Draisbach, Peter Christen, Felix Naumann
DOI:	https://doi.org/10.1145/3352591
ISSN:	1936-1955
ISSN:	1936-1963
Title of parent work (English):	ACM Journal of Data and Information Quality
Publisher:	Association for Computing Machinery
Place of publication:	New York
Publication type:	Scientific article
Language:	English
Date of first publication:	7 December 2019
Year of publication:	2019
Release date:	21 March 2023
Free keyword / tag:	Record linkage; clustering; data matching; deduplication; entity resolution
Volume:	12
Issue:	1
Article number:	3
Number of pages:	30
First page:	1
Last page:	30
Organizational units:	Affiliated institutes / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:	0 Computer science, information and general works / 00 Computer science, knowledge, systems / 000 Computer science, information and general works
Peer review:	Refereed
