MDedup: duplicate detection with matching dependencies

Koumarelas, Ioannis; Papenbrock, Thorsten; Naumann, Felix

doi:10.14778/3377369.3377379

Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific, which is a problem if a new dataset needs to be cleaned. For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain-specific or target-data-specific configuration.
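
To illustrate what using an MD as a duplicate detection rule can look like, here is a minimal Python sketch. It is not the MDedup implementation: the attribute names, the thresholds, the sample records, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example. The sketch treats one MD as a set of per-attribute similarity thresholds and flags a record pair as a duplicate when every threshold is met.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical MD used as a rule: records that are similar enough on
# "name" and "city" are considered to describe the same entity.
MD_RULE = {"name": 0.8, "city": 0.9}


def matches(r1: dict, r2: dict, rule: dict) -> bool:
    """True if the pair meets every attribute threshold of the rule."""
    return all(similarity(r1[attr], r2[attr]) >= t for attr, t in rule.items())


def detect_duplicates(records: list, rules: list) -> list:
    """Return index pairs (i, j) that satisfy at least one rule."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if any(matches(r1, r2, rule) for rule in rules)
    ]


# Made-up sample records for illustration only.
records = [
    {"name": "Jon Smith", "city": "Berlin"},
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Maria Lopez", "city": "Madrid"},
]

duplicates = detect_duplicates(records, [MD_RULE])
print(duplicates)
```

The sketch compares all record pairs, which is quadratic in the number of records. The selection of which discovered MDs to use as rules, and the boosting step mentioned in the abstract, are outside its scope.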

Author details:	Ioannis Koumarelas, Thorsten Papenbrock, Felix Naumann
DOI:	https://doi.org/10.14778/3377369.3377379
ISSN:	2150-8097
Title of parent work (English):	Proceedings of the VLDB Endowment
Subtitle (English):	duplicate detection with matching dependencies
Publisher:	Association for Computing Machinery
Place of publication:	New York
Publication type:	Scientific article
Language:	English
Date of first publication:	2020-02-19
Year of publication:	2020
Release date:	2023-01-12
Volume:	13
Issue:	5
Number of pages:	14
First page:	712
Last page:	725
Organizational units:	Affiliated institutes / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:	0 Computer science, information and general works / 00 Computer science, knowledge and systems / 000 Computer science, information and general works
Peer review:	Refereed
