
Data preparation for duplicate detection

  • Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection.
    Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints on domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
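The sketch below is an illustrative reading of the second selection phase described in the abstract, not the authors' implementation: it greedily drops candidate preparations whose removal does not lower AUC-PR on the gold-standard sample. All names (leave_one_out_selection, score_pairs, gold_labels) are hypothetical, and scikit-learn's average_precision_score is used only as a common approximation of the area under the precision-recall curve.

# Illustrative sketch (not the authors' code) of the leave-one-out phase:
# repeatedly remove one preparation and keep the reduced set whenever the
# area under the precision-recall curve (AUC-PR) does not drop.
from sklearn.metrics import average_precision_score  # common AUC-PR estimate

def auc_pr(gold_labels, pair_scores):
    # gold_labels: 1 for true duplicate pairs, 0 otherwise (gold-standard sample)
    # pair_scores: similarity scores of the candidate record pairs
    return average_precision_score(gold_labels, pair_scores)

def leave_one_out_selection(preparations, score_pairs, gold_labels):
    # preparations: candidate data-preparation functions (hypothetical objects)
    # score_pairs(preps): recomputes pair similarities after applying 'preps'
    selected = list(preparations)
    best = auc_pr(gold_labels, score_pairs(selected))
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for prep in list(selected):
            trial = [p for p in selected if p is not prep]
            score = auc_pr(gold_labels, score_pairs(trial))
            if score >= best:  # 'prep' is redundant: dropping it does not hurt AUC-PR
                selected, best = trial, score
                improved = True
                break
    return selected, best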

Metadata
Authors: Ioannis Koumarelas, Lan Jiang, Felix Naumann
DOI:https://doi.org/10.1145/3377878
ISSN:1936-1955
ISSN:1936-1963
Parent title (English): Journal of Data and Information Quality (JDIQ)
Publisher: Association for Computing Machinery
Place of publication: New York
Publication type: Scientific article
Language: English
Date of first publication: June 13, 2020
Year of publication: 2020
Release date: January 18, 2023
Keywords / tags: data preparation; data wrangling; duplicate detection; record linkage; similarity measures
Volume: 12
Issue: 3
Article number: 15
Number of pages: 24
Organizational units: An-Institute / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification: 0 Computer science, information and general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
0 Computer science, information and general works / 02 Library and information sciences / 020 Library and information sciences
Peer review: Refereed