
Data preparation for duplicate detection

  • Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection.
    Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints on domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
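The sketch below is an illustrative reading of the second selection phase described in the abstract, not the authors' implementation: it greedily drops candidate preparations whose removal does not lower AUC-PR on the gold-standard sample. All names (leave_one_out_selection, score_pairs, gold_labels) are hypothetical, and scikit-learn's average_precision_score is used only as a common approximation of the area under the precision-recall curve.

# Illustrative sketch (not the authors' code) of the leave-one-out phase:
# repeatedly remove one preparation and keep the reduced set whenever the
# area under the precision-recall curve (AUC-PR) does not drop.
from sklearn.metrics import average_precision_score  # common AUC-PR estimate

def auc_pr(gold_labels, pair_scores):
    # gold_labels: 1 for true duplicate pairs, 0 otherwise (gold-standard sample)
    # pair_scores: similarity scores of the candidate record pairs
    return average_precision_score(gold_labels, pair_scores)

def leave_one_out_selection(preparations, score_pairs, gold_labels):
    # preparations: candidate data-preparation functions (hypothetical objects)
    # score_pairs(preps): recomputes pair similarities after applying 'preps'
    selected = list(preparations)
    best = auc_pr(gold_labels, score_pairs(selected))
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for prep in list(selected):
            trial = [p for p in selected if p is not prep]
            score = auc_pr(gold_labels, score_pairs(trial))
            if score >= best:  # 'prep' is redundant: dropping it does not hurt AUC-PR
                selected, best = trial, score
                improved = True
                break
    return selected, best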

Metadata
Authors: Ioannis Koumarelas, Lan Jiang, Felix Naumann
DOI:https://doi.org/10.1145/3377878
ISSN:1936-1955
ISSN:1936-1963
Parent title (English): Journal of Data and Information Quality (JDIQ)
Publisher: Association for Computing Machinery
Place of publication: New York
Publication type: Scientific article
Language: English
Date of first publication: June 13, 2020
Year of publication: 2020
Release date: January 18, 2023
Keywords / tags: data preparation; data wrangling; duplicate detection; record linkage; similarity measures
Volume: 12
Issue: 3
Article number: 15
Number of pages: 24
Organizational units: An-Institute / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification: 0 Computer science, information and general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
0 Computer science, information and general works / 02 Library and information sciences / 020 Library and information sciences
Peer review: Refereed