
Data preparation for duplicate detection

Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection.

Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints for domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
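The two-phase preparation selection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the candidate preparations, the trigram similarity measure, and the use of average precision as a stand-in for AUC-PR are all assumptions made for the example.

```python
# Illustrative sketch of two-phase preparation selection (assumed details,
# not the paper's actual operators, similarity measures, or classifier).
import numpy as np
from sklearn.metrics import average_precision_score  # stand-in for AUC-PR

# Hypothetical candidate preparations: each maps a string to a cleaned string.
PREPARATIONS = {
    "strip_whitespace": lambda s: " ".join(s.split()),
    "lowercase":        lambda s: s.lower(),
    "remove_punct":     lambda s: "".join(c for c in s if c.isalnum() or c == " "),
}

def similarity(a: str, b: str) -> float:
    """Toy pair similarity: Jaccard overlap of character trigrams."""
    grams = lambda s: {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def phase1_filter(preps, gold_pairs):
    """Phase 1: keep preparations that do not worsen similarity of known duplicates."""
    base = np.mean([similarity(a, b) for a, b, dup in gold_pairs if dup])
    return {name: fn for name, fn in preps.items()
            if np.mean([similarity(fn(a), fn(b))
                        for a, b, dup in gold_pairs if dup]) >= base}

def auc_pr(preps, gold_pairs):
    """Score the gold-standard sample with the composed preparations."""
    def apply_all(s):
        for fn in preps.values():
            s = fn(s)
        return s
    y_true = [1 if dup else 0 for _, _, dup in gold_pairs]
    y_score = [similarity(apply_all(a), apply_all(b)) for a, b, _ in gold_pairs]
    return average_precision_score(y_true, y_score)

def phase2_leave_one_out(preps, gold_pairs):
    """Phase 2: drop a preparation whenever removing it does not lower AUC-PR."""
    preps, changed = dict(preps), True
    while changed and len(preps) > 1:
        changed, full = False, auc_pr(preps, gold_pairs)
        for name in list(preps):
            reduced = {k: v for k, v in preps.items() if k != name}
            if auc_pr(reduced, gold_pairs) >= full:  # redundant preparation
                preps, changed = reduced, True
                break
    return preps

# Tiny made-up gold-standard sample: (record_a, record_b, is_duplicate)
gold = [
    ("Main St. 5,  Berlin", "main st 5 berlin", True),
    ("Jane  Doe", "jane doe", True),
    ("Main St. 5, Berlin", "Oak Ave 7, Hamburg", False),
    ("Jane Doe", "John Smith", False),
]

selected = phase2_leave_one_out(phase1_filter(PREPARATIONS, gold), gold)
print("Selected preparations:", list(selected))
```

In this sketch the pair similarity itself serves as the duplicate classifier; the paper's workflow instead trains a classifier over the prepared records and evaluates AUC-PR on the gold-standard sample.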

Metadata
Author details: Ioannis Koumarelas, Lan Jiang, Felix Naumann
DOI:https://doi.org/10.1145/3377878
ISSN:1936-1955
ISSN:1936-1963
Title of parent work (English): Journal of Data and Information Quality (JDIQ)
Publisher:Association for Computing Machinery
Place of publishing:New York
Publication type:Article
Language:English
Date of first publication:2020/06/13
Publication year:2020
Release date:2023/01/18
Tag:data preparation; data wrangling; duplicate detection; record linkage; similarity measures
Volume:12
Issue:3
Article number:15
Number of pages:24
Organizational units:An-Institute / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification: 0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
0 Computer science, information & general works / 02 Library and information sciences / 020 Library and information sciences
Peer review: Peer-reviewed