TY - JOUR
A1 - Loster, Michael
A1 - Koumarelas, Ioannis
A1 - Naumann, Felix
T1 - Knowledge transfer for entity resolution with Siamese neural networks
JF - Journal of Data and Information Quality (JDIQ)
N2 - The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity, so-called duplicates, into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is time-consuming and requires extensive domain expertise.
We propose a deep Siamese neural network capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. Thanks to the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluate our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible but, in our experiments, led to an improvement in F-measure of up to +4.7 percent.
KW - Entity resolution
KW - duplicate detection
KW - transfer learning
KW - neural networks
KW - metric learning
KW - similarity learning
KW - data quality
Y1 - 2021
U6 - https://doi.org/10.1145/3410157
SN - 1936-1955
SN - 1936-1963
VL - 13
IS - 1
PB - Association for Computing Machinery
CY - New York
ER -
TY - JOUR
A1 - Koumarelas, Ioannis
A1 - Jiang, Lan
A1 - Naumann, Felix
T1 - Data preparation for duplicate detection
JF - Journal of Data and Information Quality (JDIQ)
N2 - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain level of data quality has to be guaranteed by eliminating a number of different errors that may appear in the data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection.
Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints on domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and identifies the redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
KW - data preparation
KW - data wrangling
KW - record linkage
KW - duplicate detection
KW - similarity measures
Y1 - 2020
U6 - https://doi.org/10.1145/3377878
SN - 1936-1955
SN - 1936-1963
VL - 12
IS - 3
PB - Association for Computing Machinery
CY - New York
ER -