TY  - JOUR
A1  - Schirmer, Philipp
A1  - Papenbrock, Thorsten
A1  - Koumarelas, Ioannis
A1  - Naumann, Felix
T1  - Efficient discovery of matching dependencies
JF  - ACM transactions on database systems : TODS
N2  - Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They generalize functional dependencies (FDs) by matching similar rather than identical elements. Because their discovery is very difficult, existing profiling algorithms either find only small subsets of all MDs or are limited to small datasets.
We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
KW  - matching dependencies
KW  - functional dependencies
KW  - dependency discovery
KW  - data profiling
KW  - data matching
KW  - entity resolution
KW  - similarity measures
Y1  - 2020
U6  - https://doi.org/10.1145/3392778
SN  - 0362-5915
SN  - 1557-4644
VL  - 45
IS  - 3
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - THES
A1  - Ong, James Kwan Yau
T1  - The predictability problem
T1  - Das Vorhersagbarkeitsproblem
N2  - Wir versuchen herauszufinden, ob das subjektive Maß der Cloze-Vorhersagbarkeit mit der Kombination objektiver Maße (semantische und n-gram-Maße) geschätzt werden kann, die auf den statistischen Eigenschaften von Textkorpora beruhen. Die semantischen Maße werden entweder durch Abfragen von Internet-Suchmaschinen oder durch die Anwendung der Latent Semantic Analysis gebildet, während die n-gram-Wortmaße allein auf den Ergebnissen von Internet-Suchmaschinen basieren. Weiterhin untersuchen wir die Rolle der Cloze-Vorhersagbarkeit in SWIFT, einem Modell der Blickkontrolle, und wägen ab, ob andere Parameter den der Vorhersagbarkeit ersetzen können. Unsere Ergebnisse legen nahe, dass ein computationales Modell, welches Vorhersagbarkeitswerte berechnet, nicht nur Maße beachten muss, die die Relatiertheit eines Wortes zum Kontext darstellen; das Vorhandensein eines Maßes bezüglich der Nicht-Relatiertheit ist von ebenso großer Bedeutung. Obwohl hier jedoch nur Relatiertheits-Maße zur Verfügung stehen, sollte SWIFT ebensogute Ergebnisse liefern, wenn wir Cloze-Vorhersagbarkeit mit unseren Maßen ersetzen.
N2  - We try to determine whether it is possible to approximate the subjective Cloze predictability measure with two types of objective measures, semantic and word n-gram measures, based on the statistical properties of text corpora. The semantic measures are constructed either by querying Internet search engines or by applying Latent Semantic Analysis, while the word n-gram measures depend solely on the results of Internet search engines. We also analyse the role of Cloze predictability in the SWIFT eye movement model, and evaluate whether other parameters might be able to take the place of predictability. Our results suggest that a computational model that generates predictability values not only needs to use measures that can determine the relatedness of a word to its context; the presence of measures that assert unrelatedness is just as important. Although we only have similarity measures, we nevertheless predict that SWIFT should perform just as well when we replace Cloze predictability with our measures.
KW  - Cloze-Vorhersagbarkeit
KW  - Blickbewegungen
KW  - Latente-Semantische-Analyse
KW  - Wort-n-Gramme-Wahrscheinlichkeit
KW  - Ähnlichkeit-Masse
KW  - Cloze predictability
KW  - eye movements
KW  - Latent Semantic Analysis
KW  - word n-gram probability
KW  - similarity measures
Y1  - 2007
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-15025
ER  - 
TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Kroschk, Axel
A1  - Mosley, Clifford
A1  - Naumann, Felix
T1  - Experience: Enhancing address matching with geocoding and similarity measure selection
JF  - Journal of Data and Information Quality
N2  - Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. However, in most cases, problems do arise, for instance, as a result of data errors or data integrated from multiple sources or received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure that, first, enriches each record with a more complete representation of the address information through geocoding and reverse-geocoding and, second, selects the best similarity measure for each address attribute, which ultimately helps the classifier achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use case.
KW  - Address matching
KW  - record linkage
KW  - duplicate detection
KW  - similarity measures
KW  - conditional functional dependencies
KW  - address normalization
KW  - address parsing
KW  - geocoding
KW  - geographic information systems
KW  - random forest
Y1  - 2018
U6  - https://doi.org/10.1145/3232852
SN  - 1936-1955
VL  - 10
IS  - 2
SP  - 1
EP  - 16
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Data preparation for duplicate detection
JF  - Journal of Data and Information Quality : (JDIQ)
N2  - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints on domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
KW  - data preparation
KW  - data wrangling
KW  - record linkage
KW  - duplicate detection
KW  - similarity measures
Y1  - 2020
U6  - https://doi.org/10.1145/3377878
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 3
PB  - Association for Computing Machinery
CY  - New York
ER  -