TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Papenbrock, Thorsten
A1  - Naumann, Felix
T1  - MDedup
BT  - duplicate detection with matching dependencies
JF  - Proceedings of the VLDB Endowment
N2  - Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific, which is a problem if a new dataset needs to be cleaned. For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain or target data-specific configuration.
Y1  - 2020
U6  - https://doi.org/10.14778/3377369.3377379
SN  - 2150-8097
VL  - 13
IS  - 5
SP  - 712
EP  - 725
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Data preparation for duplicate detection
JF  - Journal of Data and Information Quality (JDIQ)
N2  - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
KW  - data preparation
KW  - data wrangling
KW  - record linkage
KW  - duplicate detection
KW  - similarity measures
Y1  - 2020
U6  - https://doi.org/10.1145/3377878
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 3
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - BOOK
A1  - Draisbach, Uwe
A1  - Naumann, Felix
A1  - Szott, Sascha
A1  - Wonneberg, Oliver
T1  - Adaptive windows for duplicate detection
N2  - Duplicate detection is the task of identifying all groups of records within a data set that each represent the same real-world entity. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records and (ii) data sets might have a high volume, making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition of such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaptation strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
N2  - Duplikaterkennung beschreibt das Auffinden von mehreren Datensätzen, die das gleiche Realwelt-Objekt repräsentieren. Diese Aufgabe ist nicht trivial, da sich (i) die Datensätze geringfügig unterscheiden können, so dass Ähnlichkeitsmaße für einen paarweisen Vergleich benötigt werden, und (ii) aufgrund der Datenmenge ein vollständiger, paarweiser Vergleich nicht möglich ist. Zur Lösung des zweiten Problems existieren verschiedene Algorithmen, die die Datenmenge partitionieren und nur noch innerhalb der Partitionen Vergleiche durchführen. Einer dieser Algorithmen ist die Sorted-Neighborhood-Methode (SNM), welche Daten anhand eines Schlüssels sortiert und dann ein Fenster über die sortierten Daten schiebt. Vergleiche werden nur innerhalb dieses Fensters durchgeführt. Wir beschreiben verschiedene Variationen der Sorted-Neighborhood-Methode, die auf variierenden Fenstergrößen basieren. Diese Ansätze basieren auf der Intuition, dass Bereiche mit größerer und geringerer Ähnlichkeiten innerhalb der sortierten Datensätze existieren, für die entsprechend größere bzw. kleinere Fenstergrößen sinnvoll sind. Wir beschreiben und evaluieren verschiedene Adaptierungs-Strategien, von denen nachweislich einige bezüglich Effizienz besser sind als die originale Sorted-Neighborhood-Methode (gleiches Ergebnis bei weniger Vergleichen).
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 49 
KW  - Informationssysteme
KW  - Datenqualität
KW  - Datenintegration
KW  - Duplikaterkennung
KW  - Duplicate Detection
KW  - Data Quality
KW  - Data Integration
KW  - Information Systems
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-53007
SN  - 978-3-86956-143-1
SN  - 1613-5652
SN  - 2191-1665
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Kroschk, Axel
A1  - Mosley, Clifford
A1  - Naumann, Felix
T1  - Experience: Enhancing address matching with geocoding and similarity measure selection
JF  - Journal of Data and Information Quality
N2  - Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. However, in most cases, problems do arise, for instance, as a result of data errors or data integrated from multiple sources or received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure to, first, enrich each record with a more complete representation of the address information through geocoding and reverse-geocoding and, second, select the best similarity measure for each address attribute, which finally helps the classifier to achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use-case.
KW  - Address matching
KW  - record linkage
KW  - duplicate detection
KW  - similarity measures
KW  - conditional functional dependencies
KW  - address normalization
KW  - address parsing
KW  - geocoding
KW  - geographic information systems
KW  - random forest
Y1  - 2018
U6  - https://doi.org/10.1145/3232852
SN  - 1936-1955
VL  - 10
IS  - 2
SP  - 1
EP  - 16
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Hameed, Mazhar
A1  - Naumann, Felix
T1  - Data Preparation
BT  - a survey of commercial tools
JF  - SIGMOD record
N2  - Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools, and (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.
KW  - data quality
KW  - data cleaning
KW  - data wrangling
Y1  - 2020
U6  - https://doi.org/10.1145/3444831.3444835
SN  - 0163-5808
SN  - 1943-5835
VL  - 49
IS  - 3
SP  - 18
EP  - 29
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Schirmer, Philipp
A1  - Papenbrock, Thorsten
A1  - Koumarelas, Ioannis
A1  - Naumann, Felix
T1  - Efficient discovery of matching dependencies
JF  - ACM transactions on database systems : TODS
N2  - Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
KW  - matching dependencies
KW  - functional dependencies
KW  - dependency discovery
KW  - data profiling
KW  - data matching
KW  - entity resolution
KW  - similarity measures
Y1  - 2020
U6  - https://doi.org/10.1145/3392778
SN  - 0362-5915
SN  - 1557-4644
VL  - 45
IS  - 3
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Hacker, Philipp
A1  - Krestel, Ralf
A1  - Grundmann, Stefan
A1  - Naumann, Felix
T1  - Explainable AI under contract and tort law
BT  - legal incentives and technical challenges
JF  - Artificial intelligence and law
N2  - This paper shows that the law, in subtle ways, may set hitherto unrecognized incentives for the adoption of explainable machine learning applications. In doing so, we make two novel contributions. First, on the legal side, we show that to avoid liability, professional actors, such as doctors and managers, may soon be legally compelled to use explainable ML models. We argue that the importance of explainability reaches far beyond data protection law, and crucially influences questions of contractual and tort liability for the use of ML models. To this effect, we conduct two legal case studies, in medical and corporate merger applications of ML. As a second contribution, we discuss the (legally required) trade-off between accuracy and explainability and demonstrate the effect in a technical case study in the context of spam classification.
KW  - explainability
KW  - explainable AI
KW  - interpretable machine learning
KW  - contract
KW  - law
KW  - tort law
KW  - explainability-accuracy trade-off
KW  - medical malpractice
KW  - corporate takeovers
Y1  - 2020
U6  - https://doi.org/10.1007/s10506-020-09260-6
SN  - 0924-8463
SN  - 1572-8382
VL  - 28
IS  - 4
SP  - 415
EP  - 439
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - JOUR
A1  - Draisbach, Uwe
A1  - Christen, Peter
A1  - Naumann, Felix
T1  - Transforming pairwise duplicates to entity clusters for high-quality duplicate detection
JF  - ACM Journal of Data and Information Quality
N2  - Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
KW  - Record linkage
KW  - data matching
KW  - entity resolution
KW  - deduplication
KW  - clustering
Y1  - 2019
U6  - https://doi.org/10.1145/3352591
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 1
SP  - 1
EP  - 30
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - BOOK
A1  - Lange, Dustin
A1  - Böhm, Christoph
A1  - Naumann, Felix
T1  - Extracting structured information from Wikipedia articles to populate infoboxes
N2  - Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes. With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values for independently extracting value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.
N2  - Ungefähr jeder dritte Wikipedia-Artikel enthält eine Infobox - eine Tabelle, die wichtige Fakten über das beschriebene Thema in Attribut-Wert-Form darstellt. Das Schema einer Infobox, d.h. die Attribute, die für ein Konzept verwendet werden können, wird durch ein Infobox-Template definiert. Häufig geben Autoren nicht für alle Template-Attribute Werte an, wodurch unvollständige Infoboxen entstehen. Mit iPopulator stellen wir ein System vor, welches automatisch Infoboxen von Wikipedia-Artikeln durch Extrahieren von Attributwerten aus dem Artikeltext befüllt. Im Unterschied zu früheren Arbeiten erkennt iPopulator die Struktur von Attributwerten und nutzt diese aus, um die einzelnen Bestandteile von Attributwerten unabhängig voneinander zu extrahieren. Wir haben iPopulator auf der gesamten Menge der Infobox-Templates getestet und analysieren detailliert die Effektivität. Wir erreichen beispielsweise für die Extraktion einen durchschnittlichen Precision-Wert von 91% für 1.727 verschiedene Infobox-Template-Attribute.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 38 
KW  - Informationsextraktion
KW  - Wikipedia
KW  - Linked Data
KW  - Information Extraction
KW  - Wikipedia
KW  - Linked Data
Y1  - 2010
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-45714
SN  - 978-3-86956-081-6
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Bauckmann, Jana
A1  - Leser, Ulf
A1  - Naumann, Felix
T1  - Efficient and exact computation of inclusion dependencies for data integration
N2  - Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult, because in principle for each pair of attributes in the database each pair of data values must be compared. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. With Spider, we present an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics to reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, and partial INDs, which are true INDs for all but a certain number of values. This last type is particularly relevant when integrating dirty data, as is often the case in the life sciences domain, which is our driving motivation.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 34 
KW  - Metadatenentdeckung
KW  - Metadatenqualität
KW  - Schemaentdeckung
KW  - Datenanalyse
KW  - Datenintegration
KW  - metadata discovery
KW  - metadata quality
KW  - schema discovery
KW  - data profiling
KW  - data integration
Y1  - 2010
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-41396
SN  - 978-3-86956-048-9
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  -