TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Data preparation for duplicate detection
JF  - Journal of data and information quality : (JDIQ)
N2  - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
KW  - data preparation
KW  - data wrangling
KW  - record linkage
KW  - duplicate detection
KW  - similarity measures
Y1  - 2020
U6  - https://doi.org/10.1145/3377878
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 3
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Hacker, Philipp
A1  - Krestel, Ralf
A1  - Grundmann, Stefan
A1  - Naumann, Felix
T1  - Explainable AI under contract and tort law
BT  - legal incentives and technical challenges
JF  - Artificial intelligence and law
N2  - This paper shows that the law, in subtle ways, may set hitherto unrecognized incentives for the adoption of explainable machine learning applications. In doing so, we make two novel contributions. First, on the legal side, we show that to avoid liability, professional actors, such as doctors and managers, may soon be legally compelled to use explainable ML models. We argue that the importance of explainability reaches far beyond data protection law, and crucially influences questions of contractual and tort liability for the use of ML models. To this effect, we conduct two legal case studies, in medical and corporate merger applications of ML. As a second contribution, we discuss the (legally required) trade-off between accuracy and explainability and demonstrate the effect in a technical case study in the context of spam classification.
KW  - explainability
KW  - explainable AI
KW  - interpretable machine learning
KW  - contract
KW  - law
KW  - tort law
KW  - explainability-accuracy trade-off
KW  - medical malpractice
KW  - corporate takeovers
Y1  - 2020
U6  - https://doi.org/10.1007/s10506-020-09260-6
SN  - 0924-8463
SN  - 1572-8382
VL  - 28
IS  - 4
SP  - 415
EP  - 439
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - JOUR
A1  - Koßmann, Jan
A1  - Papenbrock, Thorsten
A1  - Naumann, Felix
T1  - Data dependencies for query optimization
BT  - a survey
JF  - The VLDB journal : the international journal on very large data bases / publ. on behalf of the VLDB Endowment
N2  - Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.
KW  - Query optimization
KW  - Query execution
KW  - Data dependencies
KW  - Data profiling
KW  - Unique column combinations
KW  - Functional dependencies
KW  - Order dependencies
KW  - Inclusion dependencies
KW  - Relational data
KW  - SQL
Y1  - 2021
U6  - https://doi.org/10.1007/s00778-021-00676-3
SN  - 1066-8888
SN  - 0949-877X
VL  - 31
IS  - 1
SP  - 1
EP  - 22
PB  - Springer
CY  - Berlin ; Heidelberg ; New York
ER  - 
TY  - JOUR
A1  - Bonifati, Angela
A1  - Mior, Michael J.
A1  - Naumann, Felix
A1  - Noack, Nele Sina
T1  - How inclusive are we?
BT  - an analysis of gender diversity in database venues
JF  - SIGMOD record / Association for Computing Machinery, Special Interest Group on Management of Data
N2  - ACM SIGMOD, VLDB and other database organizations have committed to fostering an inclusive and diverse community, as do many other scientific organizations. Recently, different measures have been taken to advance these goals, especially for underrepresented groups. One possible measure is double-blind reviewing, which aims to hide gender, ethnicity, and other properties of the authors. We report the preliminary results of a gender diversity analysis of publications of the database community across several peer-reviewed venues, and also compare women's authorship percentages in both single-blind and double-blind venues along the years. We also obtained a cross comparison of the obtained results in data management with other relevant areas in Computer Science.
Y1  - 2022
U6  - https://doi.org/10.1145/3516431.3516438
SN  - 0163-5808
SN  - 1943-5835
VL  - 50
IS  - 4
SP  - 30
EP  - 35
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Vitagliano, Gerardo
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Detecting layout templates in complex multiregion files
JF  - Proceedings of the VLDB Endowment
N2  - Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performance in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.
Y1  - 2022
U6  - https://doi.org/10.14778/3494124.3494145
SN  - 2150-8097
VL  - 15
IS  - 3
SP  - 646
EP  - 658
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Loster, Michael
A1  - Koumarelas, Ioannis
A1  - Naumann, Felix
T1  - Knowledge transfer for entity resolution with siamese neural networks
JF  - ACM journal of data and information quality
N2  - The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity (duplicates) into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
KW  - Entity resolution
KW  - duplicate detection
KW  - transfer learning
KW  - neural networks
KW  - metric learning
KW  - similarity learning
KW  - data quality
Y1  - 2021
U6  - https://doi.org/10.1145/3410157
SN  - 1936-1955
SN  - 1936-1963
VL  - 13
IS  - 1
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Bonnet, Philippe
A1  - Dong, Xin Luna
A1  - Naumann, Felix
A1  - Tözün, Pınar
T1  - VLDB 2021
BT  - Designing a hybrid conference
JF  - SIGMOD record
N2  - The 47th International Conference on Very Large Databases (VLDB'21) was held on August 16-20, 2021 as a hybrid conference. It attracted 180 in-person attendees in Copenhagen and 840 remote attendees. In this paper, we describe our key decisions as general chairs and program committee chairs and share the lessons we learned.
Y1  - 2021
U6  - https://doi.org/10.1145/3516431.3516447
SN  - 0163-5808
SN  - 1943-5835
VL  - 50
IS  - 4
SP  - 50
EP  - 53
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Vitagliano, Gerardo
A1  - Hameed, Mazhar
A1  - Jiang, Lan
A1  - Reisener, Lucas
A1  - Wu, Eugene
A1  - Naumann, Felix
T1  - Pollock: a data loading benchmark
JF  - Proceedings of the VLDB Endowment
N2  - Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de-facto standard format to distribute and consume raw data is CSV. Yet, the plain text and flexible nature of this format make such files often difficult to parse and correctly load their content, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from non-standard CSV formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic "pollution" process to generate dialects for any given grammar. Our benchmark leverages the pollution framework for the CSV format. To guide pollution, we have surveyed thousands of real-world, publicly available CSV files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular CSV parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.
Y1  - 2023
U6  - https://doi.org/10.14778/3594512.3594518
SN  - 2150-8097
VL  - 16
IS  - 8
SP  - 1870
EP  - 1882
PB  - Association for Computing Machinery
CY  - New York
ER  -