TY  - JOUR
A1  - Caruccio, Loredana
A1  - Deufemia, Vincenzo
A1  - Naumann, Felix
A1  - Polese, Giuseppe
T1  - Discovering relaxed functional dependencies based on multi-attribute dominance
JF  - IEEE transactions on knowledge and data engineering
N2  - With the advent of big data and data lakes, data are often integrated from multiple sources. Such integrated data are often of poor quality, due to inconsistencies, errors, and so forth. One way to check the quality of data is to infer functional dependencies (FDs). However, in many modern applications it might be necessary to extract properties and relationships that are not captured through FDs, due to the necessity to admit exceptions, or to consider similarity rather than equality of data values. Relaxed FDs (RFDs) have been introduced to meet these needs, but their discovery from data adds further complexity to an already complex problem, also due to the necessity of specifying similarity and validity thresholds. We propose Domino, a new discovery algorithm for RFDs that exploits the concept of dominance in order to derive similarity thresholds of attribute values while inferring RFDs. An experimental evaluation on real datasets demonstrates the discovery performance and the effectiveness of the proposed algorithm.
KW  - Complexity theory
KW  - Approximation algorithms
KW  - Big Data
KW  - Distributed databases
KW  - Semantics
KW  - Lakes
KW  - Functional dependencies
KW  - data profiling
KW  - data cleansing
Y1  - 2020
U6  - https://doi.org/10.1109/TKDE.2020.2967722
SN  - 1041-4347
SN  - 1558-2191
VL  - 33
IS  - 9
SP  - 3212
EP  - 3228
PB  - Institute of Electrical and Electronics Engineers
CY  - New York, NY
ER  - 
TY  - JOUR
A1  - Koßmann, Jan
A1  - Papenbrock, Thorsten
A1  - Naumann, Felix
T1  - Data dependencies for query optimization: a survey
JF  - The VLDB journal : the international journal on very large data bases / publ. on behalf of the VLDB Endowment
N2  - Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.
KW  - Query optimization
KW  - Query execution
KW  - Data dependencies
KW  - Data profiling
KW  - Unique column combinations
KW  - Functional dependencies
KW  - Order dependencies
KW  - Inclusion dependencies
KW  - Relational data
KW  - SQL
Y1  - 2021
U6  - https://doi.org/10.1007/s00778-021-00676-3
SN  - 1066-8888
SN  - 0949-877X
VL  - 31
IS  - 1
SP  - 1
EP  - 22
PB  - Springer
CY  - Berlin ; Heidelberg ; New York
ER  - 
TY  - JOUR
A1  - Vitagliano, Gerardo
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Detecting layout templates in complex multiregion files
JF  - Proceedings of the VLDB Endowment
N2  - Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performances in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.
Y1  - 2022
U6  - https://doi.org/10.14778/3494124.3494145
SN  - 2150-8097
VL  - 15
IS  - 3
SP  - 646
EP  - 658
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Loster, Michael
A1  - Koumarelas, Ioannis
A1  - Naumann, Felix
T1  - Knowledge transfer for entity resolution with siamese neural networks
JF  - ACM journal of data and information quality
N2  - The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity (duplicates) into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
KW  - Entity resolution
KW  - duplicate detection
KW  - transfer learning
KW  - neural networks
KW  - metric learning
KW  - similarity learning
KW  - data quality
Y1  - 2021
U6  - https://doi.org/10.1145/3410157
SN  - 1936-1955
SN  - 1936-1963
VL  - 13
IS  - 1
PB  - Association for Computing Machinery
CY  - New York
ER  -