TY  - JOUR
A1  - Koßmann, Jan
A1  - Papenbrock, Thorsten
A1  - Naumann, Felix
T1  - Data dependencies for query optimization
BT  - a survey
JF  - The VLDB journal : the international journal on very large data bases / publ. on behalf of the VLDB Endowment
N2  - Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.
KW  - Query optimization
KW  - Query execution
KW  - Data dependencies
KW  - Data profiling
KW  - Unique column combinations
KW  - Functional dependencies
KW  - Order dependencies
KW  - Inclusion dependencies
KW  - Relational data
KW  - SQL
Y1  - 2021
U6  - https://doi.org/10.1007/s00778-021-00676-3
SN  - 1066-8888
SN  - 0949-877X
VL  - 31
IS  - 1
SP  - 1
EP  - 22
PB  - Springer
CY  - Berlin ; Heidelberg ; New York
ER  - 
TY  - JOUR
A1  - Caruccio, Loredana
A1  - Deufemia, Vincenzo
A1  - Naumann, Felix
A1  - Polese, Giuseppe
T1  - Discovering relaxed functional dependencies based on multi-attribute dominance
JF  - IEEE transactions on knowledge and data engineering
N2  - With the advent of big data and data lakes, data are often integrated from multiple sources. Such integrated data are often of poor quality, due to inconsistencies, errors, and so forth. One way to check the quality of data is to infer functional dependencies (fds). However, in many modern applications it might be necessary to extract properties and relationships that are not captured through fds, due to the necessity to admit exceptions, or to consider similarity rather than equality of data values. Relaxed fds (rfds) have been introduced to meet these needs, but their discovery from data adds further complexity to an already complex problem, also due to the necessity of specifying similarity and validity thresholds. We propose Domino, a new discovery algorithm for rfds that exploits the concept of dominance in order to derive similarity thresholds of attribute values while inferring rfds. An experimental evaluation on real datasets demonstrates the discovery performance and the effectiveness of the proposed algorithm.
KW  - Complexity theory
KW  - Approximation algorithms
KW  - Big Data
KW  - Distributed
KW  - databases
KW  - Semantics
KW  - Lakes
KW  - Functional dependencies
KW  - data profiling
KW  - data cleansing
Y1  - 2020
U6  - https://doi.org/10.1109/TKDE.2020.2967722
SN  - 1041-4347
SN  - 1558-2191
VL  - 33
IS  - 9
SP  - 3212
EP  - 3228
PB  - Institute of Electrical and Electronics Engineers
CY  - New York, NY
ER  - 
TY  - THES
A1  - Harmouch, Hazar
T1  - Single-column data profiling
N2  - The research area of data profiling consists of a large set of methods and processes to examine a given dataset and determine metadata about it. Typically, different data profiling tasks address different kinds of metadata, comprising either various statistics about individual columns (Single-column Analysis) or relationships among them (Dependency Discovery). Among the basic statistics about a column are data type, header, the number of unique values (the column's cardinality), maximum and minimum values, the number of null values, and the value distribution. Dependencies involve, for instance, functional dependencies (FDs), inclusion dependencies (INDs), and their approximate versions.
Data profiling has a wide range of conventional use cases, namely data exploration, cleansing, and integration. The produced metadata is also useful for database management and schema reverse engineering. Data profiling has also more novel use cases, such as big data analytics. The generated metadata describes the structure of the data at hand, how to import it, what it is about, and how much of it there is. Thus, data profiling can be considered as an important preparatory task for many data analysis and mining scenarios to assess which data might be useful and to reveal and understand a new dataset's characteristics.
In this thesis, the main focus is on the single-column analysis class of data profiling tasks. We study the impact and the extraction of three of the most important metadata about a column, namely the cardinality, the header, and the number of null values. 
First, we present a detailed experimental study of twelve cardinality estimation algorithms. We classify the algorithms and analyze their efficiency, scaling far beyond the original experiments and testing theoretical guarantees. Our results highlight their trade-offs and point out the possibility to create a parallel or a distributed version of these algorithms to cope with the growing size of modern datasets.  
Then, we present a fully automated, multi-phase system to discover human-understandable, representative, and consistent headers for a target table in cases where headers are missing, meaningless, or unrepresentative for the column values. Our evaluation on Wikipedia tables shows that 60% of the automatically discovered schemata are exact and complete. Considering  more  schema  candidates,  top-5  for  example,  increases  this  percentage  to 72%.
Finally, we formally and experimentally show the ghost and fake FDs phenomenon caused by FD discovery over datasets with missing values. We propose two efficient scores, probabilistic and likelihood-based, for estimating the genuineness of a discovered FD.  Our extensive set of experiments on real-world and semi-synthetic datasets show the effectiveness and efficiency of these scores.
N2  - Das Forschungsgebiet Data Profiling besteht aus einer Vielzahl von Methoden und Prozessen, die es erlauben Datensätze zu untersuchen und Metadaten über diese zu ermitteln. Typischerweise erzeugen verschiedene Data-Profiling-Techniken unterschiedliche Arten von Metadaten, die entweder verschiedene Statistiken einzelner Spalten (Single-Column Analysis) oder Beziehungen zwischen diesen (Dependency Discovery) umfassen. Zu den grundlegenden Statistiken einer Spalte gehören unter anderem ihr Datentyp, ihr Name, die Anzahl eindeutiger Werte (Kardinalität der Spalte), Maximal- und Minimalwerte, die Anzahl an Null-Werten sowie ihre Werteverteilung. Im Falle von Abhängigkeiten kann es sich beispielsweise um funktionale Abhängigkeiten (FDs),  Inklusionsabhängigkeiten (INDs) sowie deren approximative Varianten handeln. 
Data Profiling besitzt vielfältige Anwendungsmöglichkeiten, darunter fallen die Datenexploration, -bereinigung und -integration. Darüber hinaus sind die erzeugten Metadaten sowohl für den Einsatz in Datenbankmanagementsystemen als auch für das Reverse Engineering von Datenbankschemata hilfreich. Weiterhin finden Methoden des Data Profilings immer häufiger Verwendung in neuartigen Anwendungsfällen, wie z.B. der Analyse von Big Data. Dabei beschreiben die generierten Metadaten die Struktur der vorliegenden Daten, wie diese zu importieren sind, von was sie handeln und welchen Umfang sie haben. Somit kann das Profiling von Datenbeständen als eine wichtige, vorbereitende Aufgabe für viele Datenanalyse- und Data-Mining Szenarien angesehen werden. Sie ermöglicht die Beurteilung, welche Daten nützlich sein könnten, und erlaubt es zudem die Eigenschaften eines neuen Datensatzes aufzudecken und zu verstehen.
Der Schwerpunkt dieser Arbeit bildet das Single-Column Profiling. Dabei werden sowohl die Auswirkungen als auch die Extraktion von drei der wichtigsten Metadaten einer Spalte untersucht, nämlich ihrer Kardinalität, ihres Namens und ihrer Anzahl an Null-Werten.
Die vorliegende Arbeit beginnt mit einer detaillierten experimentellen Studie von zwölf Algorithmen zur Kardinalitätsschätzung. Diese Studie klassifiziert die Algorithmen anhand verschiedener Kriterien und analysiert ihre Effizienz. Dabei sind die Experimente im Vergleich zu den Originalpublikationen weitaus umfassender und testen die theoretischen Garantien der untersuchten Algorithmen. Unsere Ergebnisse geben Aufschluss über Abwägungen zwischen den Algorithmen und weisen zudem auf die Möglichkeit einer parallelen bzw. verteilten Algorithmenversion hin, wodurch die stetig anwachsende Datenmenge moderner Datensätze bewältigt werden könnten.  
Anschließend wird ein vollautomatisches, mehrstufiges System vorgestellt, mit dem sich im Falle fehlender, bedeutungsloser oder nicht repräsentativer Kopfzeilen einer Zieltabelle menschenverständliche, repräsentative und konsistente Kopfzeilen ermitteln lassen. Unsere Auswertung auf Wikipedia-Tabellen zeigt, dass 60% der automatisch entdeckten Schemata exakt und vollständig sind. Werden darüber hinaus mehr Schemakandidaten in Betracht gezogen, z.B. die Top-5, erhöht sich dieser Prozentsatz auf 72%.
Schließlich wird das Phänomen der Geist- und Schein-FDs formell und experimentell untersucht, welches bei der Entdeckung von FDs auf Datensätzen mit fehlenden Werten auftreten kann. Um die Echtheit einer entdeckten FD effizient abzuschätzen, schlagen wir sowohl eine probabilistische als auch eine wahrscheinlichkeitsbasierte Bewertungsmethode vor. Die Wirksamkeit und Effizienz beider Bewertungsmethoden zeigt sich in unseren umfangreichen Experimenten mit realen und halbsynthetischen Datensätzen.
KW  - Data profiling
KW  - Functional dependencies
KW  - Data quality
KW  - Schema discovery
KW  - Cardinality estimation
KW  - Metanome
KW  - Missing values
KW  - Kardinalitätsschätzung
KW  - Datenqualität
KW  - Funktionale Abhängigkeiten
KW  - Fehlende Werte
KW  - Schema-Entdeckung
Y1  - 2020
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-474554
ER  -