TY  - GEN
A1  - Shaabani, Nuhad
A1  - Meinel, Christoph
T1  - Improving the efficiency of inclusion dependency detection
T2  - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
N2  - The detection of all inclusion dependencies (INDs) in an unknown dataset is at the core of any data profiling effort. Apart from the discovery of foreign key relationships, INDs can help perform data integration, integrity checking, schema (re-)design, and query optimization. With the advent of Big Data, the demand increases for efficient INDs discovery algorithms that can scale with the input data size. To this end, we propose S-INDD++ as a scalable system for detecting unary INDs in large datasets. S-INDD++ applies a new stepwise partitioning technique that helps discard a large number of attributes in early phases of the detection by processing the first partitions of smaller sizes. S-INDD++ also extends the concept of the attribute clustering to decide which attributes to be discarded based on the clustering result of each partition. Moreover, in contrast to the state-of-the-art, S-INDD++ does not require the partition to fit into the main memory-which is a highly appreciable property in the face of the ever growing datasets. We conducted an exhaustive evaluation of S-INDD++ by applying it to large datasets with thousands attributes and more than 266 million tuples. The results show the high superiority of S-INDD++ over the state-of-the-art. S-INDD++ reduced up to 50 % of the runtime in comparison with BINDER, and up to 98 % in comparison with S-INDD.
KW  - Algorithms
KW  - Data partitioning
KW  - Data profiling
KW  - Data mining
Y1  - 2018
SN  - 978-1-4503-6014-2
U6  - https://doi.org/10.1145/3269206.3271724
SP  - 207
EP  - 216
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Bläsius, Thomas
A1  - Friedrich, Tobias
A1  - Lischeid, Julius
A1  - Meeks, Kitty
A1  - Schirneck, Friedrich Martin
T1  - Efficiently enumerating hitting sets of hypergraphs arising in data profiling
JF  - Journal of computer and system sciences : JCSS
N2  - The transversal hypergraph problem asks to enumerate the minimal hitting sets of a hypergraph. If the solutions have bounded size, Eiter and Gottlob [SICOMP'95] gave an algorithm running in output-polynomial time, but whose space requirement also scales with the output. We improve this to polynomial delay and space. Central to our approach is the extension problem, deciding for a set X of vertices whether it is contained in any minimal hitting set. We show that this is one of the first natural problems to be W[3]-complete. We give an algorithm for the extension problem running in time O(m(vertical bar X vertical bar+1) n) and prove a SETH-lower bound showing that this is close to optimal. We apply our enumeration method to the discovery problem of minimal unique column combinations from data profiling. Our empirical evaluation suggests that the algorithm outperforms its worst-case guarantees on hypergraphs stemming from real-world databases.
KW  - Data profiling
KW  - Enumeration algorithm
KW  - Minimal hitting set
KW  - Transversal hypergraph
KW  - Unique column combination
KW  - W[3]-Completeness
Y1  - 2022
U6  - https://doi.org/10.1016/j.jcss.2021.10.002
SN  - 0022-0000
SN  - 1090-2724
VL  - 124
SP  - 192
EP  - 213
PB  - Elsevier
CY  - San Diego
ER  - 
TY  - JOUR
A1  - Schmidl, Sebastian
A1  - Papenbrock, Thorsten
T1  - Efficient distributed discovery of bidirectional order dependencies
JF  - The VLDB journal
N2  - Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query optimization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets.
KW  - Bidirectional order dependencies
KW  - Distributed computing
KW  - Actor
KW  - programming
KW  - Parallelization
KW  - Data profiling
KW  - Dependency discovery
Y1  - 2021
U6  - https://doi.org/10.1007/s00778-021-00683-4
SN  - 1066-8888
SN  - 0949-877X
VL  - 31
IS  - 1
SP  - 49
EP  - 74
PB  - Springer
CY  - Berlin ; Heidelberg ; New York
ER  - 
TY  - JOUR
A1  - Koßmann, Jan
A1  - Papenbrock, Thorsten
A1  - Naumann, Felix
T1  - Data dependencies for query optimization
BT  - a survey
JF  - The VLDB journal : the international journal on very large data bases / publ. on behalf of the VLDB Endowment
N2  - Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.
KW  - Query optimization
KW  - Query execution
KW  - Data dependencies
KW  - Data profiling
KW  - Unique column combinations
KW  - Functional dependencies
KW  - Order dependencies
KW  - Inclusion dependencies
KW  - Relational data
KW  - SQL
Y1  - 2021
U6  - https://doi.org/10.1007/s00778-021-00676-3
SN  - 1066-8888
SN  - 0949-877X
VL  - 31
IS  - 1
SP  - 1
EP  - 22
PB  - Springer
CY  - Berlin ; Heidelberg ; New York
ER  -