TY - JOUR A1 - Koumarelas, Ioannis A1 - Papenbrock, Thorsten A1 - Naumann, Felix T1 - MDedup BT - duplicate detection with matching dependencies JF - Proceedings of the VLDB Endowment N2 - Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific, which is a problem if a new dataset needs to be cleaned. For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain or target data-specific configuration. Y1 - 2020 U6 - https://doi.org/10.14778/3377369.3377379 SN - 2150-8097 VL - 13 IS - 5 SP - 712 EP - 725 PB - Association for Computing Machinery CY - New York ER - TY - JOUR A1 - Birnick, Johann A1 - Bläsius, Thomas A1 - Friedrich, Tobias A1 - Naumann, Felix A1 - Papenbrock, Thorsten A1 - Schirneck, Friedrich Martin T1 - Hitting set enumeration with partial information for unique column combination discovery JF - Proceedings of the VLDB Endowment N2 - Unique column combinations (UCCs) are a fundamental concept in relational databases. They identify entities in the data and support various data management activities. Still, UCCs are usually not explicitly defined and need to be discovered. State-of-the-art data profiling algorithms are able to efficiently discover UCCs in moderately sized datasets, but they tend to fail on large and, in particular, on wide datasets due to run time and memory limitations.
In this paper, we introduce HPIValid, a novel UCC discovery algorithm that implements a faster and more resource-saving search strategy. HPIValid models the metadata discovery as a hitting set enumeration problem in hypergraphs. In this way, it combines efficient discovery techniques from data profiling research with the most recent theoretical insights into enumeration algorithms. Our evaluation shows that HPIValid is not only orders of magnitude faster than related work, it also has a much smaller memory footprint. Y1 - 2020 U6 - https://doi.org/10.14778/3407790.3407824 SN - 2150-8097 VL - 13 IS - 11 SP - 2270 EP - 2283 PB - Association for Computing Machinery CY - [New York, NY] ER - TY - JOUR A1 - Schmidl, Sebastian A1 - Papenbrock, Thorsten T1 - Efficient distributed discovery of bidirectional order dependencies JF - The VLDB journal N2 - Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query optimization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets. KW - Bidirectional order dependencies KW - Distributed computing KW - Actor KW - programming KW - Parallelization KW - Data profiling KW - Dependency discovery Y1 - 2021 U6 - https://doi.org/10.1007/s00778-021-00683-4 SN - 1066-8888 SN - 0949-877X VL - 31 IS - 1 SP - 49 EP - 74 PB - Springer CY - Berlin ; Heidelberg ; New York ER - TY - JOUR A1 - Schirmer, Philipp A1 - Papenbrock, Thorsten A1 - Koumarelas, Ioannis A1 - Naumann, Felix T1 - Efficient discovery of matching dependencies JF - ACM transactions on database systems : TODS N2 - Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB. KW - matching dependencies KW - functional dependencies KW - dependency discovery KW - data profiling KW - data matching KW - entity resolution KW - similarity measures Y1 - 2020 U6 - https://doi.org/10.1145/3392778 SN - 0362-5915 SN - 1557-4644 VL - 45 IS - 3 PB - Association for Computing Machinery CY - New York ER - TY - JOUR A1 - Schneider, Johannes A1 - Wenig, Phillip A1 - Papenbrock, Thorsten T1 - Distributed detection of sequential anomalies in univariate time series JF - The VLDB journal : the international journal on very large data bases N2 - The automated detection of sequential anomalies in time series is an essential task for many applications, such as the monitoring of technical systems, fraud detection in high-frequency trading, or the early detection of disease symptoms. All these applications require the detection to find all sequential anomalies possibly fast on potentially very large time series. In other words, the detection needs to be effective, efficient and scalable w.r.t. the input size. Series2Graph is an effective solution based on graph embeddings that are robust against re-occurring anomalies and can discover sequential anomalies of arbitrary length and works without training data. Yet, Series2Graph is no t scalable due to its single-threaded approach; it cannot, in particular, process arbitrarily large sequences due to the memory constraints of a single machine. In this paper, we propose our distributed anomaly detection system, short DADS, which is an efficient and scalable adaptation of Series2Graph. Based on the actor programming model, DADS distributes the input time sequence, intermediate state and the computation to all processors of a cluster in a way that minimizes communication costs and synchronization barriers. Our evaluation shows that DADS is orders of magnitude faster than S2G, scales almost linearly with the number of processors in the cluster and can process much larger input sequences due to its scale-out property. KW - Distributed programming KW - Sequential anomaly KW - Actor model KW - Data mining KW - Time series Y1 - 2021 U6 - https://doi.org/10.1007/s00778-021-00657-6 SN - 1066-8888 SN - 0949-877X VL - 30 IS - 4 SP - 579 EP - 602 PB - Springer CY - Berlin ER - TY - THES A1 - Papenbrock, Thorsten T1 - Data profiling - efficient discovery of dependencies T1 - Profilerstellung für Daten - Effiziente Entdeckung von Abhängigkeiten N2 - Data profiling is the computer science discipline of analyzing a given dataset for its metadata. The types of metadata range from basic statistics, such as tuple counts, column aggregations, and value distributions, to much more complex structures, in particular inclusion dependencies (INDs), unique column combinations (UCCs), and functional dependencies (FDs). If present, these statistics and structures serve to efficiently store, query, change, and understand the data. Most datasets, however, do not provide their metadata explicitly so that data scientists need to profile them. While basic statistics are relatively easy to calculate, more complex structures present difficult, mostly NP-complete discovery tasks; even with good domain knowledge, it is hardly possible to detect them manually. Therefore, various profiling algorithms have been developed to automate the discovery. None of them, however, can process datasets of typical real-world size, because their resource consumptions and/or execution times exceed effective limits. In this thesis, we propose novel profiling algorithms that automatically discover the three most popular types of complex metadata, namely INDs, UCCs, and FDs, which all describe different kinds of key dependencies. The task is to extract all valid occurrences from a given relational instance. The three algorithms build upon known techniques from related work and complement them with algorithmic paradigms, such as divide & conquer, hybrid search, progressivity, memory sensitivity, parallelization, and additional pruning to greatly improve upon current limitations. Our experiments show that the proposed algorithms are orders of magnitude faster than related work. They are, in particular, now able to process datasets of real-world, i.e., multiple gigabytes size with reasonable memory and time consumption. Due to the importance of data profiling in practice, industry has built various profiling tools to support data scientists in their quest for metadata. These tools provide good support for basic statistics and they are also able to validate individual dependencies, but they lack real discovery features even though some fundamental discovery techniques are known for more than 15 years. To close this gap, we developed Metanome, an extensible profiling platform that incorporates not only our own algorithms but also many further algorithms from other researchers. With Metanome, we make our research accessible to all data scientists and IT-professionals that are tasked with data profiling. Besides the actual metadata discovery, the platform also offers support for the ranking and visualization of metadata result sets. Being able to discover the entire set of syntactically valid metadata naturally introduces the subsequent task of extracting only the semantically meaningful parts. This is challenge, because the complete metadata results are surprisingly large (sometimes larger than the datasets itself) and judging their use case dependent semantic relevance is difficult. To show that the completeness of these metadata sets is extremely valuable for their usage, we finally exemplify the efficient processing and effective assessment of functional dependencies for the use case of schema normalization. N2 - Data Profiling ist eine Disziplin der Informatik, die sich mit der Analyse von Datensätzen auf deren Metadaten beschäftigt. Die verschiedenen Typen von Metadaten reichen von einfachen Statistiken wie Tupelzahlen, Spaltenaggregationen und Wertverteilungen bis hin zu weit komplexeren Strukturen, insbesondere Inklusionsabhängigkeiten (INDs), eindeutige Spaltenkombinationen (UCCs) und funktionale Abhängigkeiten (FDs). Diese Statistiken und Strukturen dienen, sofern vorhanden, dazu die Daten effizient zu speichern, zu lesen, zu ändern und zu verstehen. Die meisten Datensätze stellen ihre Metadaten aber nicht explizit zur Verfügung, so dass Informatiker sie mittels Data Profiling bestimmen müssen. Während einfache Statistiken noch relativ schnell zu berechnen sind, stellen die komplexen Strukturen schwere, zumeist NP-vollständige Entdeckungsaufgaben dar. Es ist daher auch mit gutem Domänenwissen in der Regel nicht möglich sie manuell zu entdecken. Aus diesem Grund wurden verschiedenste Profiling Algorithmen entwickelt, die die Entdeckung automatisieren. Keiner dieser Algorithmen kann allerdings Datensätze von heutzutage typischer Größe verarbeiten, weil entweder der Ressourcenverbrauch oder die Rechenzeit effektive Grenzen überschreiten. In dieser Arbeit stellen wir neuartige Profiling Algorithmen vor, die automatisch die drei populärsten Typen komplexer Metadaten entdecken können, nämlich INDs, UCCs, und FDs, die alle unterschiedliche Formen von Schlüssel-Abhängigkeiten beschreiben. Die Aufgabe dieser Algorithmen ist es alle gültigen Vorkommen der drei Metadaten-Typen aus einer gegebenen relationalen Instanz zu extrahieren. Sie nutzen dazu bekannte Entdeckungstechniken aus verwandten Arbeiten und ergänzen diese um algorithmische Paradigmen wie Teile-und-Herrsche, hybrides Suchen, Progressivität, Speichersensibilität, Parallelisierung und zusätzliche Streichungsregeln. Unsere Experimente zeigen, dass die vorgeschlagenen Algorithmen mit den neuen Techniken nicht nur um Größenordnungen schneller sind als alle verwandten Arbeiten, sie erweitern auch aktuelle Beschränkungen deutlich. Sie können insbesondere nun Datensätze realer Größe, d.h. mehrerer Gigabyte Größe mit vernünftigem Speicher- und Zeitverbrauch verarbeiten. Aufgrund der praktischen Relevanz von Data Profiling hat die Industrie verschiedene Profiling Werkzeuge entwickelt, die Informatiker in ihrer Suche nach Metadaten unterstützen sollen. Diese Werkzeuge bieten eine gute Unterstützung für die Berechnung einfacher Statistiken. Sie sind auch in der Lage einzelne Abhängigkeiten zu validieren, allerdings mangelt es ihnen an Funktionen zur echten Entdeckung von Metadaten, obwohl grundlegende Entdeckungstechniken schon mehr als 15 Jahre bekannt sind. Um diese Lücke zu schließen haben wir Metanome entwickelt, eine erweiterbare Profiling Plattform, die nicht nur unsere eigenen Algorithmen sondern auch viele weitere Algorithmen anderer Forscher integriert. Mit Metanome machen wir unsere Forschungsergebnisse für alle Informatiker und IT-Fachkräfte zugänglich, die ein modernes Data Profiling Werkzeug benötigen. Neben der tatsächlichen Metadaten-Entdeckung bietet die Plattform zusätzlich Unterstützung bei der Bewertung und Visualisierung gefundener Metadaten. Alle syntaktisch korrekten Metadaten effizient finden zu können führt natürlicherweise zur Folgeaufgabe daraus nur die semantisch bedeutsamen Teile zu extrahieren. Das ist eine Herausforderung, weil zum einen die Mengen der gefundenen Metadaten überraschenderweise groß sind (manchmal größer als der untersuchte Datensatz selbst) und zum anderen die Entscheidung über die Anwendungsfall-spezifische semantische Relevanz einzelner Metadaten-Aussagen schwierig ist. Um zu zeigen, dass die Vollständigkeit der Metadaten sehr wertvoll für ihre Nutzung ist, veranschaulichen wir die effiziente Verarbeitung und effektive Bewertung von funktionalen Abhängigkeiten am Anwendungsfall Schema Normalisierung. KW - data profiling KW - functional dependency KW - unique column combination KW - inclusion dependency KW - dependency KW - metanome KW - metadata KW - discovery KW - hybrid KW - divide-and-conquer KW - Profilerstellung für Daten KW - funktionale Abhängigkeit KW - eindeutige Spaltenkombination KW - Inklusionsabhängigkeit KW - Abhängigkeit KW - Metanome KW - Metadaten KW - Entdeckung KW - Hybrid KW - Teile und Herrsche Y1 - 2017 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-406705 ER - TY - BOOK A1 - Abedjan, Ziawasch A1 - Golab, Lukasz A1 - Naumann, Felix A1 - Papenbrock, Thorsten T1 - Data Profiling T3 - Synthesis lectures on data management, 52 Y1 - 2019 SN - 978-1-68173-446-0 PB - Morgan & Claypool Publishers CY - San Rafael ER - TY - JOUR A1 - Koßmann, Jan A1 - Papenbrock, Thorsten A1 - Naumann, Felix T1 - Data dependencies for query optimization BT - a survey JF - The VLDB journal : the international journal on very large data bases / publ. on behalf of the VLDB Endowment N2 - Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions. KW - Query optimization KW - Query execution KW - Data dependencies KW - Data profiling KW - Unique column combinations KW - Functional dependencies KW - Order dependencies KW - Inclusion dependencies KW - Relational data KW - SQL Y1 - 2021 U6 - https://doi.org/10.1007/s00778-021-00676-3 SN - 1066-8888 SN - 0949-877X VL - 31 IS - 1 SP - 1 EP - 22 PB - Springer CY - Berlin ; Heidelberg ; New York ER -