TY  - THES
A1  - Draisbach, Uwe
T1  - Efficient duplicate detection and the impact of transitivity
T1  - Effiziente Dublettenerkennung und der Einfluss von Transitivität
N2  - Duplicate detection describes the process of finding multiple representations of the same real-world entity in the absence of a unique identifier, and has many application areas, such as customer relationship management, genealogy and social sciences, or online shopping. Due to the increasing amount of data in recent years, the problem has become even more challenging on the one hand, but has led to a renaissance in duplicate detection research on the other hand.
This thesis examines the effects and opportunities of transitive relationships on the duplicate detection process. Transitivity implies that if record pairs ⟨ri,rj⟩ and ⟨rj,rk⟩ are classified as duplicates, then also record pair ⟨ri,rk⟩ has to be a duplicate. However, this reasoning might contradict with the pairwise classification, which is usually based on the similarity of objects. An essential property of similarity, in contrast to equivalence, is that similarity is not necessarily transitive.
First, we experimentally evaluate the effect of an increasing data volume on the threshold selection to classify whether a record pair is a duplicate or non-duplicate. Our experiments show that independently of the pair selection algorithm and the used similarity measure, selecting a suitable threshold becomes more difficult with an increasing number of records due to an increased probability of adding a false duplicate to an existing cluster. Thus, the best threshold changes with the dataset size, and a good threshold for a small (possibly sampled) dataset is not necessarily a good threshold for a larger (possibly complete) dataset. As data grows over time, earlier selected thresholds are no longer a suitable choice, and the problem becomes worse for datasets with larger clusters.
Second, we present with the Duplicate Count Strategy (DCS) and its enhancement DCS++ two alternatives to the standard Sorted Neighborhood Method (SNM) for the selection of candidate record pairs. DCS adapts SNMs window size based on the number of detected duplicates and DCS++ uses transitive dependencies to save complex comparisons for finding duplicates in larger clusters. We prove that with a proper (domain- and data-independent!) threshold, DCS++ is more efficient than SNM without loss of effectiveness.
Third, we tackle the problem of contradicting pairwise classifications. Usually, the transitive closure is used for pairwise classifications to obtain a transitively closed result set. However, the transitive closure disregards negative classifications. We present three new and several existing clustering algorithms and experimentally evaluate them on various datasets and under various algorithm configurations. The results show that the commonly used transitive closure is inferior to most other clustering algorithms, especially for the precision of results. In scenarios with larger clusters, our proposed EMCC algorithm is, together with Markov Clustering, the best performing clustering approach for duplicate detection, although its runtime is longer than Markov Clustering due to the subexponential time complexity. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
N2  - Dubletten sind mehrere Repräsentationen derselben Entität in einem Datenbestand. Diese zu identifizieren ist das Ziel der Dublettenerkennung, wobei in der Regel Paare von Datensätzen anhand von Ähnlichkeitsmaßen miteinander verglichen und unter Verwendung eines Schwellwerts als Dublette oder Nicht-Dublette klassifiziert werden. Für Dublettenerkennung existieren verschiedene Anwendungsbereiche, beispielsweise im Kundenbeziehungsmanagement, beim Onlineshopping, der Genealogie und in den Sozialwissenschaften. Der in den letzten Jahren zu beobachtende Anstieg des gespeicherten Datenvolumens erschwert die Dublettenerkennung, da die Anzahl der benötigten Vergleiche quadratisch mit der Anzahl der Datensätze wächst. Durch Verwendung eines geeigneten Paarauswahl-Algorithmus kann die Anzahl der zu vergleichenden Paare jedoch reduziert und somit die Effizienz gesteigert werden.
Die Dissertation untersucht die Auswirkungen und Möglichkeiten transitiver Beziehungen auf den Dublettenerkennungsprozess. Durch Transitivität lässt sich beispielsweise ableiten, dass aufgrund einer Klassifikation der Datensatzpaare ⟨ri,rj⟩ und ⟨rj,rk⟩ als Dublette auch die Datensätze ⟨ri,rk⟩ eine Dublette sind. Dies kann jedoch im Widerspruch zu einer paarweisen Klassifizierung stehen, denn im Unterschied zur Äquivalenz ist die Ähnlichkeit von Objekten nicht notwendigerweise transitiv.
Im ersten Teil der Dissertation wird die Auswirkung einer steigenden Datenmenge auf die Wahl des Schwellwerts zur Klassifikation von Datensatzpaaren als Dublette oder Nicht-Dublette untersucht. Die Experimente zeigen, dass unabhängig von dem gewählten Paarauswahl-Algorithmus und des gewählten Ähnlichkeitsmaßes die Wahl eines geeigneten Schwellwerts mit steigender Datensatzanzahl schwieriger wird, da die Gefahr fehlerhafter Cluster-Zuordnungen steigt. Der optimale Schwellwert eines Datensatzes variiert mit dessen Größe. So ist ein guter Schwellwert für einen kleinen Datensatz (oder eine Stichprobe) nicht notwendigerweise ein guter Schwellwert für einen größeren (ggf. vollständigen) Datensatz. Steigt die Datensatzgröße im Lauf der Zeit an, so muss ein einmal gewählter Schwellwert ggf. nachjustiert werden. Aufgrund der Transitivität ist dies insbesondere bei Datensätzen mit größeren Clustern relevant.
Der zweite Teil der Dissertation beschäftigt sich mit Algorithmen zur Auswahl geeigneter Datensatz-Paare für die Klassifikation. Basierend auf der Sorted Neighborhood Method (SNM) werden mit der Duplicate Count Strategy (DCS) und ihrer Erweiterung DCS++ zwei neue Algorithmen vorgestellt. DCS adaptiert die Fenstergröße in Abhängigkeit der Anzahl gefundener Dubletten und DCS++ verwendet zudem die transitive Abhängigkeit, um kostspielige Vergleiche einzusparen und trotzdem größere Cluster von Dubletten zu identifizieren. Weiterhin wird bewiesen, dass mit einem geeigneten Schwellwert DCS++ ohne Einbußen bei der Effektivität effizienter als die Sorted Neighborhood Method ist.
Der dritte und letzte Teil der Arbeit beschäftigt sich mit dem Problem widersprüchlicher paarweiser Klassifikationen. In vielen Anwendungsfällen wird die Transitive Hülle zur Erzeugung konsistenter Cluster verwendet, wobei hierbei paarweise Klassifikationen als Nicht-Dublette missachtet werden. Es werden drei neue und mehrere existierende Cluster-Algorithmen vorgestellt und experimentell mit verschiedenen Datensätzen und Konfigurationen evaluiert. Die Ergebnisse zeigen, dass die Transitive Hülle den meisten anderen Clustering-Algorithmen insbesondere bei der Precision, definiert als Anteil echter Dubletten an der Gesamtzahl klassifizierter Dubletten, unterlegen ist. In Anwendungsfällen mit größeren Clustern ist der vorgeschlagene EMCC-Algorithmus trotz seiner subexponentiellen Laufzeit zusammen mit dem Markov-Clustering der beste Clustering-Ansatz für die Dublettenerkennung. EMCC übertrifft Markov Clustering insbesondere hinsichtlich der Precision der Ergebnisse und hat zusätzlich den Vorteil, dass dieser auch ohne Ähnlichkeitswerte eingesetzt werden kann.
KW  - Datenqualität
KW  - Datenintegration
KW  - Dubletten
KW  - Duplikaterkennung
KW  - data quality
KW  - data integration
KW  - duplicate detection
KW  - deduplication
KW  - entity resolution
Y1  - 2022
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-572140
ER  - 
TY  - THES
A1  - Shaabani, Nuhad
T1  - On discovering and incrementally updating inclusion dependencies
N2  - In today's world, many applications produce large amounts of data at an enormous rate. Analyzing such datasets for metadata is indispensable for effectively understanding, storing, querying, manipulating, and mining them. Metadata summarizes technical properties of a dataset which rang from basic statistics to complex structures describing data dependencies. One type of dependencies is inclusion dependency (IND), which expresses subset-relationships between attributes of datasets. Therefore, inclusion dependencies are important for many data management applications in terms of data integration, query optimization, schema redesign, or integrity checking. So, the discovery of inclusion dependencies in unknown or legacy datasets is at the core of any data profiling effort.
	
For exhaustively detecting all INDs in large datasets, we developed S-indd++, a new algorithm that eliminates the shortcomings of existing IND-detection algorithms and significantly outperforms them. S-indd++ is based on a novel concept for the attribute clustering for efficiently deriving INDs. Inferring INDs from our attribute clustering eliminates all redundant operations caused by other algorithms. S-indd++ is also based on a novel partitioning strategy that enables discording a large number of candidates in early phases of the discovering process. Moreover, S-indd++ does not require to fit a partition into the main memory--this is a highly appreciable property in the face of ever-growing datasets. S-indd++ reduces up to 50% of the runtime of the state-of-the-art approach.
	
None of the approach for discovering INDs is appropriate for the application on dynamic datasets; they can not update the INDs after an update of the dataset without reprocessing it entirely. To this end, we developed the first approach for incrementally updating INDs in frequently changing datasets. We achieved that by reducing the problem of incrementally updating INDs to the incrementally updating the attribute clustering from which all INDs are efficiently derivable. We realized the update of the clusters by designing new operations to be applied to the clusters after every data update. The incremental update of INDs reduces the time of the complete rediscovery by up to 99.999%.   
	
All existing algorithms for discovering n-ary INDs are based on the principle of candidate generation--they generate candidates and test their validity in the given data instance. The major disadvantage of this technique is the exponentially growing number of database accesses in terms of SQL queries required for validation. We devised Mind2, the first approach for discovering n-ary INDs without candidate generation. Mind2 is based on a new mathematical framework developed in this thesis for computing the maximum INDs from which all other n-ary INDs are derivable. The experiments showed that Mind2 is significantly more scalable and effective than hypergraph-based algorithms.
N2  - Viele Anwendungen produzieren mit schnellem Tempo große Datenmengen. Die Profilierung solcher Datenmengen nach ihren Metadaten ist unabdingbar für ihre effektive Verwaltung und ihre Analyse. Metadaten fassen technische Eigenschaften einer Datenmenge zusammen, welche von einfachen Statistiken bis komplexe und Datenabhängigkeiten beschreibende Strukturen umfassen. Eine Form solcher Abhängigkeiten sind Inklusionsabhängigkeiten (INDs), die Teilmengenbeziehungen zwischen Attributen der Datenmengen ausdrücken. Dies macht INDs wichtig für viele Anwendungen wie Datenintegration, Anfragenoptimierung,  Schemaentwurf und Integritätsprüfung. Somit ist die Entdeckung von INDs in unbekannten Datenmengen eine zentrale Aufgabe der Datenprofilierung.  
	
Ich entwickelte einen neuen Algorithmus namens S-indd++ für die IND-Entdeckung in großen Datenmengen. S-indd++ beseitigt die Defizite existierender Algorithmen für die IND-Entdeckung und somit ist er performanter. S-indd++ berechnet INDs sehr effizient basierend auf einem neuen Clustering der Attribute.  S-indd++ wendet auch eine neue Partitionierungsmethode an, die das Verwerfen einer großen Anzahl von Kandidaten in früheren Phasen des Entdeckungsprozesses ermöglicht. Außerdem setzt S-indd++ nicht voraus, dass eine Datenpartition komplett in den Hauptspeicher passen muss. S-indd++ reduziert die Laufzeit der IND-Entdeckung um bis 50 %.

Keiner der IND-Entdeckungsalgorithmen ist geeignet für die Anwendung auf dynamischen Daten. Zu diesem Zweck entwickelte ich das erste Verfahren für das inkrementelle Update von INDs in häufig geänderten Daten. Ich erreichte dies bei der Reduzierung des Problems des inkrementellen Updates von INDs auf dem inkrementellen Update des Attribute-Clustering, von dem INDs effizient ableitbar sind. Ich realisierte das Update der Cluster beim Entwurf von neuen Operationen, die auf den Clustern nach jedem Update der Daten angewendet werden. Das inkrementelle Update von INDs reduziert die Zeit der statischen IND-Entdeckung um bis 99,999 %.
	
Alle vorhandenen Algorithmen für die n-ary-IND-Entdeckung basieren auf dem Prinzip der Kandidatengenerierung. Der Hauptnachteil dieser Methode ist die exponentiell wachsende Anzahl der SQL-Anfragen, die für die Validierung der Kandidaten nötig sind. Zu diesem Zweck entwickelte ich Mind2, den ersten Algorithmus für n-ary-IND-Entdeckung ohne Kandidatengenerierung. Mind2 basiert auf  einem neuen mathematischen Framework für die Berechnung der maximalen INDs, von denen alle anderen n-ary-INDs ableitbar sind. Die Experimente zeigten, dass Mind2 wesentlich skalierbarer und leistungsfähiger ist als die auf Hypergraphen basierenden Algorithmen.
T2  - Beitrag zur Entdeckung und inkrementellen Aktualisierung von Inklusionsabhängigkeiten
KW  - Inclusion Dependency
KW  - Data Profiling
KW  - Data Mining
KW  - Algorithms
KW  - Inclusion Dependency Discovery
KW  - Incrementally Inclusion Dependencies Discovery
KW  - Metadata Discovery
KW  - S-indd++
KW  - Mind2
KW  - Change Data Capture
KW  - Incremental Discovery
KW  - Big Data
KW  - Data Integration
KW  - Foreign Keys
KW  - Dynamic Data
KW  - Foreign Keys Discovery
KW  - Data Profiling
KW  - Data Mining
KW  - Algorithmen
KW  - Inklusionsabhängigkeiten
KW  - Inklusionsabhängigkeiten Entdeckung
KW  - Datenintegration
KW  - Metadaten Entdeckung
Y1  - 2020
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-471862
ER  - 
TY  - THES
A1  - Robaina Estevez, Semidan
T1  - Context-specific metabolic predictions
T1  - Kontextspezifische metabolische Vorhersagen
BT  - computational methods and applications
BT  - Berechnungsmethoden und Anwendungen
N2  - All life-sustaining processes are ultimately driven by thousands of biochemical reactions occurring in the cells: the metabolism. These reactions form an intricate network which produces all required chemical compounds, i.e., metabolites, from a set of input molecules. Cells regulate the activity through metabolic reactions in a context-specific way; only reactions that are required in a cellular context, e.g., cell type, developmental stage or environmental condition, are usually active, while the rest remain inactive. The context-specificity of metabolism can be captured by several kinds of experimental data, such as by gene and protein expression or metabolite profiles. In addition, these context-specific data can be assimilated into computational models of metabolism, which then provide context-specific metabolic predictions. 
This thesis is composed of three individual studies focussing on context-specific experimental data integration into computational models of metabolism. The first study presents an optimization-based method to obtain context-specific metabolic predictions, and offers the advantage of being fully automated, i.e., free of user defined parameters. The second study explores the effects of alternative optimal solutions arising during the generation of context-specific metabolic predictions. These alternative optimal solutions are metabolic model predictions that represent equally well the integrated data, but that can markedly differ. This study proposes algorithms to analyze the space of alternative solutions, as well as some ways to cope with their impact in the predictions. 
Finally, the third study investigates the metabolic specialization of the guard cells of the plant Arabidopsis thaliana, and compares it with that of a different cell type, the mesophyll cells. To this end, the computational methods developed in this thesis are applied to obtain metabolic predictions specific to guard cell and mesophyll cells. These cell-specific predictions are then compared to explore the differences in metabolic activity between the two cell types. In addition, the effects of alternative optima are taken into consideration when comparing the two cell types. The computational results indicate a major reorganization of the primary metabolism in guard cells. These results are supported by an independent 13C labelling experiment.
N2  - Alle lebenserhaltenden Prozesse werden durch tausende biochemische Reaktionen in der Zelle bestimmt, welche den Metabolismus charakterisieren. Diese Reaktionen bilden ein komplexes Netzwerk, welches alle notwendigen chemischen Verbindungen, die sogenannten Metabolite, aus einer bestimmten Menge an Ausgangsmolekülen produziert Zellen regulieren ihren Stoffwechsel kontextspezifisch, dies bedeutet, dass nur Reaktionen die in einem zellulären Kontext, zum Beispiel Zelltyp, Entwicklungsstadium oder verschiedenen Umwelteinflüssen, benötigt werden auch tatsächlich aktiv sind. Die übrigen Reaktionen werden als inaktiv betrachtet. Die Kontextspezifität des Metabolismus kann durch verschiedene experimentelle Daten, wie Gen- und Proteinexpressionen oder Metabolitprofile erfasst werden. Zusätzlich können diese Daten in Computersimulationen des Metabolismus integriert werden, um kontextspezifische (metabolische) Vorhersagen zu treffen.
Diese Doktorarbeit besteht aus drei unabhängigen Studien, welche die Integration von kontextspezifischen experimentellen Daten in Computersimulationen des Metabolismus thematisieren. Die erste Studie beschreibt ein Konzept, basierend auf einem mathematischen Optimierungsproblem, welches es erlaubt kontextspezifische, metabolische Vorhersagen zu treffen. Dabei bietet diese vollautomatische Methode den Vorteil vom Nutzer unabhängige Parameter, zu verwenden. Die zweite Studie untersucht den Einfluss von alternativen optimalen Lösungen, welche bei kontextspezifischen metabolischen Vorhersagen generiert werden. Diese alternativen Lösungen stellen metabolische Modellvorhersagen da, welche die integrierten Daten gleichgut wiederspiegeln, sich aber grundlegend voneinander unterscheiden können. Diese Studie zeigt verschiedene Ansätze alternativen Lösungen zu analysieren und ihren Einfluss auf die Vorhersagen zu berücksichtigen. 
Schlussendlich, untersucht die dritte Studie die metabolische Spezialisierung der Schließzellen in Arabidopsis thaliana und vergleicht diese mit einer weiteren Zellart, den Mesophyllzellen. Zu diesem Zweck wurden die in dieser Doktorarbeit vorgestellten Methoden angewandt um metabolische Vorhersagen speziell für Schließzellen und Mesophyllzellen zu erhalten. Anschließend wurden die zellspezifischen Vorhersagen  auf Unterschiede in der metabolischen Aktivität der Zelltypen, unter Berücksichtigung des Effekt von alternativen Optima, untersucht. Die Ergebnisse der Simulationen legen eine grundlegende Neuorganisation des Primärmetabolismus in Schließzellen verglichen mit Mesophyllzellen nahe. Diese Ergebnisse werden durch unabhängige 13C  markierungs Experimente bestätigt.
KW  - systems biology
KW  - bioinformatics
KW  - metabolic networks
KW  - constraint-based modeling
KW  - data integration
KW  - Systemsbiologie
KW  - Bioinformatik
KW  - Stoffwechselnetze
KW  - Constraint-basierte Modellierung
KW  - Datenintegration
Y1  - 2017
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-401365
ER  - 
TY  - THES
A1  - Heise, Arvid
T1  - Data cleansing and integration operators for a parallel data analytics platform
T1  - Datenreinigungs- und Integrationsoperatoren für ein
paralles Datenanalyseframework
N2  - The data quality of real-world datasets need to be constantly monitored and maintained to allow organizations and individuals to reliably use their data. Especially, data integration projects suffer from poor initial data quality and as a consequence consume more effort and money. Commercial products and research prototypes for data cleansing and integration help users to improve the quality of individual and combined datasets. They can be divided into either standalone systems or database management system (DBMS) extensions. On the one hand, standalone systems do not interact well with DBMS and require time-consuming data imports and exports. On the other hand, DBMS extensions are often limited by the underlying system and do not cover the full set of data cleansing and integration tasks.

We overcome both limitations by implementing a concise set of five data cleansing and integration operators on the parallel data analytics platform Stratosphere. We define the semantics of the operators, present their parallel implementation, and devise optimization techniques for individual operators and combinations thereof. Users specify declarative queries in our query language METEOR with our new operators to improve the data quality of individual datasets or integrate them to larger datasets. By integrating the data cleansing operators into the higher level language layer of Stratosphere, users can easily combine cleansing operators with operators from other domains, such as information extraction, to complex data flows. Through a generic description of the operators, the Stratosphere optimizer reorders operators even from different domains to find better query plans.

As a case study, we reimplemented a part of the large Open Government Data integration project GovWILD with our new operators and show that our queries run significantly faster than the original GovWILD queries, which rely on relational operators. Evaluation reveals that our operators exhibit good scalability on up to 100 cores, so that even larger inputs can be efficiently processed by scaling out to more machines. Finally, our scripts are considerably shorter than the original GovWILD scripts, which results in better maintainability of the scripts.
N2  - Die Datenqualität von Realweltdaten muss ständig überwacht und gewartet werden, damit Organisationen und Individuen ihre Daten verlässlich nutzen können. Besonders Datenintegrationsprojekte leiden unter schlechter Datenqualität in den Quelldaten und benötigen somit mehr Zeit und Geld. Kommerzielle Produkte und Forschungsprototypen helfen Nutzern die Qualität in einzelnen und kombinierten Datensätzen zu verbessern. Die Systeme können in selbständige Systeme und Erweiterungen von bestehenden Datenbankmanagementsystemen (DBMS) unterteilt werden. Auf der einen Seite interagieren selbständige Systeme nicht gut mit DBMS und brauchen zeitaufwändigen Datenimport und -export. Auf der anderen Seite sind die DBMS Erweiterungen häufig durch das unterliegende System limitiert und unterstützen nicht die gesamte Bandbreite an Datenreinigungs- und -integrationsaufgaben.

Wir überwinden beide Limitationen, indem wir eine Menge von häufig benötigten Datenreinigungs- und Datenintegrationsoperatoren direkt in der parallelen Datenanalyseplattform Stratosphere implementieren. Wir definieren die Semantik der Operatoren, präsentieren deren parallele Implementierung und entwickeln Optimierungstechniken für die einzelnen und mehrere Operatoren. Nutzer können deklarative Anfragen in unserer Anfragesprache METEOR mit unseren neuen Operatoren formulieren, um die Datenqualität von einzelnen Datensätzen zu erhöhen, oder um sie zu größeren Datensätzen zu integrieren. Durch die Integration der Operatoren in die Hochsprachenschicht von Stratosphere können Nutzer Datenreinigungsoperatoren einfach mit Operatoren aus anderen Domänen wie Informationsextraktion zu komplexen Datenflüssen kombinieren. Da Stratosphere Operatoren durch generische Beschreibungen in den Optimierer integriert werden, ist es für den Optimierer sogar möglich Operatoren unterschiedlicher Domänen zu vertauschen, um besseren Anfrageplänen zu ermitteln.

Für eine Fallstudie haben wir Teile des großen Datenintegrationsprojektes GovWILD auf Stratosphere mit den neuen Operatoren nachimplementiert und zeigen, dass unsere Anfragen signifikant schneller laufen als die originalen GovWILD Anfragen, die sich auf relationale Operatoren verlassen. Die Evaluation zeigt, dass unsere Operatoren gut auf bis zu 100 Kernen skalieren, sodass sogar größere Datensätze effizient verarbeitet werden können, indem die Anfragen auf mehr Maschinen ausgeführt werden. Schließlich sind unsere Skripte erheblich kürzer als die originalen GovWILD Skripte, was in besserer Wartbarkeit unserer Skripte resultiert.
KW  - data
KW  - cleansing
KW  - holistic
KW  - parallel
KW  - map reduce
KW  - Datenreinigung
KW  - Datenintegration
KW  - ganzheitlich
KW  - parallel
KW  - map reduce
Y1  - 2014
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-77100
ER  - 
TY  - THES
A1  - Bauckmann, Jana
T1  - Dependency discovery for data integration
T1  - Erkennen von Datenabhängigkeiten zur Datenintegration
N2  - Data integration aims to combine data of different sources and to provide users with a unified view on these data. This task is as challenging as valuable. In this thesis we propose algorithms for dependency discovery to provide necessary information for data integration. We focus on inclusion dependencies (INDs) in general and a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas. An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task is the complexity of testing all attribute pairs and further of comparing all of each attribute pair's values. The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm enables to profile entirely unknown data sources with large schemas by discovering all INDs. Further, we provide an approach to extract foreign keys from the identified INDs. We extend our IND discovery algorithm to also find three special types of INDs: (i) Composite INDs, such as “AB in CD”, (ii) approximate INDs that allow a certain amount of values of A to be not included in B, and (iii) prefix and suffix INDs that represent special cross-references between schemas. Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes. Only the matching part of the instance must adhere the dependency. We generalize the definition of CINDs distinguishing covering and completeness conditions and define quality measures for conditions. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. The challenge for this task is twofold: (i) Which (and how many) attributes should be used for the conditions? (ii) Which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions applying to quality thresholds of 100%. Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for linked open data. We show the efficiency and the benefits of our approaches for use cases in these domains.
N2  - Datenintegration hat das Ziel, Daten aus unterschiedlichen Quellen zu kombinieren und Nutzern eine einheitliche Sicht auf diese Daten zur Verfügung zu stellen. Diese Aufgabe ist gleichermaßen anspruchsvoll wie wertvoll. In dieser Dissertation werden Algorithmen zum Erkennen von Datenabhängigkeiten vorgestellt, die notwendige Informationen zur Datenintegration liefern. Der Schwerpunkt dieser Arbeit liegt auf Inklusionsabhängigkeiten (inclusion dependency, IND) im Allgemeinen und auf der speziellen Form der Bedingten Inklusionsabhängigkeiten (conditional inclusion dependency, CIND): (i) INDs ermöglichen das Finden von Strukturen in einem gegebenen Schema. (ii) INDs und CINDs unterstützen das Finden von Referenzen zwischen Datenquellen. Eine IND „A in B“ besagt, dass alle Werte des Attributs A in der Menge der Werte des Attributs B enthalten sind. Diese Arbeit liefert einen Algorithmus, der alle INDs in einer relationalen Datenquelle erkennt. Die Herausforderung dieser Aufgabe liegt in der Komplexität alle Attributpaare zu testen und dabei alle Werte dieser Attributpaare zu vergleichen. Die Komplexität bestehender Ansätze ist abhängig von der Anzahl der Attributpaare während der hier vorgestellte Ansatz lediglich von der Anzahl der Attribute abhängt. Damit ermöglicht der vorgestellte Algorithmus unbekannte Datenquellen mit großen Schemata zu untersuchen. Darüber hinaus wird der Algorithmus erweitert, um drei spezielle Formen von INDs zu finden, und ein Ansatz vorgestellt, der Fremdschlüssel aus den erkannten INDs filtert. Bedingte Inklusionsabhängigkeiten (CINDs) sind Inklusionsabhängigkeiten deren Geltungsbereich durch Bedingungen über bestimmten Attributen beschränkt ist. Nur der zutreffende Teil der Instanz muss der Inklusionsabhängigkeit genügen. Die Definition für CINDs wird in der vorliegenden Arbeit generalisiert durch die Unterscheidung von überdeckenden und vollständigen Bedingungen. Ferner werden Qualitätsmaße für Bedingungen definiert. Es werden effiziente Algorithmen vorgestellt, die überdeckende und vollständige Bedingungen mit gegebenen Qualitätsmaßen auffinden. Dabei erfolgt die Auswahl der verwendeten Attribute und Attributkombinationen sowie der Attributwerte automatisch. Bestehende Ansätze beruhen auf einer Vorauswahl von Attributen für die Bedingungen oder erkennen nur Bedingungen mit Schwellwerten von 100% für die Qualitätsmaße. Die Ansätze der vorliegenden Arbeit wurden durch zwei Anwendungsbereiche motiviert: Datenintegration in den Life Sciences und das Erkennen von Links in Linked Open Data. Die Effizienz und der Nutzen der vorgestellten Ansätze werden anhand von Anwendungsfällen in diesen Bereichen aufgezeigt.
KW  - Datenabhängigkeiten-Entdeckung
KW  - Datenintegration
KW  - Schema-Entdeckung
KW  - Link-Entdeckung
KW  - Inklusionsabhängigkeit
KW  - dependency discovery
KW  - data integration
KW  - schema discovery
KW  - link discovery
KW  - inclusion dependency
Y1  - 2013
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-66645
ER  - 
TY  - BOOK
A1  - Albrecht, Alexander
A1  - Naumann, Felix
T1  - Understanding cryptic schemata in large extract-transform-load systems
N2  - Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems requires much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and ETL transformations. Hard-to-understand attribute labels lead to frustration and time spent to develop and understand ETL workflows. We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels consisting of a number of unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which we can decrypt to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that our approach is able to suggest high-quality decryptions for cryptic attribute labels in a given schema.
N2  - Extract-Transform-Load (ETL) Tools werden häufig beim Erstellen, der Wartung und der Weiterentwicklung von Data Warehouses, Data Marts und operationalen Datenbanken verwendet. ETL Workflows befüllen diese Systeme mit Daten aus vielen unterschiedlichen Quellsystemen. Ein ETL Workflow besteht aus mehreren Transformationsschritten, die einen DAG-strukturierter Graphen bilden. Mit der Zeit entstehen hunderte individueller ETL Workflows, da neue Datenquellen integriert oder neue Anforderungen umgesetzt werden müssen. Die Wartung und Weiterentwicklung von großen ETL Systemen benötigt viel Zeit und manuelle Arbeit. Ein zentrales Problem ist dabei das Verständnis unbekannter Attributnamen in Quell- und Zieldatenbanken und ETL Transformationen. Schwer verständliche Attributnamen führen zu Frustration und hohen Zeitaufwänden bei der Entwicklung und dem Verständnis von ETL Workflows. Wir präsentieren eine Schema Decryption Technik, die ETL Entwicklern das Verständnis kryptischer Schemata in Quell- und Zieldatenbanken und ETL Transformationen erleichtert. Unser Ansatz berücksichtigt für ein gegebenes ETL System die Vielzahl verknüpfter Attributnamen in den existierenden ETL Workflows. So werden gute und aussagekräftige "Decryptions" gefunden und wir sind in der Lage Attributnamen, die aus unbekannten Abkürzungen bestehen, zu "decrypten". So wird z.B. für den Attributenamen UNP_PEN_INT als Decryption UNPAIN_PENALTY_INTEREST vorgeschlagen. Unser Schema Decryption Ansatz wurde für drei ETL-Repositories evaluiert und es zeigte sich, dass unser Ansatz qualitativ hochwertige Decryptions für kryptische Attributnamen vorschlägt.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 60 
KW  - Extract-Transform-Load (ETL)
KW  - Data Warehouse
KW  - Datenintegration
KW  - Extract-Transform-Load (ETL)
KW  - Data Warehouse
KW  - Data Integration
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-61257
SN  - 978-3-86956-201-8
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Draisbach, Uwe
A1  - Naumann, Felix
A1  - Szott, Sascha
A1  - Wonneberg, Oliver
T1  - Adaptive windows for duplicate detection
N2  - Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity, respectively. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records and (ii) data sets might have a high volume making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition of such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaption strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
N2  - Duplikaterkennung beschreibt das Auffinden von mehreren Datensätzen, die das gleiche Realwelt-Objekt repräsentieren. Diese Aufgabe ist nicht trivial, da sich (i) die Datensätze geringfügig unterscheiden können, so dass Ähnlichkeitsmaße für einen paarweisen Vergleich benötigt werden, und (ii) aufgrund der Datenmenge ein vollständiger, paarweiser Vergleich nicht möglich ist. Zur Lösung des zweiten Problems existieren verschiedene Algorithmen, die die Datenmenge partitionieren und nur noch innerhalb der Partitionen Vergleiche durchführen. Einer dieser Algorithmen ist die Sorted-Neighborhood-Methode (SNM), welche Daten anhand eines Schlüssels sortiert und dann ein Fenster über die sortierten Daten schiebt. Vergleiche werden nur innerhalb dieses Fensters durchgeführt. Wir beschreiben verschiedene Variationen der Sorted-Neighborhood-Methode, die auf variierenden Fenstergrößen basieren. Diese Ansätze basieren auf der Intuition, dass Bereiche mit größerer und geringerer Ähnlichkeiten innerhalb der sortierten Datensätze existieren, für die entsprechend größere bzw. kleinere Fenstergrößen sinnvoll sind. Wir beschreiben und evaluieren verschiedene Adaptierungs-Strategien, von denen nachweislich einige bezüglich Effizienz besser sind als die originale Sorted-Neighborhood-Methode (gleiches Ergebnis bei weniger Vergleichen).
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 49 
KW  - Informationssysteme
KW  - Datenqualität
KW  - Datenintegration
KW  - Duplikaterkennung
KW  - Duplicate Detection
KW  - Data Quality
KW  - Data Integration
KW  - Information Systems
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-53007
SN  - 978-3-86956-143-1
SN  - 1613-5652
SN  - 2191-1665
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Bauckmann, Jana
A1  - Leser, Ulf
A1  - Naumann, Felix
T1  - Efficient and exact computation of inclusion dependencies for data integration
N2  - Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult, because in principle for each pair of attributes in the database each pair of data values must be compared. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. We present with Spider an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics to reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, and partial INDs, which are true INDs for all but a certain number of values. This last type is particularly relevant when integrating dirty data as is often the case in the life sciences domain - our driving motivation.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 34 
KW  - Metadatenentdeckung
KW  - Metadatenqualität
KW  - Schemaentdeckung
KW  - Datenanalyse
KW  - Datenintegration
KW  - metadata discovery
KW  - metadata quality
KW  - schema discovery
KW  - data profiling
KW  - data integration
Y1  - 2010
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-41396
SN  - 978-3-86956-048-9
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  -