TY  - THES
A1  - Draisbach, Uwe
T1  - Efficient duplicate detection and the impact of transitivity
T1  - Effiziente Dublettenerkennung und der Einfluss von Transitivität
N2  - Duplicate detection describes the process of finding multiple representations of the same real-world entity in the absence of a unique identifier, and has many application areas, such as customer relationship management, genealogy and social sciences, or online shopping. Due to the increasing amount of data in recent years, the problem has become even more challenging on the one hand, but has led to a renaissance in duplicate detection research on the other hand.
This thesis examines the effects and opportunities of transitive relationships on the duplicate detection process. Transitivity implies that if record pairs ⟨ri,rj⟩ and ⟨rj,rk⟩ are classified as duplicates, then record pair ⟨ri,rk⟩ must also be a duplicate. However, this reasoning might contradict the pairwise classification, which is usually based on the similarity of objects. An essential property of similarity, in contrast to equivalence, is that similarity is not necessarily transitive.
First, we experimentally evaluate the effect of an increasing data volume on the threshold selection to classify whether a record pair is a duplicate or non-duplicate. Our experiments show that, independently of the pair selection algorithm and the similarity measure used, selecting a suitable threshold becomes more difficult with an increasing number of records due to an increased probability of adding a false duplicate to an existing cluster. Thus, the best threshold changes with the dataset size, and a good threshold for a small (possibly sampled) dataset is not necessarily a good threshold for a larger (possibly complete) dataset. As data grows over time, earlier selected thresholds are no longer a suitable choice, and the problem becomes worse for datasets with larger clusters.
Second, we present the Duplicate Count Strategy (DCS) and its enhancement DCS++ as two alternatives to the standard Sorted Neighborhood Method (SNM) for the selection of candidate record pairs. DCS adapts SNM's window size based on the number of detected duplicates, and DCS++ uses transitive dependencies to save complex comparisons for finding duplicates in larger clusters. We prove that with a proper (domain- and data-independent!) threshold, DCS++ is more efficient than SNM without loss of effectiveness.
Third, we tackle the problem of contradicting pairwise classifications. Usually, the transitive closure is used for pairwise classifications to obtain a transitively closed result set. However, the transitive closure disregards negative classifications. We present three new and several existing clustering algorithms and experimentally evaluate them on various datasets and under various algorithm configurations. The results show that the commonly used transitive closure is inferior to most other clustering algorithms, especially for the precision of results. In scenarios with larger clusters, our proposed EMCC algorithm is, together with Markov Clustering, the best performing clustering approach for duplicate detection, although its runtime is longer than Markov Clustering due to the subexponential time complexity. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
N2  - Dubletten sind mehrere Repräsentationen derselben Entität in einem Datenbestand. Diese zu identifizieren ist das Ziel der Dublettenerkennung, wobei in der Regel Paare von Datensätzen anhand von Ähnlichkeitsmaßen miteinander verglichen und unter Verwendung eines Schwellwerts als Dublette oder Nicht-Dublette klassifiziert werden. Für Dublettenerkennung existieren verschiedene Anwendungsbereiche, beispielsweise im Kundenbeziehungsmanagement, beim Onlineshopping, der Genealogie und in den Sozialwissenschaften. Der in den letzten Jahren zu beobachtende Anstieg des gespeicherten Datenvolumens erschwert die Dublettenerkennung, da die Anzahl der benötigten Vergleiche quadratisch mit der Anzahl der Datensätze wächst. Durch Verwendung eines geeigneten Paarauswahl-Algorithmus kann die Anzahl der zu vergleichenden Paare jedoch reduziert und somit die Effizienz gesteigert werden.
Die Dissertation untersucht die Auswirkungen und Möglichkeiten transitiver Beziehungen auf den Dublettenerkennungsprozess. Durch Transitivität lässt sich beispielsweise ableiten, dass aufgrund einer Klassifikation der Datensatzpaare ⟨ri,rj⟩ und ⟨rj,rk⟩ als Dublette auch die Datensätze ⟨ri,rk⟩ eine Dublette sind. Dies kann jedoch im Widerspruch zu einer paarweisen Klassifizierung stehen, denn im Unterschied zur Äquivalenz ist die Ähnlichkeit von Objekten nicht notwendigerweise transitiv.
Im ersten Teil der Dissertation wird die Auswirkung einer steigenden Datenmenge auf die Wahl des Schwellwerts zur Klassifikation von Datensatzpaaren als Dublette oder Nicht-Dublette untersucht. Die Experimente zeigen, dass unabhängig von dem gewählten Paarauswahl-Algorithmus und dem gewählten Ähnlichkeitsmaß die Wahl eines geeigneten Schwellwerts mit steigender Datensatzanzahl schwieriger wird, da die Gefahr fehlerhafter Cluster-Zuordnungen steigt. Der optimale Schwellwert eines Datensatzes variiert mit dessen Größe. So ist ein guter Schwellwert für einen kleinen Datensatz (oder eine Stichprobe) nicht notwendigerweise ein guter Schwellwert für einen größeren (ggf. vollständigen) Datensatz. Steigt die Datensatzgröße im Lauf der Zeit an, so muss ein einmal gewählter Schwellwert ggf. nachjustiert werden. Aufgrund der Transitivität ist dies insbesondere bei Datensätzen mit größeren Clustern relevant.
Der zweite Teil der Dissertation beschäftigt sich mit Algorithmen zur Auswahl geeigneter Datensatz-Paare für die Klassifikation. Basierend auf der Sorted Neighborhood Method (SNM) werden mit der Duplicate Count Strategy (DCS) und ihrer Erweiterung DCS++ zwei neue Algorithmen vorgestellt. DCS adaptiert die Fenstergröße in Abhängigkeit der Anzahl gefundener Dubletten und DCS++ verwendet zudem die transitive Abhängigkeit, um kostspielige Vergleiche einzusparen und trotzdem größere Cluster von Dubletten zu identifizieren. Weiterhin wird bewiesen, dass mit einem geeigneten Schwellwert DCS++ ohne Einbußen bei der Effektivität effizienter als die Sorted Neighborhood Method ist.
Der dritte und letzte Teil der Arbeit beschäftigt sich mit dem Problem widersprüchlicher paarweiser Klassifikationen. In vielen Anwendungsfällen wird die Transitive Hülle zur Erzeugung konsistenter Cluster verwendet, wobei hierbei paarweise Klassifikationen als Nicht-Dublette missachtet werden. Es werden drei neue und mehrere existierende Cluster-Algorithmen vorgestellt und experimentell mit verschiedenen Datensätzen und Konfigurationen evaluiert. Die Ergebnisse zeigen, dass die Transitive Hülle den meisten anderen Clustering-Algorithmen insbesondere bei der Precision, definiert als Anteil echter Dubletten an der Gesamtzahl klassifizierter Dubletten, unterlegen ist. In Anwendungsfällen mit größeren Clustern ist der vorgeschlagene EMCC-Algorithmus trotz seiner subexponentiellen Laufzeit zusammen mit dem Markov-Clustering der beste Clustering-Ansatz für die Dublettenerkennung. EMCC übertrifft Markov Clustering insbesondere hinsichtlich der Precision der Ergebnisse und hat zusätzlich den Vorteil, dass dieser auch ohne Ähnlichkeitswerte eingesetzt werden kann.
KW  - Datenqualität
KW  - Datenintegration
KW  - Dubletten
KW  - Duplikaterkennung
KW  - data quality
KW  - data integration
KW  - duplicate detection
KW  - deduplication
KW  - entity resolution
Y1  - 2022
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-572140
ER  - 
TY  - JOUR
A1  - Borchert, Florian
A1  - Mock, Andreas
A1  - Tomczak, Aurelie
A1  - Hügel, Jonas
A1  - Alkarkoukly, Samer
A1  - Knurr, Alexander
A1  - Volckmar, Anna-Lena
A1  - Stenzinger, Albrecht
A1  - Schirmacher, Peter
A1  - Debus, Jürgen
A1  - Jäger, Dirk
A1  - Longerich, Thomas
A1  - Fröhling, Stefan
A1  - Eils, Roland
A1  - Bougatf, Nina
A1  - Sax, Ulrich
A1  - Schapranow, Matthieu-Patrick
T1  - Knowledge bases and software support for variant interpretation in precision oncology
JF  - Briefings in bioinformatics
N2  - Precision oncology is a rapidly evolving interdisciplinary medical specialty. Comprehensive cancer panels are becoming increasingly available at pathology departments worldwide, creating the urgent need for scalable cancer variant annotation and molecularly informed treatment recommendations. A wealth of mainly academia-driven knowledge bases calls for software tools supporting the multi-step diagnostic process. We derive a comprehensive list of knowledge bases relevant for variant interpretation by a review of existing literature followed by a survey among medical experts from university hospitals in Germany. In addition, we review cancer variant interpretation tools, which integrate multiple knowledge bases. We categorize the knowledge bases along the diagnostic process in precision oncology and analyze programmatic access options as well as the integration of knowledge bases into software tools. The most commonly used knowledge bases provide good programmatic access options and have been integrated into a range of software tools. For the wider set of knowledge bases, access options vary across different parts of the diagnostic process. Programmatic access is limited for information regarding clinical classifications of variants and for therapy recommendations. The main issue for databases used for biological classification of pathogenic variants and pathway context information is the lack of standardized interfaces. There is no single cancer variant interpretation tool that integrates all identified knowledge bases. Specialized tools are available and need to be further developed for different steps in the diagnostic process.
KW  - HiGHmed
KW  - personalized medicine
KW  - molecular tumor board
KW  - data integration
KW  - cancer therapy
Y1  - 2021
U6  - https://doi.org/10.1093/bib/bbab134
SN  - 1467-5463
SN  - 1477-4054
VL  - 22
IS  - 6
PB  - Oxford Univ. Press
CY  - Oxford
ER  - 
TY  - JOUR
A1  - Kaitoua, Abdulrahman
A1  - Rabl, Tilmann
A1  - Markl, Volker
T1  - A distributed data exchange engine for polystores
JF  - Information technology : methods and applications of informatics and information technology
JF  - Information technology : Methoden und innovative Anwendungen der Informatik und Informationstechnik
N2  - There is an increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generating new information and creating services of higher quality. A central issue in working with heterogeneous sources is data migration: In order to share and process data in different engines, resource-intensive and complex movements and transformations between computing engines, services, and stores are necessary.
Muses is a distributed, high-performance data migration engine that is able to interconnect distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines' instances in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall improvement of 30 % for pipelining jobs across multiple engines, even when we count the overhead of Muses in the execution time. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.
KW  - distributed systems
KW  - data migration
KW  - data transformation
KW  - big data
KW  - engine
KW  - data integration
Y1  - 2020
U6  - https://doi.org/10.1515/itit-2019-0037
SN  - 1611-2776
SN  - 2196-7032
VL  - 62
IS  - 3-4
SP  - 145
EP  - 156
PB  - De Gruyter
CY  - Berlin
ER  - 
TY  - GEN
A1  - Sukmana, Muhammad Ihsan Haikal
A1  - Torkura, Kennedy A.
A1  - Cheng, Feng
A1  - Meinel, Christoph
A1  - Graupner, Hendrik
T1  - Unified logging system for monitoring multiple cloud storage providers in cloud storage broker
T2  - 32nd International Conference on Information Networking (ICOIN)
N2  - With the increasing demand for personal and enterprise data storage services, a Cloud Storage Broker (CSB) provides cloud storage service using multiple Cloud Service Providers (CSPs) with guaranteed Quality of Service (QoS), such as data availability and security. However, monitoring cloud storage usage across multiple CSPs has become a challenge for the CSB due to the lack of a standardized logging format for cloud services, which causes each CSP to implement its own format. In this paper, we propose a unified logging system that can be used by a CSB to monitor cloud storage usage across multiple CSPs. We gather cloud storage log files from three different CSPs and normalise these into our proposed log format, which can be used for further analysis. We show that our work enables a coherent view suitable for data navigation, monitoring, and analytics.
KW  - Unified logging system
KW  - Cloud Service Provider
KW  - cloud monitoring
KW  - data integration
KW  - security analytics
Y1  - 2018
SN  - 978-1-5386-2290-2
U6  - https://doi.org/10.1109/ICOIN.2018.8343081
SP  - 44
EP  - 49
PB  - IEEE
CY  - New York
ER  - 
TY  - THES
A1  - Robaina Estevez, Semidan
T1  - Context-specific metabolic predictions
T1  - Kontextspezifische metabolische Vorhersagen
BT  - computational methods and applications
BT  - Berechnungsmethoden und Anwendungen
N2  - All life-sustaining processes are ultimately driven by thousands of biochemical reactions occurring in the cells: the metabolism. These reactions form an intricate network which produces all required chemical compounds, i.e., metabolites, from a set of input molecules. Cells regulate the activity through metabolic reactions in a context-specific way; only reactions that are required in a cellular context, e.g., cell type, developmental stage or environmental condition, are usually active, while the rest remain inactive. The context-specificity of metabolism can be captured by several kinds of experimental data, such as by gene and protein expression or metabolite profiles. In addition, these context-specific data can be assimilated into computational models of metabolism, which then provide context-specific metabolic predictions. 
This thesis is composed of three individual studies focussing on context-specific experimental data integration into computational models of metabolism. The first study presents an optimization-based method to obtain context-specific metabolic predictions, and offers the advantage of being fully automated, i.e., free of user-defined parameters. The second study explores the effects of alternative optimal solutions arising during the generation of context-specific metabolic predictions. These alternative optimal solutions are metabolic model predictions that represent the integrated data equally well, but that can markedly differ. This study proposes algorithms to analyze the space of alternative solutions, as well as some ways to cope with their impact on the predictions.
Finally, the third study investigates the metabolic specialization of the guard cells of the plant Arabidopsis thaliana, and compares it with that of a different cell type, the mesophyll cells. To this end, the computational methods developed in this thesis are applied to obtain metabolic predictions specific to guard cells and mesophyll cells. These cell-specific predictions are then compared to explore the differences in metabolic activity between the two cell types. In addition, the effects of alternative optima are taken into consideration when comparing the two cell types. The computational results indicate a major reorganization of the primary metabolism in guard cells. These results are supported by an independent 13C labelling experiment.
N2  - Alle lebenserhaltenden Prozesse werden durch tausende biochemische Reaktionen in der Zelle bestimmt, welche den Metabolismus charakterisieren. Diese Reaktionen bilden ein komplexes Netzwerk, welches alle notwendigen chemischen Verbindungen, die sogenannten Metabolite, aus einer bestimmten Menge an Ausgangsmolekülen produziert. Zellen regulieren ihren Stoffwechsel kontextspezifisch, dies bedeutet, dass nur Reaktionen, die in einem zellulären Kontext, zum Beispiel Zelltyp, Entwicklungsstadium oder verschiedenen Umwelteinflüssen, benötigt werden, auch tatsächlich aktiv sind. Die übrigen Reaktionen werden als inaktiv betrachtet. Die Kontextspezifität des Metabolismus kann durch verschiedene experimentelle Daten, wie Gen- und Proteinexpressionen oder Metabolitprofile, erfasst werden. Zusätzlich können diese Daten in Computersimulationen des Metabolismus integriert werden, um kontextspezifische (metabolische) Vorhersagen zu treffen.
Diese Doktorarbeit besteht aus drei unabhängigen Studien, welche die Integration von kontextspezifischen experimentellen Daten in Computersimulationen des Metabolismus thematisieren. Die erste Studie beschreibt ein Konzept, basierend auf einem mathematischen Optimierungsproblem, welches es erlaubt, kontextspezifische metabolische Vorhersagen zu treffen. Dabei bietet diese vollautomatische Methode den Vorteil, vom Nutzer unabhängige Parameter zu verwenden. Die zweite Studie untersucht den Einfluss von alternativen optimalen Lösungen, welche bei kontextspezifischen metabolischen Vorhersagen generiert werden. Diese alternativen Lösungen stellen metabolische Modellvorhersagen dar, welche die integrierten Daten gleich gut widerspiegeln, sich aber grundlegend voneinander unterscheiden können. Diese Studie zeigt verschiedene Ansätze, alternative Lösungen zu analysieren und ihren Einfluss auf die Vorhersagen zu berücksichtigen.
Schlussendlich untersucht die dritte Studie die metabolische Spezialisierung der Schließzellen in Arabidopsis thaliana und vergleicht diese mit einer weiteren Zellart, den Mesophyllzellen. Zu diesem Zweck wurden die in dieser Doktorarbeit vorgestellten Methoden angewandt, um metabolische Vorhersagen speziell für Schließzellen und Mesophyllzellen zu erhalten. Anschließend wurden die zellspezifischen Vorhersagen auf Unterschiede in der metabolischen Aktivität der Zelltypen, unter Berücksichtigung des Effekts von alternativen Optima, untersucht. Die Ergebnisse der Simulationen legen eine grundlegende Neuorganisation des Primärmetabolismus in Schließzellen verglichen mit Mesophyllzellen nahe. Diese Ergebnisse werden durch unabhängige 13C-Markierungsexperimente bestätigt.
KW  - systems biology
KW  - bioinformatics
KW  - metabolic networks
KW  - constraint-based modeling
KW  - data integration
KW  - Systembiologie
KW  - Bioinformatik
KW  - Stoffwechselnetze
KW  - Constraint-basierte Modellierung
KW  - Datenintegration
Y1  - 2017
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-401365
ER  - 
TY  - THES
A1  - Bauckmann, Jana
T1  - Dependency discovery for data integration
T1  - Erkennen von Datenabhängigkeiten zur Datenintegration
N2  - Data integration aims to combine data of different sources and to provide users with a unified view on these data. This task is as challenging as it is valuable. In this thesis we propose algorithms for dependency discovery to provide necessary information for data integration. We focus on inclusion dependencies (INDs) in general and a special form named conditional inclusion dependencies (CINDs): (i) INDs enable the discovery of structure in a given schema. (ii) INDs and CINDs support the discovery of cross-references or links between schemas. An IND “A in B” simply states that all values of attribute A are included in the set of values of attribute B. We propose an algorithm that discovers all inclusion dependencies in a relational data source. The challenge of this task is the complexity of testing all attribute pairs and further of comparing all of each attribute pair's values. The complexity of existing approaches depends on the number of attribute pairs, while ours depends only on the number of attributes. Thus, our algorithm enables profiling of entirely unknown data sources with large schemas by discovering all INDs. Further, we provide an approach to extract foreign keys from the identified INDs. We extend our IND discovery algorithm to also find three special types of INDs: (i) Composite INDs, such as “AB in CD”, (ii) approximate INDs that allow a certain amount of values of A to be not included in B, and (iii) prefix and suffix INDs that represent special cross-references between schemas. Conditional inclusion dependencies are inclusion dependencies with a limited scope defined by conditions over several attributes. Only the matching part of the instance must adhere to the dependency. We generalize the definition of CINDs distinguishing covering and completeness conditions and define quality measures for conditions. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds.
The challenge for this task is twofold: (i) Which (and how many) attributes should be used for the conditions? (ii) Which attribute values should be chosen for the conditions? Previous approaches rely on pre-selected condition attributes or can only discover conditions applying to quality thresholds of 100%. Our approaches were motivated by two application domains: data integration in the life sciences and link discovery for linked open data. We show the efficiency and the benefits of our approaches for use cases in these domains.
N2  - Datenintegration hat das Ziel, Daten aus unterschiedlichen Quellen zu kombinieren und Nutzern eine einheitliche Sicht auf diese Daten zur Verfügung zu stellen. Diese Aufgabe ist gleichermaßen anspruchsvoll wie wertvoll. In dieser Dissertation werden Algorithmen zum Erkennen von Datenabhängigkeiten vorgestellt, die notwendige Informationen zur Datenintegration liefern. Der Schwerpunkt dieser Arbeit liegt auf Inklusionsabhängigkeiten (inclusion dependency, IND) im Allgemeinen und auf der speziellen Form der Bedingten Inklusionsabhängigkeiten (conditional inclusion dependency, CIND): (i) INDs ermöglichen das Finden von Strukturen in einem gegebenen Schema. (ii) INDs und CINDs unterstützen das Finden von Referenzen zwischen Datenquellen. Eine IND „A in B“ besagt, dass alle Werte des Attributs A in der Menge der Werte des Attributs B enthalten sind. Diese Arbeit liefert einen Algorithmus, der alle INDs in einer relationalen Datenquelle erkennt. Die Herausforderung dieser Aufgabe liegt in der Komplexität, alle Attributpaare zu testen und dabei alle Werte dieser Attributpaare zu vergleichen. Die Komplexität bestehender Ansätze ist abhängig von der Anzahl der Attributpaare, während der hier vorgestellte Ansatz lediglich von der Anzahl der Attribute abhängt. Damit ermöglicht der vorgestellte Algorithmus, unbekannte Datenquellen mit großen Schemata zu untersuchen. Darüber hinaus wird der Algorithmus erweitert, um drei spezielle Formen von INDs zu finden, und ein Ansatz vorgestellt, der Fremdschlüssel aus den erkannten INDs filtert. Bedingte Inklusionsabhängigkeiten (CINDs) sind Inklusionsabhängigkeiten, deren Geltungsbereich durch Bedingungen über bestimmten Attributen beschränkt ist. Nur der zutreffende Teil der Instanz muss der Inklusionsabhängigkeit genügen. Die Definition für CINDs wird in der vorliegenden Arbeit generalisiert durch die Unterscheidung von überdeckenden und vollständigen Bedingungen. Ferner werden Qualitätsmaße für Bedingungen definiert.
Es werden effiziente Algorithmen vorgestellt, die überdeckende und vollständige Bedingungen mit gegebenen Qualitätsmaßen auffinden. Dabei erfolgt die Auswahl der verwendeten Attribute und Attributkombinationen sowie der Attributwerte automatisch. Bestehende Ansätze beruhen auf einer Vorauswahl von Attributen für die Bedingungen oder erkennen nur Bedingungen mit Schwellwerten von 100% für die Qualitätsmaße. Die Ansätze der vorliegenden Arbeit wurden durch zwei Anwendungsbereiche motiviert: Datenintegration in den Life Sciences und das Erkennen von Links in Linked Open Data. Die Effizienz und der Nutzen der vorgestellten Ansätze werden anhand von Anwendungsfällen in diesen Bereichen aufgezeigt.
KW  - Datenabhängigkeiten-Entdeckung
KW  - Datenintegration
KW  - Schema-Entdeckung
KW  - Link-Entdeckung
KW  - Inklusionsabhängigkeit
KW  - dependency discovery
KW  - data integration
KW  - schema discovery
KW  - link discovery
KW  - inclusion dependency
Y1  - 2013
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-66645
ER  - 
TY  - JOUR
A1  - Paasche, Hendrik
A1  - Eberle, Detlef
T1  - Automated compilation of pseudo-lithology maps from geophysical data sets: a comparison of Gustafson-Kessel and fuzzy c-means cluster algorithms
JF  - Exploration geophysics : the bulletin of the Australian Society of Exploration Geophysicists
N2  - The fuzzy partitioning Gustafson-Kessel cluster algorithm is employed for rapid and objective integration of multi-parameter Earth-science related databases. We begin by evaluating the Gustafson-Kessel algorithm using the example of a synthetic study and compare the results to those obtained from the more widely employed fuzzy c-means algorithm. Since the Gustafson-Kessel algorithm goes beyond the potential of the fuzzy c-means algorithm by adapting the shape of the clusters to be detected and enabling a manual control of the cluster volume, we believe the results obtained from the Gustafson-Kessel algorithm to be superior. Accordingly, a field database comprising airborne and ground-based geophysical data sets is analysed, which has previously been classified by means of the fuzzy c-means algorithm. This database is integrated using the Gustafson-Kessel algorithm, thus minimising the amount of empirical data processing required before and after fuzzy c-means clustering. The resultant zonal geophysical map is more evenly clustered, matching regional geology information available from the survey area. Even additional information about linear structures, e.g., as typically caused by the presence of dolerite dykes or faults, is visible in the zonal map obtained from Gustafson-Kessel cluster analysis.
KW  - cluster analysis
KW  - data integration
KW  - airborne
KW  - South Africa
KW  - Gustafson-Kessel
KW  - fuzzy c-means
Y1  - 2011
U6  - https://doi.org/10.1071/EG11014
SN  - 0812-3985
VL  - 42
IS  - 4
SP  - 275
EP  - 285
PB  - CSIRO
CY  - Collingwood
ER  - 
TY  - BOOK
A1  - Bauckmann, Jana
A1  - Leser, Ulf
A1  - Naumann, Felix
T1  - Efficient and exact computation of inclusion dependencies for data integration
N2  - Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult, because in principle for each pair of attributes in the database each pair of data values must be compared. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. We present Spider, an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of the DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics to reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, and partial INDs, which are true INDs for all but a certain number of values. This last type is particularly relevant when integrating dirty data, as is often the case in the life sciences domain, our driving motivation.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 34 
KW  - Metadatenentdeckung
KW  - Metadatenqualität
KW  - Schemaentdeckung
KW  - Datenanalyse
KW  - Datenintegration
KW  - metadata discovery
KW  - metadata quality
KW  - schema discovery
KW  - data profiling
KW  - data integration
Y1  - 2010
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-41396
SN  - 978-3-86956-048-9
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  -