Refine
Year of publication
Document Type
- Article (15)
- Monograph/Edited Volume (9)
Language
- English (24) (remove)
Is part of the Bibliography
- yes (24)
Keywords
- data profiling (4)
- Datenintegration (3)
- duplicate detection (3)
- similarity measures (3)
- Data Integration (2)
- Functional dependencies (2)
- data matching (2)
- data quality (2)
- data wrangling (2)
- entity resolution (2)
- record linkage (2)
- Address matching (1)
- Approximation algorithms (1)
- Apriori (1)
- Association Rule Mining (1)
- Assoziationsregeln (1)
- Bedingte Inklusionsabhängigkeiten (1)
- Big Data (1)
- Complexity theory (1)
- Conditional Inclusion Dependency (1)
- Data Dependency (1)
- Data Profiling (1)
- Data Quality (1)
- Data Warehouse (1)
- Data dependencies (1)
- Data profiling (1)
- Datenabhängigkeiten (1)
- Datenanalyse (1)
- Datenqualität (1)
- Distributed (1)
- Duplicate Detection (1)
- Duplikaterkennung (1)
- Entity resolution (1)
- Erkennen von Meta-Daten (1)
- Extract-Transform-Load (ETL) (1)
- Forschungskolleg (1)
- Hasso Plattner Institute (1)
- Hasso-Plattner-Institut (1)
- Inclusion dependencies (1)
- Information Extraction (1)
- Information Systems (1)
- Informationsextraktion (1)
- Informationssysteme (1)
- Klausurtagung (1)
- Lakes (1)
- Link Discovery (1)
- Link-Entdeckung (1)
- Linked Data (1)
- Linked Open Data (1)
- Metadata Discovery (1)
- Metadatenentdeckung (1)
- Metadatenqualität (1)
- Order dependencies (1)
- Ph.D. Retreat (1)
- Query execution (1)
- Query optimization (1)
- Record linkage (1)
- Relational data (1)
- Research School (1)
- SQL (1)
- Schemaentdeckung (1)
- Schlüsselentdeckung (1)
- Semantics (1)
- Service-oriented Systems Engineering (1)
- Unique column combinations (1)
- Wikipedia (1)
- address normalization (1)
- address parsing (1)
- apriori (1)
- clustering (1)
- conditional functional dependencies (1)
- contract (1)
- corporate takeovers (1)
- data cleaning (1)
- data cleansing (1)
- data integration (1)
- data preparation (1)
- databases (1)
- deduplication (1)
- dependency discovery (1)
- eindeutig (1)
- explainability (1)
- explainability-accuracy trade-off (1)
- explainable AI (1)
- functional dependencies (1)
- functional dependency (1)
- funktionale Abhängigkeit (1)
- geocoding (1)
- geographic information systems (1)
- interpretable machine learning (1)
- key discovery (1)
- law (1)
- matching dependencies (1)
- medical malpractice (1)
- metadata discovery (1)
- metadata quality (1)
- metric learning (1)
- networks (1)
- neural (1)
- random forest (1)
- schema discovery (1)
- similarity learning (1)
- tort law (1)
- transfer learning (1)
- unique (1)
Institute
- Hasso-Plattner-Institut für Digital Engineering gGmbH (24) (remove)
Design and Implementation of service-oriented architectures imposes a huge number of research questions from the fields of software engineering, system analysis and modeling, adaptability, and application integration. Component orientation and web services are two approaches for design and realization of complex web-based system. Both approaches allow for dynamic application adaptation as well as integration of enterprise application. Commonly used technologies, such as J2EE and .NET, form de facto standards for the realization of complex distributed systems. Evolution of component systems has lead to web services and service-based architectures. This has been manifested in a multitude of industry standards and initiatives such as XML, WSDL UDDI, SOAP, etc. All these achievements lead to a new and promising paradigm in IT systems engineering which proposes to design complex software solutions as collaboration of contractually defined software services. Service-Oriented Systems Engineering represents a symbiosis of best practices in object-orientation, component-based development, distributed computing, and business process management. It provides integration of business and IT concerns. The annual Ph.D. Retreat of the Research School provides each member the opportunity to present his/her current state of their research and to give an outline of a prospective Ph.D. thesis. Due to the interdisciplinary structure of the Research Scholl, this technical report covers a wide range of research topics. These include but are not limited to: Self-Adaptive Service-Oriented Systems, Operating System Support for Service-Oriented Systems, Architecture and Modeling of Service-Oriented Systems, Adaptive Process Management, Services Composition and Workflow Planning, Security Engineering of Service-Based IT Systems, Quantitative Analysis and Optimization of Service-Oriented Systems, Service-Oriented Systems in 3D Computer Graphics sowie Service-Oriented Geoinformatics.
The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity-duplicates-into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. <br /> We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the wellknown GORDIAN algorithm and "Apriori-based" algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution HCAGORDIAN combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations.
Entity resolution on-demand
(2022)
Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity.
ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them.
For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing data.
Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner-a fundamental requirement of ELT (Extract-Load-Transform) pipelines.
We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution.
For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.