TY - BOOK A1 - Albrecht, Alexander A1 - Naumann, Felix T1 - Understanding cryptic schemata in large extract-transform-load systems N2 - Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems requires much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and ETL transformations. Hard-to-understand attribute labels lead to frustration and time spent to develop and understand ETL workflows. We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels consisting of a number of unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which we can decrypt to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that our approach is able to suggest high-quality decryptions for cryptic attribute labels in a given schema. N2 - Extract-Transform-Load (ETL) Tools werden häufig beim Erstellen, der Wartung und der Weiterentwicklung von Data Warehouses, Data Marts und operationalen Datenbanken verwendet. ETL Workflows befüllen diese Systeme mit Daten aus vielen unterschiedlichen Quellsystemen. Ein ETL Workflow besteht aus mehreren Transformationsschritten, die einen DAG-strukturierter Graphen bilden. Mit der Zeit entstehen hunderte individueller ETL Workflows, da neue Datenquellen integriert oder neue Anforderungen umgesetzt werden müssen. Die Wartung und Weiterentwicklung von großen ETL Systemen benötigt viel Zeit und manuelle Arbeit. Ein zentrales Problem ist dabei das Verständnis unbekannter Attributnamen in Quell- und Zieldatenbanken und ETL Transformationen. Schwer verständliche Attributnamen führen zu Frustration und hohen Zeitaufwänden bei der Entwicklung und dem Verständnis von ETL Workflows. Wir präsentieren eine Schema Decryption Technik, die ETL Entwicklern das Verständnis kryptischer Schemata in Quell- und Zieldatenbanken und ETL Transformationen erleichtert. Unser Ansatz berücksichtigt für ein gegebenes ETL System die Vielzahl verknüpfter Attributnamen in den existierenden ETL Workflows. So werden gute und aussagekräftige "Decryptions" gefunden und wir sind in der Lage Attributnamen, die aus unbekannten Abkürzungen bestehen, zu "decrypten". So wird z.B. für den Attributenamen UNP_PEN_INT als Decryption UNPAIN_PENALTY_INTEREST vorgeschlagen. Unser Schema Decryption Ansatz wurde für drei ETL-Repositories evaluiert und es zeigte sich, dass unser Ansatz qualitativ hochwertige Decryptions für kryptische Attributnamen vorschlägt. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 60 KW - Extract-Transform-Load (ETL) KW - Data Warehouse KW - Datenintegration KW - Extract-Transform-Load (ETL) KW - Data Warehouse KW - Data Integration Y1 - 2012 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-61257 SN - 978-3-86956-201-8 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - BOOK A1 - Herschel, Melanie A1 - Naumann, Felix T1 - Space and time scalability of duplicate detection in graph data N2 - Duplicate detection consists in determining different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. In this paper we scale up duplicate detection in graph data (DDG) to large amounts of data and pairwise comparisons, using the support of a relational database system. To this end, we first generalize the process of DDG. We then present how to scale algorithms for DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data not residing in main memory. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 25 Y1 - 2008 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-32851 SN - 978-3-940793-46-1 ER - TY - BOOK A1 - Meinel, Christoph A1 - Plattner, Hasso A1 - Döllner, Jürgen Roland Friedrich A1 - Weske, Mathias A1 - Polze, Andreas A1 - Hirschfeld, Robert A1 - Naumann, Felix A1 - Giese, Holger A1 - Baudisch, Patrick T1 - Proceedings of the 7th Ph.D. Retreat of the HPI Research School on Service-oriented Systems Engineering N2 - Design and Implementation of service-oriented architectures imposes a huge number of research questions from the fields of software engineering, system analysis and modeling, adaptability, and application integration. Component orientation and web services are two approaches for design and realization of complex web-based system. Both approaches allow for dynamic application adaptation as well as integration of enterprise application. Commonly used technologies, such as J2EE and .NET, form de facto standards for the realization of complex distributed systems. Evolution of component systems has lead to web services and service-based architectures. This has been manifested in a multitude of industry standards and initiatives such as XML, WSDL UDDI, SOAP, etc. All these achievements lead to a new and promising paradigm in IT systems engineering which proposes to design complex software solutions as collaboration of contractually defined software services. Service-Oriented Systems Engineering represents a symbiosis of best practices in object-orientation, component-based development, distributed computing, and business process management. It provides integration of business and IT concerns. The annual Ph.D. Retreat of the Research School provides each member the opportunity to present his/her current state of their research and to give an outline of a prospective Ph.D. thesis. Due to the interdisciplinary structure of the Research Scholl, this technical report covers a wide range of research topics. These include but are not limited to: Self-Adaptive Service-Oriented Systems, Operating System Support for Service-Oriented Systems, Architecture and Modeling of Service-Oriented Systems, Adaptive Process Management, Services Composition and Workflow Planning, Security Engineering of Service-Based IT Systems, Quantitative Analysis and Optimization of Service-Oriented Systems, Service-Oriented Systems in 3D Computer Graphics sowie Service-Oriented Geoinformatics. N2 - Der Entwurf und die Realisierung dienstbasierender Architekturen wirft eine Vielzahl von Forschungsfragestellungen aus den Gebieten der Softwaretechnik, der Systemmodellierung und -analyse, sowie der Adaptierbarkeit und Integration von Applikationen auf. Komponentenorientierung und WebServices sind zwei Ansätze für den effizienten Entwurf und die Realisierung komplexer Web-basierender Systeme. Sie ermöglichen die Reaktion auf wechselnde Anforderungen ebenso, wie die Integration großer komplexer Softwaresysteme. Heute übliche Technologien, wie J2EE und .NET, sind de facto Standards für die Entwicklung großer verteilter Systeme. Die Evolution solcher Komponentensysteme führt über WebServices zu dienstbasierenden Architekturen. Dies manifestiert sich in einer Vielzahl von Industriestandards und Initiativen wie XML, WSDL, UDDI, SOAP. All diese Schritte führen letztlich zu einem neuen, vielversprechenden Paradigma für IT Systeme, nach dem komplexe Softwarelösungen durch die Integration vertraglich vereinbarter Software-Dienste aufgebaut werden sollen. "Service-Oriented Systems Engineering" repräsentiert die Symbiose bewährter Praktiken aus den Gebieten der Objektorientierung, der Komponentenprogrammierung, des verteilten Rechnen sowie der Geschäftsprozesse und berücksichtigt auch die Integration von Geschäftsanliegen und Informationstechnologien. Die Klausurtagung des Forschungskollegs "Service-oriented Systems Engineering" findet einmal jährlich statt und bietet allen Kollegiaten die Möglichkeit den Stand ihrer aktuellen Forschung darzulegen. Bedingt durch die Querschnittstruktur des Kollegs deckt dieser Bericht ein große Bandbreite aktueller Forschungsthemen ab. Dazu zählen unter anderem Self-Adaptive Service-Oriented Systems, Operating System Support for Service-Oriented Systems, Architecture and Modeling of Service-Oriented Systems, Adaptive Process Management, Services Composition and Workflow Planning, Security Engineering of Service-Based IT Systems, Quantitative Analysis and Optimization of Service-Oriented Systems, Service-Oriented Systems in 3D Computer Graphics sowie Service-Oriented Geoinformatics. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 83 KW - Hasso-Plattner-Institut KW - Forschungskolleg KW - Klausurtagung KW - Service-oriented Systems Engineering KW - Hasso Plattner Institute KW - Research School KW - Ph.D. Retreat KW - Service-oriented Systems Engineering Y1 - 2014 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-63490 SN - 978-3-86956-273-5 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - JOUR A1 - Loster, Michael A1 - Koumarelas, Ioannis A1 - Naumann, Felix T1 - Knowledge transfer for entity resolution with siamese neural networks JF - ACM journal of data and information quality N2 - The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity-duplicates-into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise.
We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent. KW - Entity resolution KW - duplicate detection KW - transfer learning KW - neural KW - networks KW - metric learning KW - similarity learning KW - data quality Y1 - 2021 U6 - https://doi.org/10.1145/3410157 SN - 1936-1955 SN - 1936-1963 VL - 13 IS - 1 PB - Association for Computing Machinery CY - New York ER - TY - JOUR A1 - Bonifati, Angela A1 - Mior, Michael J. A1 - Naumann, Felix A1 - Noack, Nele Sina T1 - How inclusive are we? BT - an analysis of gender diversity in database venues JF - SIGMOD record / Association for Computing Machinery, Special Interest Group on Management of Data N2 - ACM SIGMOD, VLDB and other database organizations have committed to fostering an inclusive and diverse community, as do many other scientific organizations. Recently, different measures have been taken to advance these goals, especially for underrepresented groups. One possible measure is double-blind reviewing, which aims to hide gender, ethnicity, and other properties of the authors.
We report the preliminary results of a gender diversity analysis of publications of the database community across several peer-reviewed venues, and also compare women's authorship percentages in both single-blind and double-blind venues along the years. We also obtained a cross comparison of the obtained results in data management with other relevant areas in Computer Science. Y1 - 2022 U6 - https://doi.org/10.1145/3516431.3516438 SN - 0163-5808 SN - 1943-5835 VL - 50 IS - 4 SP - 30 EP - 35 PB - Association for Computing Machinery CY - New York ER - TY - BOOK A1 - Lange, Dustin A1 - Böhm, Christoph A1 - Naumann, Felix T1 - Extracting structured information from Wikipedia articles to populate infoboxes N2 - Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes. With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values for independently extracting value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes. N2 - Ungefähr jeder dritte Wikipedia-Artikel enthält eine Infobox - eine Tabelle, die wichtige Fakten über das beschriebene Thema in Attribut-Wert-Form darstellt. Das Schema einer Infobox, d.h. die Attribute, die für ein Konzept verwendet werden können, wird durch ein Infobox-Template definiert. Häufig geben Autoren nicht für alle Template-Attribute Werte an, wodurch unvollständige Infoboxen entstehen. Mit iPopulator stellen wir ein System vor, welches automatisch Infoboxen von Wikipedia-Artikeln durch Extrahieren von Attributwerten aus dem Artikeltext befüllt. Im Unterschied zu früheren Arbeiten erkennt iPopulator die Struktur von Attributwerten und nutzt diese aus, um die einzelnen Bestandteile von Attributwerten unabhängig voneinander zu extrahieren. Wir haben iPopulator auf der gesamten Menge der Infobox-Templates getestet und analysieren detailliert die Effektivität. Wir erreichen beispielsweise für die Extraktion einen durchschnittlichen Precision-Wert von 91% für 1.727 verschiedene Infobox-Template-Attribute. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 38 KW - Informationsextraktion KW - Wikipedia KW - Linked Data KW - Information Extraction KW - Wikipedia KW - Linked Data Y1 - 2010 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-45714 SN - 978-3-86956-081-6 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - JOUR A1 - Hacker, Philipp A1 - Krestel, Ralf A1 - Grundmann, Stefan A1 - Naumann, Felix T1 - Explainable AI under contract and tort law BT - legal incentives and technical challenges JF - Artificial intelligence and law N2 - This paper shows that the law, in subtle ways, may set hitherto unrecognized incentives for the adoption of explainable machine learning applications. In doing so, we make two novel contributions. First, on the legal side, we show that to avoid liability, professional actors, such as doctors and managers, may soon be legally compelled to use explainable ML models. We argue that the importance of explainability reaches far beyond data protection law, and crucially influences questions of contractual and tort liability for the use of ML models. To this effect, we conduct two legal case studies, in medical and corporate merger applications of ML. As a second contribution, we discuss the (legally required) trade-off between accuracy and explainability and demonstrate the effect in a technical case study in the context of spam classification. KW - explainability KW - explainable AI KW - interpretable machine learning KW - contract KW - law KW - tort law KW - explainability-accuracy trade-off KW - medical malpractice KW - corporate takeovers Y1 - 2020 U6 - https://doi.org/10.1007/s10506-020-09260-6 SN - 0924-8463 SN - 1572-8382 VL - 28 IS - 4 SP - 415 EP - 439 PB - Springer CY - Dordrecht ER - TY - BOOK A1 - Bauckmann, Jana A1 - Leser, Ulf A1 - Naumann, Felix T1 - Efficient and exact computation of inclusion dependencies for data integration N2 - Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult, because in principle for each pair of attributes in the database each pair of data values must be compared. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. We present with Spider an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics to reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, and partial INDs, which are true INDs for all but a certain number of values. This last type is particularly relevant when integrating dirty data as is often the case in the life sciences domain - our driving motivation. T3 - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 34 KW - Metadatenentdeckung KW - Metadatenqualität KW - Schemaentdeckung KW - Datenanalyse KW - Datenintegration KW - metadata discovery KW - metadata quality KW - schema discovery KW - data profiling KW - data integration Y1 - 2010 U6 - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-41396 SN - 978-3-86956-048-9 PB - Universitätsverlag Potsdam CY - Potsdam ER - TY - JOUR A1 - Vitagliano, Gerardo A1 - Jiang, Lan A1 - Naumann, Felix T1 - Detecting layout templates in complex multiregion files JF - Proceedings of the VLDB Endowment N2 - Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performances in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files. Y1 - 2022 U6 - https://doi.org/10.14778/3494124.3494145 SN - 2150-8097 VL - 15 IS - 3 SP - 646 EP - 658 PB - Association for Computing Machinery CY - New York ER - TY - JOUR A1 - Koumarelas, Ioannis A1 - Jiang, Lan A1 - Naumann, Felix T1 - Data preparation for duplicate detection JF - Journal of data and information quality : (JDIQ) N2 - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection.
Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR. KW - data preparation KW - data wrangling KW - record linkage KW - duplicate detection KW - similarity measures Y1 - 2020 U6 - https://doi.org/10.1145/3377878 SN - 1936-1955 SN - 1936-1963 VL - 12 IS - 3 PB - Association for Computing Machinery CY - New York ER -