TY  - JOUR
A1  - Bleifuss, Tobias
A1  - Bornemann, Leon
A1  - Johnson, Theodore
A1  - Kalashnikov, Dmitri
A1  - Naumann, Felix
A1  - Srivastava, Divesh
T1  - Exploring Change
BT  - a new dimension of data analytics
JF  - Proceedings of the VLDB Endowment
N2  - Data and metadata in datasets experience many different kinds of change. Values axe inserted, deleted or updated; rows appear and disappear; columns are added or repurposed, etc. In such a dynamic situation, users might have many questions related to changes in the dataset, for instance which parts of the data are trustworthy and which are not? Users will wonder: How many changes have there been in the recent minutes, days or years? What kind of changes were made at which points of time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has been recently changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We envision a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. We identify technical challenges that need to be addressed to make our vision a reality, and propose directions of future work for the data management community.
Y1  - 2018
U6  - https://doi.org/10.14778/3282495.3282496
SN  - 2150-8097
VL  - 12
IS  - 2
SP  - 85
EP  - 98
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Berti-Equille, Laure
A1  - Harmouch, Nazar
A1  - Naumann, Felix
A1  - Novelli, Noel
A1  - Saravanan, Thirumuruganathan
T1  - Discovery of genuine functional dependencies from relational data with missing values
JF  - Proceedings of the VLDB Endowment
N2  - Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely due to missing values or some non-genuine FDs can be discovered even though they are caused by missing values with a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
Y1  - 2018
U6  - https://doi.org/10.14778/3204028.3204032
SN  - 2150-8097
VL  - 11
IS  - 8
SP  - 880
EP  - 892
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Papenbrock, Thorsten
A1  - Naumann, Felix
T1  - MDedup
BT  - duplicate detection with matching dependencies
JF  - Proceedings of the VLDB Endowment
N2  - Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific, which is a problem if a new dataset needs to be cleaned. 
For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain or target data-specific configuration.
Y1  - 2020
U6  - https://doi.org/10.14778/3377369.3377379
SN  - 2150-8097
VL  - 13
IS  - 5
SP  - 712
EP  - 725
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Data preparation for duplicate detection
JF  - Journal of data and information quality : (JDIQ)
N2  - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. <br /> Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR.
KW  - data preparation
KW  - data wrangling
KW  - record linkage
KW  - duplicate detection
KW  - similarity measures
Y1  - 2020
U6  - https://doi.org/10.1145/3377878
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 3
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - BOOK
A1  - Draisbach, Uwe
A1  - Naumann, Felix
A1  - Szott, Sascha
A1  - Wonneberg, Oliver
T1  - Adaptive windows for duplicate detection
N2  - Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity, respectively. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records and (ii) data sets might have a high volume making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition of such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaption strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
N2  - Duplikaterkennung beschreibt das Auffinden von mehreren Datensätzen, die das gleiche Realwelt-Objekt repräsentieren. Diese Aufgabe ist nicht trivial, da sich (i) die Datensätze geringfügig unterscheiden können, so dass Ähnlichkeitsmaße für einen paarweisen Vergleich benötigt werden, und (ii) aufgrund der Datenmenge ein vollständiger, paarweiser Vergleich nicht möglich ist. Zur Lösung des zweiten Problems existieren verschiedene Algorithmen, die die Datenmenge partitionieren und nur noch innerhalb der Partitionen Vergleiche durchführen. Einer dieser Algorithmen ist die Sorted-Neighborhood-Methode (SNM), welche Daten anhand eines Schlüssels sortiert und dann ein Fenster über die sortierten Daten schiebt. Vergleiche werden nur innerhalb dieses Fensters durchgeführt. Wir beschreiben verschiedene Variationen der Sorted-Neighborhood-Methode, die auf variierenden Fenstergrößen basieren. Diese Ansätze basieren auf der Intuition, dass Bereiche mit größerer und geringerer Ähnlichkeiten innerhalb der sortierten Datensätze existieren, für die entsprechend größere bzw. kleinere Fenstergrößen sinnvoll sind. Wir beschreiben und evaluieren verschiedene Adaptierungs-Strategien, von denen nachweislich einige bezüglich Effizienz besser sind als die originale Sorted-Neighborhood-Methode (gleiches Ergebnis bei weniger Vergleichen).
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 49 
KW  - Informationssysteme
KW  - Datenqualität
KW  - Datenintegration
KW  - Duplikaterkennung
KW  - Duplicate Detection
KW  - Data Quality
KW  - Data Integration
KW  - Information Systems
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-53007
SN  - 978-3-86956-143-1
SN  - 1613-5652
SN  - 2191-1665
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - JOUR
A1  - Koumarelas, Ioannis
A1  - Kroschk, Axel
A1  - Mosley, Clifford
A1  - Naumann, Felix
T1  - Experience: Enhancing address matching with geocoding and similarity measure selection
JF  - Journal of Data and Information Quality
N2  - Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. However, in most cases, problems do arise, for instance, as a result of data errors or data integrated from multiple sources or received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure to, first, enrich each record with a more complete representation of the address information through geocoding and reverse-geocoding and, second, to select the best similarity measure per each address attribute that will finally help the classifier to achieve the best f-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use-case.
KW  - Address matching
KW  - record linkage
KW  - duplicate detection
KW  - similarity measures
KW  - conditional functional dependencies
KW  - address normalization
KW  - address parsing
KW  - geocoding
KW  - geographic information systems
KW  - random forest
Y1  - 2018
U6  - https://doi.org/10.1145/3232852
SN  - 1936-1955
VL  - 10
IS  - 2
SP  - 1
EP  - 16
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - GEN
A1  - Loster, Michael
A1  - Naumann, Felix
A1  - Ehmueller, Jan
A1  - Feldmann, Benjamin
T1  - CurEx
BT  - a system for extracting, curating, and exploring domain-specific knowledge graphs from text
T2  - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
N2  - The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks, such as the Paradise or Panama papers. Both the creation of such knowledge bases, as well as their continuous maintenance and curation involves many complex tasks and considerable manual effort. With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end-users respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.
Y1  - 2018
SN  - 978-1-4503-6014-2
U6  - https://doi.org/10.1145/3269206.3269229
SP  - 1883
EP  - 1886
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Hameed, Mazhar
A1  - Naumann, Felix
T1  - Data Preparation
BT  - a survey of commercial tools
JF  - SIGMOD record
N2  - Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. <br /> To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools and, (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.
KW  - data quality
KW  - data cleaning
KW  - data wrangling
Y1  - 2020
U6  - https://doi.org/10.1145/3444831.3444835
SN  - 0163-5808
SN  - 1943-5835
VL  - 49
IS  - 3
SP  - 18
EP  - 29
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Holistic primary key and foreign key detection
JF  - Journal of intelligent information systems : JIIS
N2  - Primary keys (PKs) and foreign keys (FKs) are important elements of relational schemata in various applications, such as query optimization and data integration. However, in many cases, these constraints are unknown or not documented. Detecting them manually is time-consuming and even infeasible in large-scale datasets. We study the problem of discovering primary keys and foreign keys automatically and propose an algorithm to detect both, namely Holistic Primary Key and Foreign Key Detection (HoPF). PKs and FKs are subsets of the sets of unique column combinations (UCCs) and inclusion dependencies (INDs), respectively, for which efficient discovery algorithms are known. Using score functions, our approach is able to effectively extract the true PKs and FKs from the vast sets of valid UCCs and INDs. Several pruning rules are employed to speed up the procedure. We evaluate precision and recall on three benchmarks and two real-world datasets. The results show that our method is able to retrieve on average 88% of all primary keys, and 91% of all foreign keys. We compare the performance of HoPF with two baseline approaches that both assume the existence of primary keys.
KW  - Data profiling application
KW  - Primary key
KW  - Foreign key
KW  - Database
KW  - management
Y1  - 2019
U6  - https://doi.org/10.1007/s10844-019-00562-z
SN  - 0925-9902
SN  - 1573-7675
VL  - 54
IS  - 3
SP  - 439
EP  - 461
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - JOUR
A1  - Schirmer, Philipp
A1  - Papenbrock, Thorsten
A1  - Koumarelas, Ioannis
A1  - Naumann, Felix
T1  - Efficient discovery of matching dependencies
JF  - ACM transactions on database systems : TODS
N2  - Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. 
We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
KW  - matching dependencies
KW  - functional dependencies
KW  - dependency discovery
KW  - data profiling
KW  - data matching
KW  - entity resolution
KW  - similarity measures
Y1  - 2020
U6  - https://doi.org/10.1145/3392778
SN  - 0362-5915
SN  - 1557-4644
VL  - 45
IS  - 3
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - JOUR
A1  - Hacker, Philipp
A1  - Krestel, Ralf
A1  - Grundmann, Stefan
A1  - Naumann, Felix
T1  - Explainable AI under contract and tort law
BT  - legal incentives and technical challenges
JF  - Artificial intelligence and law
N2  - This paper shows that the law, in subtle ways, may set hitherto unrecognized incentives for the adoption of explainable machine learning applications. In doing so, we make two novel contributions. First, on the legal side, we show that to avoid liability, professional actors, such as doctors and managers, may soon be legally compelled to use explainable ML models. We argue that the importance of explainability reaches far beyond data protection law, and crucially influences questions of contractual and tort liability for the use of ML models. To this effect, we conduct two legal case studies, in medical and corporate merger applications of ML. As a second contribution, we discuss the (legally required) trade-off between accuracy and explainability and demonstrate the effect in a technical case study in the context of spam classification.
KW  - explainability
KW  - explainable AI
KW  - interpretable machine learning
KW  - contract
KW  - law
KW  - tort law
KW  - explainability-accuracy trade-off
KW  - medical malpractice
KW  - corporate takeovers
Y1  - 2020
U6  - https://doi.org/10.1007/s10506-020-09260-6
SN  - 0924-8463
SN  - 1572-8382
VL  - 28
IS  - 4
SP  - 415
EP  - 439
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - JOUR
A1  - Draisbach, Uwe
A1  - Christen, Peter
A1  - Naumann, Felix
T1  - Transforming pairwise duplicates to entity clusters for high-quality duplicate detection
JF  - ACM Journal of Data and Information Quality
N2  - Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result. <br /> We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
KW  - Record linkage
KW  - data matching
KW  - entity resolution
KW  - deduplication
KW  - clustering
Y1  - 2019
U6  - https://doi.org/10.1145/3352591
SN  - 1936-1955
SN  - 1936-1963
VL  - 12
IS  - 1
SP  - 1
EP  - 30
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - BOOK
A1  - Naumann, Felix
T1  - Informationsqualität : Antrittsvorlesung 2007-04-26
N2  - Sowohl in kommerziellen als auch in wissenschaftlichen Datenbanken sind Daten von niedriger Qualität allgegenwärtig. Das kann zu erheblichen wirtschaftlichen Problemen führen", erläutert der 35-jährige Informatik-Professor und verweist zum Beispiel auf Duplikate. Diese können entstehen, wenn in Unternehmen verschiedene Kundendatenbestände zusammengefügt werden, aber die Integration mehrere Datensätze des gleichen Kunden hinterlässt. "Solche doppelten Einträge zu finden, ist aus zwei Gründen schwierig: Zum einen ist die Menge der Daten oft sehr groß, zum anderen können sich Einträge über die gleiche Person leicht unterscheiden", beschreibt Prof. Naumann häufig auftretende Probleme. In seiner Antrittsvorlesung will er zwei Lösungswege vorstellen: Erstens die Definition geeigneter Ähnlichkeitsmaße und zweitens die Nutzung von Algorithmen, die es vermeiden, jeden Datensatz mit jedem anderen zu vergleichen. Außerdem soll es um grundlegende Aspekte der Verständlichkeit, Objektivität, Vollständigkeit und Fehlerhaftigkeit von Daten gehen.
Y1  - 2008
UR  - http://info.ub.uni-potsdam.de/multimedia/show_projekt.php?projekt_id=9
PB  - Univ.-Bibl.
CY  - Potsdam
ER  - 
TY  - JOUR
A1  - Momtazi, Saeedeh
A1  - Naumann, Felix
T1  - Topic modeling for expert finding using latent Dirichlet allocation
JF  - Wiley interdisciplinary reviews : Data mining and knowledge discovery
N2  - The task of expert finding is to rank the experts in the search space given a field of expertise as an input query. In this paper, we propose a topic modeling approach for this task. The proposed model uses latent Dirichlet allocation (LDA) to induce probabilistic topics. In the first step of our algorithm, the main topics of a document collection are extracted using LDA. The extracted topics present the connection between expert candidates and user queries. In the second step, the topics are used as a bridge to find the probability of selecting each candidate for a given query. The candidates are then ranked based on these probabilities. The experimental results on the Text REtrieval Conference (TREC) Enterprise track for 2005 and 2006 show that the proposed topic-based approach outperforms the state-of-the-art profile- and document-based models, which use information retrieval methods to rank experts. Moreover, we present the superiority of the proposed topic-based approach to the improved document-based expert finding systems, which consider additional information such as local context, candidate prior, and query expansion.
Y1  - 2013
U6  - https://doi.org/10.1002/widm.1102
SN  - 1942-4787
VL  - 3
IS  - 5
SP  - 346
EP  - 353
PB  - Wiley
CY  - San Fransisco
ER  - 
TY  - BOOK
A1  - Albrecht, Alexander
A1  - Naumann, Felix
T1  - Understanding cryptic schemata in large extract-transform-load systems
N2  - Extract-Transform-Load (ETL) tools are used for the creation, maintenance, and evolution of data warehouses, data marts, and operational data stores. ETL workflows populate those systems with data from various data sources by specifying and executing a DAG of transformations. Over time, hundreds of individual workflows evolve as new sources and new requirements are integrated into the system. The maintenance and evolution of large-scale ETL systems requires much time and manual effort. A key problem is to understand the meaning of unfamiliar attribute labels in source and target databases and ETL transformations. Hard-to-understand attribute labels lead to frustration and time spent to develop and understand ETL workflows. We present a schema decryption technique to support ETL developers in understanding cryptic schemata of sources, targets, and ETL transformations. For a given ETL system, our recommender-like approach leverages the large number of mapped attribute labels in existing ETL workflows to produce good and meaningful decryptions. In this way we are able to decrypt attribute labels consisting of a number of unfamiliar few-letter abbreviations, such as UNP_PEN_INT, which we can decrypt to UNPAID_PENALTY_INTEREST. We evaluate our schema decryption approach on three real-world repositories of ETL workflows and show that our approach is able to suggest high-quality decryptions for cryptic attribute labels in a given schema.
N2  - Extract-Transform-Load (ETL) Tools werden häufig beim Erstellen, der Wartung und der Weiterentwicklung von Data Warehouses, Data Marts und operationalen Datenbanken verwendet. ETL Workflows befüllen diese Systeme mit Daten aus vielen unterschiedlichen Quellsystemen. Ein ETL Workflow besteht aus mehreren Transformationsschritten, die einen DAG-strukturierter Graphen bilden. Mit der Zeit entstehen hunderte individueller ETL Workflows, da neue Datenquellen integriert oder neue Anforderungen umgesetzt werden müssen. Die Wartung und Weiterentwicklung von großen ETL Systemen benötigt viel Zeit und manuelle Arbeit. Ein zentrales Problem ist dabei das Verständnis unbekannter Attributnamen in Quell- und Zieldatenbanken und ETL Transformationen. Schwer verständliche Attributnamen führen zu Frustration und hohen Zeitaufwänden bei der Entwicklung und dem Verständnis von ETL Workflows. Wir präsentieren eine Schema Decryption Technik, die ETL Entwicklern das Verständnis kryptischer Schemata in Quell- und Zieldatenbanken und ETL Transformationen erleichtert. Unser Ansatz berücksichtigt für ein gegebenes ETL System die Vielzahl verknüpfter Attributnamen in den existierenden ETL Workflows. So werden gute und aussagekräftige "Decryptions" gefunden und wir sind in der Lage Attributnamen, die aus unbekannten Abkürzungen bestehen, zu "decrypten". So wird z.B. für den Attributenamen UNP_PEN_INT als Decryption UNPAIN_PENALTY_INTEREST vorgeschlagen. Unser Schema Decryption Ansatz wurde für drei ETL-Repositories evaluiert und es zeigte sich, dass unser Ansatz qualitativ hochwertige Decryptions für kryptische Attributnamen vorschlägt.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 60 
KW  - Extract-Transform-Load (ETL)
KW  - Data Warehouse
KW  - Datenintegration
KW  - Extract-Transform-Load (ETL)
KW  - Data Warehouse
KW  - Data Integration
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-61257
SN  - 978-3-86956-201-8
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Bauckmann, Jana
A1  - Abedjan, Ziawasch
A1  - Leser, Ulf
A1  - Müller, Heiko
A1  - Naumann, Felix
T1  - Covering or complete? : Discovering conditional inclusion dependencies
N2  - Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In the last years conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
N2  - Datenabhängigkeiten (wie zum Beispiel Integritätsbedingungen), werden verwendet, um die Qualität eines Datenbankschemas zu erhöhen, um Anfragen zu optimieren und um Konsistenz in einer Datenbank sicherzustellen. In den letzten Jahren wurden bedingte Abhängigkeiten (conditional dependencies) vorgestellt, die die Qualität von Daten analysieren und verbessern sollen. Eine bedingte Abhängigkeit ist eine Abhängigkeit mit begrenztem Gültigkeitsbereich, der über Bedingungen auf einem oder mehreren Attributen definiert wird. In diesem Bericht betrachten wir bedingte Inklusionsabhängigkeiten (conditional inclusion dependencies; CINDs). Wir generalisieren die Definition von CINDs anhand der Unterscheidung von überdeckenden (covering) und vollständigen (completeness) Bedingungen. Wir stellen einen Anwendungsfall für solche CINDs vor, der den Nutzen von CINDs bei der Lösung komplexer Datenqualitätsprobleme aufzeigt. Darüber hinaus definieren wir Qualitätsmaße für Bedingungen basierend auf Sensitivität und Genauigkeit. Wir stellen effiziente Algorithmen vor, die überdeckende und vollständige Bedingungen innerhalb vorgegebener Schwellwerte finden. Unsere Algorithmen wählen nicht nur die Werte der Bedingungen, sondern finden auch die Bedingungsattribute automatisch. Abschließend zeigen wir, dass unser Ansatz effizient sinnvolle und hilfreiche Ergebnisse für den vorgestellten Anwendungsfall liefert.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 62 
KW  - Datenabhängigkeiten
KW  - Bedingte Inklusionsabhängigkeiten
KW  - Erkennen von Meta-Daten
KW  - Linked Open Data
KW  - Link-Entdeckung
KW  - Assoziationsregeln
KW  - Data Dependency
KW  - Conditional Inclusion Dependency
KW  - Metadata Discovery
KW  - Linked Open Data
KW  - Link Discovery
KW  - Association Rule Mining
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-62089
SN  - 978-3-86956-212-4
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Herschel, Melanie
A1  - Naumann, Felix
T1  - Space and time scalability of duplicate detection in graph data
N2  - Duplicate detection consists in determining different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. In this paper we scale up duplicate detection in graph data (DDG) to large amounts of data and pairwise comparisons, using the support of a relational database system. To this end, we first generalize the process of DDG. We then present how to scale algorithms for DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data not residing in main memory.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 25 
Y1  - 2008
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-32851
SN  - 978-3-940793-46-1
ER  - 
TY  - BOOK
A1  - Abedjan, Ziawasch
A1  - Golab, Lukasz
A1  - Naumann, Felix
A1  - Papenbrock, Thorsten
T1  - Data Profiling
T3  - Synthesis lectures on data management, 52
Y1  - 2019
SN  - 978-1-68173-446-0
PB  - Morgan & Claypool Publishers
CY  - San Rafael
ER  - 
TY  - BOOK
A1  - Lange, Dustin
A1  - Böhm, Christoph
A1  - Naumann, Felix
T1  - Extracting structured information from Wikipedia articles to populate infoboxes
N2  - Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes. With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values for independently extracting value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.
N2  - Ungefähr jeder dritte Wikipedia-Artikel enthält eine Infobox - eine Tabelle, die wichtige Fakten über das beschriebene Thema in Attribut-Wert-Form darstellt. Das Schema einer Infobox, d.h. die Attribute, die für ein Konzept verwendet werden können, wird durch ein Infobox-Template definiert. Häufig geben Autoren nicht für alle Template-Attribute Werte an, wodurch unvollständige Infoboxen entstehen. Mit iPopulator stellen wir ein System vor, welches automatisch Infoboxen von Wikipedia-Artikeln durch Extrahieren von Attributwerten aus dem Artikeltext befüllt. Im Unterschied zu früheren Arbeiten erkennt iPopulator die Struktur von Attributwerten und nutzt diese aus, um die einzelnen Bestandteile von Attributwerten unabhängig voneinander zu extrahieren. Wir haben iPopulator auf der gesamten Menge der Infobox-Templates getestet und analysieren detailliert die Effektivität. Wir erreichen beispielsweise für die Extraktion einen durchschnittlichen Precision-Wert von 91% für 1.727 verschiedene Infobox-Template-Attribute.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 38 
KW  - Informationsextraktion
KW  - Wikipedia
KW  - Linked Data
KW  - Information Extraction
KW  - Wikipedia
KW  - Linked Data
Y1  - 2010
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-45714
SN  - 978-3-86956-081-6
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Bauckmann, Jana
A1  - Leser, Ulf
A1  - Naumann, Felix
T1  - Efficient and exact computation of inclusion dependencies for data integration
N2  - Data obtained from foreign data sources often come with only superficial structural information, such as relation names and attribute names. Other types of metadata that are important for effective integration and meaningful querying of such data sets are missing. In particular, relationships among attributes, such as foreign keys, are crucial metadata for understanding the structure of an unknown database. The discovery of such relationships is difficult, because in principle for each pair of attributes in the database each pair of data values must be compared. A precondition for a foreign key is an inclusion dependency (IND) between the key and the foreign key attributes. We present with Spider an algorithm that efficiently finds all INDs in a given relational database. It leverages the sorting facilities of DBMS but performs the actual comparisons outside of the database to save computation. Spider analyzes very large databases up to an order of magnitude faster than previous approaches. We also evaluate in detail the effectiveness of several heuristics to reduce the number of necessary comparisons. Furthermore, we generalize Spider to find composite INDs covering multiple attributes, and partial INDs, which are true INDs for all but a certain number of values. This last type is particularly relevant when integrating dirty data as is often the case in the life sciences domain - our driving motivation.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 34 
KW  - Metadatenentdeckung
KW  - Metadatenqualität
KW  - Schemaentdeckung
KW  - Datenanalyse
KW  - Datenintegration
KW  - metadata discovery
KW  - metadata quality
KW  - schema discovery
KW  - data profiling
KW  - data integration
Y1  - 2010
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-41396
SN  - 978-3-86956-048-9
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - JOUR
A1  - Birnick, Johann
A1  - Bläsius, Thomas
A1  - Friedrich, Tobias
A1  - Naumann, Felix
A1  - Papenbrock, Thorsten
A1  - Schirneck, Friedrich Martin
T1  - Hitting set enumeration with partial information for unique column combination discovery
JF  - Proceedings of the VLDB Endowment
N2  - Unique column combinations (UCCs) are a fundamental concept in relational databases. They identify entities in the data and support various data management activities. Still, UCCs are usually not explicitly defined and need to be discovered. State-of-the-art data profiling algorithms are able to efficiently discover UCCs in moderately sized datasets, but they tend to fail on large and, in particular, on wide datasets due to run time and memory limitations. <br /> In this paper, we introduce HPIValid, a novel UCC discovery algorithm that implements a faster and more resource-saving search strategy. HPIValid models the metadata discovery as a hitting set enumeration problem in hypergraphs. In this way, it combines efficient discovery techniques from data profiling research with the most recent theoretical insights into enumeration algorithms. Our evaluation shows that HPIValid is not only orders of magnitude faster than related work, it also has a much smaller memory footprint.
Y1  - 2020
U6  - https://doi.org/10.14778/3407790.3407824
SN  - 2150-8097
VL  - 13
IS  - 11
SP  - 2270
EP  - 2283
PB  - Association for Computing Machinery
CY  - [New York, NY]
ER  - 
TY  - JOUR
A1  - Caruccio, Loredana
A1  - Deufemia, Vincenzo
A1  - Naumann, Felix
A1  - Polese, Giuseppe
T1  - Discovering relaxed functional dependencies based on multi-attribute dominance
JF  - IEEE transactions on knowledge and data engineering
N2  - With the advent of big data and data lakes, data are often integrated from multiple sources. Such integrated data are often of poor quality, due to inconsistencies, errors, and so forth. One way to check the quality of data is to infer functional dependencies (fds). However, in many modern applications it might be necessary to extract properties and relationships that are not captured through fds, due to the necessity to admit exceptions, or to consider similarity rather than equality of data values. Relaxed fds (rfds) have been introduced to meet these needs, but their discovery from data adds further complexity to an already complex problem, also due to the necessity of specifying similarity and validity thresholds. We propose Domino, a new discovery algorithm for rfds that exploits the concept of dominance in order to derive similarity thresholds of attribute values while inferring rfds. An experimental evaluation on real datasets demonstrates the discovery performance and the effectiveness of the proposed algorithm.
KW  - Complexity theory
KW  - Approximation algorithms
KW  - Big Data
KW  - Distributed
KW  - databases
KW  - Semantics
KW  - Lakes
KW  - Functional dependencies
KW  - data profiling
KW  - data cleansing
Y1  - 2020
U6  - https://doi.org/10.1109/TKDE.2020.2967722
SN  - 1041-4347
SN  - 1558-2191
VL  - 33
IS  - 9
SP  - 3212
EP  - 3228
PB  - Institute of Electrical and Electronics Engineers
CY  - New York, NY
ER  - 
TY  - JOUR
A1  - Koßmann, Jan
A1  - Papenbrock, Thorsten
A1  - Naumann, Felix
T1  - Data dependencies for query optimization
BT  - a survey
JF  - The VLDB journal : the international journal on very large data bases / publ. on behalf of the VLDB Endowment
N2  - Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.
KW  - Query optimization
KW  - Query execution
KW  - Data dependencies
KW  - Data profiling
KW  - Unique column combinations
KW  - Functional dependencies
KW  - Order dependencies
KW  - Inclusion dependencies
KW  - Relational data
KW  - SQL
Y1  - 2021
U6  - https://doi.org/10.1007/s00778-021-00676-3
SN  - 1066-8888
SN  - 0949-877X
VL  - 31
IS  - 1
SP  - 1
EP  - 22
PB  - Springer
CY  - Berlin ; Heidelberg ; New York
ER  - 
TY  - JOUR
A1  - Loster, Michael
A1  - Koumarelas, Ioannis
A1  - Naumann, Felix
T1  - Knowledge transfer for entity resolution with siamese neural networks
JF  - ACM journal of data and information quality
N2  - The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity-duplicates-into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. <br /> We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
KW  - Entity resolution
KW  - duplicate detection
KW  - transfer learning
KW  - neural
KW  - networks
KW  - metric learning
KW  - similarity learning
KW  - data quality
Y1  - 2021
U6  - https://doi.org/10.1145/3410157
SN  - 1936-1955
SN  - 1936-1963
VL  - 13
IS  - 1
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - BOOK
A1  - Abedjan, Ziawasch
A1  - Naumann, Felix
T1  - Advancing the discovery of unique column combinations
N2  - Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the wellknown GORDIAN algorithm and "Apriori-based" algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution HCAGORDIAN combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations.
N2  - Unique-Spaltenkombinationen sind Spaltenkombinationen einer Datenbanktabelle, die nur einzigartige Werte beinhalten. Das Finden von Unique-Spaltenkombinationen spielt sowohl eine wichtige Rolle im Bereich der Grundlagenforschung von Informationssystemen als auch in Anwendungsgebieten wie dem Datenmanagement und der Erkenntnisgewinnung aus Datenbeständen. Vorhandene Algorithmen, die dieses Problem angehen, sind entweder Brute-Force oder benötigen zu viel Hauptspeicher. Deshalb können diese Algorithmen nur auf kleine Datenmengen angewendet werden. In dieser Arbeit werden der bekannte GORDIAN-Algorithmus und Apriori-basierte Algorithmen zum Zwecke weiterer Optimierung analysiert. Wir verbessern die Apriori Algorithmen durch eine effiziente Kandidatengenerierung und Heuristikbasierten Kandidatenfilter. Eine Hybride Lösung, HCA-GORDIAN, kombiniert die Vorteile von GORDIAN und unserem neuen Algorithmus HCA, welche die bisherigen Algorithmen hinsichtlich der Effizienz in vielen Situationen übertrifft.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 51 
KW  - Apriori
KW  - eindeutig
KW  - funktionale Abhängigkeit
KW  - Schlüsselentdeckung
KW  - Data Profiling
KW  - apriori
KW  - unique
KW  - functional dependency
KW  - key discovery
KW  - data profiling
Y1  - 2011
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-53564
SN  - 978-3-86956-148-6
SN  - 1613-5652
SN  - 2191-1665
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - JOUR
A1  - Bonifati, Angela
A1  - Mior, Michael J.
A1  - Naumann, Felix
A1  - Noack, Nele Sina
T1  - How inclusive are we?
BT  - an analysis of gender diversity in database venues
JF  - SIGMOD record / Association for Computing Machinery, Special Interest Group on Management of Data
N2  - ACM SIGMOD, VLDB and other database organizations have committed to fostering an inclusive and diverse community, as do many other scientific organizations. Recently, different measures have been taken to advance these goals, especially for underrepresented groups. One possible measure is double-blind reviewing, which aims to hide gender, ethnicity, and other properties of the authors. <br /> We report the preliminary results of a gender diversity analysis of publications of the database community across several peer-reviewed venues, and also compare women's authorship percentages in both single-blind and double-blind venues along the years. We also obtained a cross comparison of the obtained results in data management with other relevant areas in Computer Science.
Y1  - 2022
U6  - https://doi.org/10.1145/3516431.3516438
SN  - 0163-5808
SN  - 1943-5835
VL  - 50
IS  - 4
SP  - 30
EP  - 35
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - GEN
A1  - Kruse, Sebastian
A1  - Kaoudi, Zoi
A1  - Quiane-Ruiz, Jorge-Arnulfo
A1  - Chawla, Sanjay
A1  - Naumann, Felix
A1  - Contreras-Rojas, Bertty
T1  - Optimizing Cross-Platform Data Movement
T2  - 2019 IEEE 35th International Conference on Data Engineering (ICDE)
N2  - Data analytics are moving beyond the limits of a single data processing platform. A cross-platform query optimizer is necessary to enable applications to run their tasks over multiple platforms efficiently and in a platform-agnostic manner. For the optimizer to be effective, it must consider data movement costs across different data processing platforms. In this paper, we present the graph-based data movement strategy used by RHEEM, our open-source cross-platform system. In particular, we (i) model the data movement problem as a new graph problem, which we prove to be NP-hard, and (ii) propose a novel graph exploration algorithm, which allows RHEEM to discover multiple hidden opportunities for cross-platform data processing.
Y1  - 2019
SN  - 978-1-5386-7474-1
SN  - 978-1-5386-7475-8
U6  - https://doi.org/10.1109/ICDE.2019.00162
SN  - 1084-4627
SN  - 1063-6382
SP  - 1642
EP  - 1645
PB  - IEEE
CY  - New York
ER  - 
TY  - JOUR
A1  - Vitagliano, Gerardo
A1  - Jiang, Lan
A1  - Naumann, Felix
T1  - Detecting layout templates in complex multiregion files
JF  - Proceedings of the VLDB Endowment
N2  - Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performances in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.
Y1  - 2022
U6  - https://doi.org/10.14778/3494124.3494145
SN  - 2150-8097
VL  - 15
IS  - 3
SP  - 646
EP  - 658
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - GEN
A1  - Kruse, Sebastian
A1  - Kaoudi, Zoi
A1  - Contreras-Rojas, Bertty
A1  - Chawla, Sanjay
A1  - Naumann, Felix
A1  - Quiané-Ruiz, Jorge-Arnulfo
T1  - RHEEMix in the data jungle
BT  - a cost-based optimizer for cross-platform systems
T2  - Zweitveröffentlichungen der Universität Potsdam : Reihe der Digital Engineering Fakultät
N2  - Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements. The optimizer allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i) a mechanism based on graph transformations to explore alternative execution strategies; (ii) a novel graph-based approach to determine efficient data movement plans among subtasks and platforms; and (iii) an efficient plan enumeration algorithm, based on a novel enumeration algebra. We extensively evaluate our optimizer under diverse real tasks. We show that our optimizer can perform tasks more than one order of magnitude faster when using multiple platforms than when using a single platform.
T3  - Zweitveröffentlichungen der Universität Potsdam : Reihe der Digital Engineering Fakultät - 22 
KW  - cross-platform
KW  - polystore
KW  - query optimization
KW  - data processing
Y1  - 2020
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-519443
IS  - 6
ER  - 
TY  - JOUR
A1  - Kruse, Sebastian
A1  - Kaoudi, Zoi
A1  - Contreras-Rojas, Bertty
A1  - Chawla, Sanjay
A1  - Naumann, Felix
A1  - Quiane-Ruiz, Jorge-Arnulfo
T1  - RHEEMix in the data jungle
BT  - a cost-based optimizer for cross-platform systems
JF  - The VLDB Journal
N2  - Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements. The optimizer allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i) a mechanism based on graph transformations to explore alternative execution strategies; (ii) a novel graph-based approach to determine efficient data movement plans among subtasks and platforms; and (iii) an efficient plan enumeration algorithm, based on a novel enumeration algebra. We extensively evaluate our optimizer under diverse real tasks. We show that our optimizer can perform tasks more than one order of magnitude faster when using multiple platforms than when using a single platform.
KW  - Cross-platform
KW  - Polystore
KW  - Query optimization
KW  - Data processing
Y1  - 2020
U6  - https://doi.org/10.1007/s00778-020-00612-x
SN  - 1066-8888
SN  - 0949-877X
VL  - 29
IS  - 6
SP  - 1287
EP  - 1310
PB  - Springer
CY  - Berlin
ER  - 
TY  - BOOK
A1  - Meinel, Christoph
A1  - Plattner, Hasso
A1  - Döllner, Jürgen Roland Friedrich
A1  - Weske, Mathias
A1  - Polze, Andreas
A1  - Hirschfeld, Robert
A1  - Naumann, Felix
A1  - Giese, Holger
A1  - Baudisch, Patrick
T1  - Proceedings of the 7th Ph.D. Retreat of the HPI Research School on Service-oriented Systems Engineering
N2  - Design and Implementation of service-oriented architectures imposes a huge number of research questions from the fields of software engineering, system analysis and modeling, adaptability, and application integration. Component orientation and web services are two approaches for design and realization of complex web-based system. Both approaches allow for dynamic application adaptation as well as integration of enterprise application. Commonly used technologies, such as J2EE and .NET, form de facto standards for the realization of complex distributed systems. Evolution of component systems has lead to web services and service-based architectures. This has been manifested in a multitude of industry standards and initiatives such as XML, WSDL UDDI, SOAP, etc. All these achievements lead to a new and promising paradigm in IT systems engineering which proposes to design complex software solutions as collaboration of contractually defined software services. Service-Oriented Systems Engineering represents a symbiosis of best practices in object-orientation, component-based development, distributed computing, and business process management. It provides integration of business and IT concerns. The annual Ph.D. Retreat of the Research School provides each member the opportunity to present his/her current state of their research and to give an outline of a prospective Ph.D. thesis. Due to the interdisciplinary structure of the Research Scholl, this technical report covers a wide range of research topics. These include but are not limited to: Self-Adaptive Service-Oriented Systems, Operating System Support for Service-Oriented Systems, Architecture and Modeling of Service-Oriented Systems, Adaptive Process Management, Services Composition and Workflow Planning, Security Engineering of Service-Based IT Systems, Quantitative Analysis and Optimization of Service-Oriented Systems, Service-Oriented Systems in 3D Computer Graphics sowie Service-Oriented Geoinformatics.
N2  - Der Entwurf und die Realisierung dienstbasierender Architekturen wirft eine Vielzahl von Forschungsfragestellungen aus den Gebieten der Softwaretechnik, der Systemmodellierung und -analyse, sowie der Adaptierbarkeit und Integration von Applikationen auf. Komponentenorientierung und WebServices sind zwei Ansätze für den effizienten Entwurf und die Realisierung komplexer Web-basierender Systeme. Sie ermöglichen die Reaktion auf wechselnde Anforderungen ebenso, wie die Integration großer komplexer Softwaresysteme. Heute übliche Technologien, wie J2EE und .NET, sind de facto Standards für die Entwicklung großer verteilter Systeme. Die Evolution solcher Komponentensysteme führt über WebServices zu dienstbasierenden Architekturen. Dies manifestiert sich in einer Vielzahl von Industriestandards und Initiativen wie XML, WSDL, UDDI, SOAP. All diese Schritte führen letztlich zu einem neuen, vielversprechenden Paradigma für IT Systeme, nach dem komplexe Softwarelösungen durch die Integration vertraglich vereinbarter Software-Dienste aufgebaut werden sollen. "Service-Oriented Systems Engineering" repräsentiert die Symbiose bewährter Praktiken aus den Gebieten der Objektorientierung, der Komponentenprogrammierung, des verteilten Rechnen sowie der Geschäftsprozesse und berücksichtigt auch die Integration von Geschäftsanliegen und Informationstechnologien. Die Klausurtagung des Forschungskollegs "Service-oriented Systems Engineering" findet einmal jährlich statt und bietet allen Kollegiaten die Möglichkeit den Stand ihrer aktuellen Forschung darzulegen. Bedingt durch die Querschnittstruktur des Kollegs deckt dieser Bericht ein große Bandbreite aktueller Forschungsthemen ab. Dazu zählen unter anderem Self-Adaptive Service-Oriented Systems, Operating System Support for Service-Oriented Systems, Architecture and Modeling of Service-Oriented Systems, Adaptive Process Management, Services Composition and Workflow Planning, Security Engineering of Service-Based IT Systems, Quantitative Analysis and Optimization of Service-Oriented Systems, Service-Oriented Systems in 3D Computer Graphics sowie Service-Oriented Geoinformatics.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 83 
KW  - Hasso-Plattner-Institut
KW  - Forschungskolleg
KW  - Klausurtagung
KW  - Service-oriented Systems Engineering
KW  - Hasso Plattner Institute
KW  - Research School
KW  - Ph.D. Retreat
KW  - Service-oriented Systems Engineering
Y1  - 2014
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus-63490
SN  - 978-3-86956-273-5
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - BOOK
A1  - Meinel, Christoph
A1  - Döllner, Jürgen Roland Friedrich
A1  - Weske, Mathias
A1  - Polze, Andreas
A1  - Hirschfeld, Robert
A1  - Naumann, Felix
A1  - Giese, Holger
A1  - Baudisch, Patrick
A1  - Friedrich, Tobias
A1  - Böttinger, Erwin
A1  - Lippert, Christoph
A1  - Dörr, Christian
A1  - Lehmann, Anja
A1  - Renard, Bernhard
A1  - Rabl, Tilmann
A1  - Uebernickel, Falk
A1  - Arnrich, Bert
A1  - Hölzle, Katharina
T1  - Proceedings of the HPI Research School on Service-oriented Systems Engineering 2020 Fall Retreat
N2  - Design and Implementation of service-oriented architectures imposes a huge number of research questions from the fields of software engineering, system analysis and modeling, adaptability, and application integration. Component orientation and web services are two approaches for design and realization of complex web-based system. Both approaches allow for dynamic application adaptation as well as integration of enterprise application.

Service-Oriented Systems Engineering represents a symbiosis of best practices in object-orientation, component-based development, distributed computing, and business process management. It provides integration of business and IT concerns.

The annual Ph.D. Retreat of the Research School provides each member the opportunity to present his/her current state of their research and to give an outline of a prospective Ph.D. thesis. Due to the interdisciplinary structure of the research school, this technical report covers a wide range of topics. These include but are not limited to: Human Computer Interaction and Computer Vision as Service; Service-oriented Geovisualization Systems; Algorithm Engineering for Service-oriented Systems; Modeling and Verification of Self-adaptive Service-oriented Systems; Tools and Methods for Software Engineering in Service-oriented Systems; Security Engineering of Service-based IT Systems; Service-oriented Information Systems; Evolutionary Transition of Enterprise Applications to Service Orientation; Operating System Abstractions for Service-oriented Computing; and Services Specification, Composition, and Enactment.
N2  - Der Entwurf und die Realisierung dienstbasierender Architekturen wirft eine Vielzahl von Forschungsfragestellungen aus den Gebieten der Softwaretechnik, der Systemmodellierung und -analyse, sowie der Adaptierbarkeit und Integration von Applikationen auf. Komponentenorientierung und WebServices sind zwei Ansätze für den effizienten Entwurf und die Realisierung komplexer Web-basierender Systeme. Sie ermöglichen die Reaktion auf wechselnde Anforderungen ebenso, wie die Integration großer komplexer Softwaresysteme.

"Service-Oriented Systems Engineering" repräsentiert die Symbiose bewährter Praktiken aus den Gebieten der Objektorientierung, der Komponentenprogrammierung, des verteilten Rechnen sowie der Geschäftsprozesse und berücksichtigt auch die Integration von Geschäftsanliegen und Informationstechnologien.

Die Klausurtagung des Forschungskollegs "Service-oriented Systems Engineering" findet einmal jährlich statt und bietet allen Kollegiaten die Möglichkeit den Stand ihrer aktuellen Forschung darzulegen. Bedingt durch die Querschnittstruktur des Kollegs deckt dieser Bericht ein weites Spektrum aktueller Forschungsthemen ab. Dazu zählen unter anderem Human Computer Interaction and Computer Vision as Service; Service-oriented Geovisualization Systems; Algorithm Engineering for Service-oriented Systems; Modeling and Verification of Self-adaptive Service-oriented Systems; Tools and Methods for Software Engineering in Service-oriented Systems; Security Engineering of Service-based IT Systems; Service-oriented Information Systems; Evolutionary Transition of Enterprise Applications to Service Orientation; Operating System Abstractions for Service-oriented Computing; sowie Services Specification, Composition, and Enactment.
T3  - Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam - 138 
KW  - Hasso Plattner Institute
KW  - research school
KW  - Ph.D. retreat
KW  - service-oriented systems engineering
KW  - Hasso-Plattner-Institut
KW  - Forschungskolleg
KW  - Klausurtagung
KW  - Service-oriented Systems Engineering
Y1  - 2021
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-504132
SN  - 978-3-86956-513-2
SN  - 1613-5652
SN  - 2191-1665
IS  - 138
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  -