TY - JOUR
A1 - Kossmann, Jan
A1 - Halfpap, Stefan
A1 - Jankrift, Marcel
A1 - Schlosser, Rainer
T1 - Magic mirror in my hand, which is the best in the land?
BT - an experimental evaluation of index selection algorithms
JF - Proceedings of the VLDB Endowment
N2 - Indexes are essential for the efficient processing of database workloads. Proposed solutions for the relevant and challenging index selection problem range from metadata-based simple heuristics, over sophisticated multi-step algorithms, to approaches that yield optimal results. The main challenges are (i) to accurately determine the effect of an index on the workload cost while considering the interaction of indexes and (ii) a large number of possible combinations resulting from workloads containing many queries and massive schemata with possibly thousands of attributes.
In this work, we describe and analyze eight index selection algorithms that are based on different concepts and compare them along different dimensions, such as solution quality, runtime, multi-column support, solution granularity, and complexity. In particular, we analyze the solutions of the algorithms for the challenging analytical Join Order, TPC-H, and TPC-DS benchmarks. Afterward, we assess strengths and weaknesses, infer insights for index selection in general and each approach individually, before we give recommendations on when to use which approach.
Y1 - 2020
U6 - https://doi.org/10.14778/3407790.3407832
SN - 2150-8097
VL - 13
IS - 11
SP - 2382
EP - 2395
PB - Association for Computing Machinery
CY - New York
ER -
TY - JOUR
A1 - Koumarelas, Ioannis
A1 - Jiang, Lan
A1 - Naumann, Felix
T1 - Data preparation for duplicate detection
JF - Journal of data and information quality : (JDIQ)
N2 - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection.
Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR.
KW - data preparation
KW - data wrangling
KW - record linkage
KW - duplicate detection
KW - similarity measures
Y1 - 2020
U6 - https://doi.org/10.1145/3377878
SN - 1936-1955
SN - 1936-1963
VL - 12
IS - 3
PB - Association for Computing Machinery
CY - New York
ER -
TY - JOUR
A1 - Kossmann, Jan
A1 - Schlosser, Rainer
T1 - Self-driving database systems
BT - a conceptual approach
JF - Distributed and parallel databases
N2 - Challenges for self-driving database systems, which tune their physical design and configuration autonomously, are manifold: Such systems have to anticipate future workloads, find robust configurations efficiently, and incorporate knowledge gained by previous actions into later decisions. We present a component-based framework for self-driving database systems that enables database integration and development of self-managing functionality with low overhead by relying on separation of concerns. By keeping the components of the framework reusable and exchangeable, experiments are simplified, which promotes further research in that area. Moreover, to optimize multiple mutually dependent features, e.g., index selection and compression configurations, we propose a linear programming (LP) based algorithm to derive an efficient tuning order automatically. Afterwards, we demonstrate the applicability and scalability of our approach with reproducible examples.
KW - database systems
KW - self-driving
KW - recursive tuning
KW - workload prediction
KW - robustness
Y1 - 2020
U6 - https://doi.org/10.1007/s10619-020-07288-w
SN - 0926-8782
SN - 1573-7578
VL - 38
IS - 4
SP - 795
EP - 817
PB - Springer
CY - Dordrecht
ER -
TY - JOUR
A1 - Chauhan, Ankit
A1 - Friedrich, Tobias
A1 - Rothenberger, Ralf
T1 - Greed is good for deterministic scale-free networks
JF - Algorithmica : an international journal in computer science
N2 - Large real-world networks typically follow a power-law degree distribution. To study such networks, numerous random graph models have been proposed. However, real-world networks are not drawn at random. Therefore, Brach et al. (27th symposium on discrete algorithms (SODA), pp 1306-1325, 2016) introduced two natural deterministic conditions: (1) a power-law upper bound on the degree distribution (PLB-U) and (2) power-law neighborhoods, that is, the degree distribution of neighbors of each vertex is also upper bounded by a power law (PLB-N). They showed that many real-world networks satisfy both properties and exploit them to design faster algorithms for a number of classical graph problems. We complement their work by showing that some well-studied random graph models exhibit both of the mentioned PLB properties. PLB-U and PLB-N hold with high probability for Chung-Lu Random Graphs and Geometric Inhomogeneous Random Graphs and almost surely for Hyperbolic Random Graphs. As a consequence, all results of Brach et al. also hold with high probability or almost surely for those random graph classes. In the second part we study three classical NP-hard optimization problems on PLB networks. It is known that on general graphs with maximum degree Delta, a greedy algorithm, which chooses nodes in the order of their degree, only achieves a Omega (ln Delta)-approximation forMinimum Vertex Cover and Minimum Dominating Set, and a Omega(Delta)-approximation forMaximum Independent Set. We prove that the PLB-U property with beta>2 suffices for the greedy approach to achieve a constant-factor approximation for all three problems. We also show that these problems are APX-hard even if PLB-U, PLB-N, and an additional power-law lower bound on the degree distribution hold. Hence, a PTAS cannot be expected unless P = NP. Furthermore, we prove that all three problems are in MAX SNP if the PLB-U property holds.
KW - random graphs
KW - deterministic properties
KW - power-law
KW - approximation
KW - APX-hardness
Y1 - 2020
U6 - https://doi.org/10.1007/s00453-020-00729-z
SN - 0178-4617
SN - 1432-0541
VL - 82
IS - 11
SP - 3338
EP - 3389
PB - Springer
CY - New York
ER -
TY - JOUR
A1 - Barkowsky, Matthias
A1 - Giese, Holger
T1 - Hybrid search plan generation for generalized graph pattern matching
JF - Journal of logical and algebraic methods in programming
N2 - In recent years, the increased interest in application areas such as social networks has resulted in a rising popularity of graph-based approaches for storing and processing large amounts of interconnected data. To extract useful information from the growing network structures, efficient querying techniques are required.
In this paper, we propose an approach for graph pattern matching that allows a uniform handling of arbitrary constraints over the query vertices. Our technique builds on a previously introduced matching algorithm, which takes concrete host graph information into account to dynamically adapt the employed search plan during query execution. The dynamic algorithm is combined with an existing static approach for search plan generation, resulting in a hybrid technique which we further extend by a more sophisticated handling of filtering effects caused by constraint checks. We evaluate the presented concepts empirically based on an implementation for our graph pattern matching tool, the Story Diagram Interpreter, with queries and data provided by the LDBC Social Network Benchmark. Our results suggest that the hybrid technique may improve search efficiency in several cases, and rarely reduces efficiency.
KW - graph pattern matching
KW - search plan generation
Y1 - 2020
U6 - https://doi.org/10.1016/j.jlamp.2020.100563
SN - 2352-2208
VL - 114
PB - Elsevier
CY - New York
ER -
TY - JOUR
A1 - Hacker, Philipp
A1 - Krestel, Ralf
A1 - Grundmann, Stefan
A1 - Naumann, Felix
T1 - Explainable AI under contract and tort law
BT - legal incentives and technical challenges
JF - Artificial intelligence and law
N2 - This paper shows that the law, in subtle ways, may set hitherto unrecognized incentives for the adoption of explainable machine learning applications. In doing so, we make two novel contributions. First, on the legal side, we show that to avoid liability, professional actors, such as doctors and managers, may soon be legally compelled to use explainable ML models. We argue that the importance of explainability reaches far beyond data protection law, and crucially influences questions of contractual and tort liability for the use of ML models. To this effect, we conduct two legal case studies, in medical and corporate merger applications of ML. As a second contribution, we discuss the (legally required) trade-off between accuracy and explainability and demonstrate the effect in a technical case study in the context of spam classification.
KW - explainability
KW - explainable AI
KW - interpretable machine learning
KW - contract
KW - law
KW - tort law
KW - explainability-accuracy trade-off
KW - medical malpractice
KW - corporate takeovers
Y1 - 2020
U6 - https://doi.org/10.1007/s10506-020-09260-6
SN - 0924-8463
SN - 1572-8382
VL - 28
IS - 4
SP - 415
EP - 439
PB - Springer
CY - Dordrecht
ER -
TY - JOUR
A1 - Lambers, Leen
A1 - Weber, Jens
T1 - Preface to the special issue on the 11th International Conference on Graph Transformation
JF - Journal of Logical and Algebraic Methods in Programming
N2 - This special issue contains extended versions of four selected papers from the 11th International Conference on Graph Transformation (ICGT 2018). The articles cover a tool for computing core graphs via SAT/SMT solvers (graph language definition), graph transformation through graph surfing in reaction systems (a new graph transformation formalism), the essence and initiality of conflicts in M-adhesive transformation systems, and a calculus of concurrent graph-rewriting processes (theory on conflicts and parallel independence).
KW - graph transformation
KW - graph languages
KW - conflicts and dependencies in
KW - concurrent graph rewriting
Y1 - 2020
U6 - https://doi.org/10.1016/j.jlamp.2020.100525
SN - 2352-2208
VL - 112
PB - Elsevier
CY - Amsterdam
ER -
TY - JOUR
A1 - Ghahremani, Sona
A1 - Giese, Holger
T1 - Evaluation of self-healing systems
BT - An analysis of the state-of-the-art and required improvements
JF - Computers
N2 - Evaluating the performance of self-adaptive systems is challenging due to their interactions with often highly dynamic environments. In the specific case of self-healing systems, the performance evaluations of self-healing approaches and their parameter tuning rely on the considered characteristics of failure occurrences and the resulting interactions with the self-healing actions. In this paper, we first study the state-of-the-art for evaluating the performances of self-healing systems by means of a systematic literature review. We provide a classification of different input types for such systems and analyse the limitations of each input type. A main finding is that the employed inputs are often not sophisticated regarding the considered characteristics for failure occurrences. To further study the impact of the identified limitations, we present experiments demonstrating that wrong assumptions regarding the characteristics of the failure occurrences can result in large performance prediction errors, disadvantageous design-time decisions concerning the selection of alternative self-healing approaches, and disadvantageous deployment-time decisions concerning parameter tuning. Furthermore, the experiments indicate that employing multiple alternative input characteristics can help with reducing the risk of premature disadvantageous design-time decisions.
KW - self-healing
KW - failure model
KW - performance
KW - simulation
KW - evaluation
Y1 - 2020
U6 - https://doi.org/10.3390/computers9010016
SN - 2073-431X
VL - 9
IS - 1
PB - MDPI
CY - Basel
ER -
TY - JOUR
A1 - Torkura, Kennedy A.
A1 - Sukmana, Muhammad Ihsan Haikal
A1 - Cheng, Feng
A1 - Meinel, Christoph
T1 - CloudStrike
BT - chaos engineering for security and resiliency in cloud infrastructure
JF - IEEE access : practical research, open solutions
N2 - Most cyber-attacks and data breaches in cloud infrastructure are due to human errors and misconfiguration vulnerabilities. Cloud customer-centric tools are imperative for mitigating these issues, however existing cloud security models are largely unable to tackle these security challenges. Therefore, novel security mechanisms are imperative, we propose Risk-driven Fault Injection (RDFI) techniques to address these challenges. RDFI applies the principles of chaos engineering to cloud security and leverages feedback loops to execute, monitor, analyze and plan security fault injection campaigns, based on a knowledge-base. The knowledge-base consists of fault models designed from secure baselines, cloud security best practices and observations derived during iterative fault injection campaigns. These observations are helpful for identifying vulnerabilities while verifying the correctness of security attributes (integrity, confidentiality and availability). Furthermore, RDFI proactively supports risk analysis and security hardening efforts by sharing security information with security mechanisms. We have designed and implemented the RDFI strategies including various chaos engineering algorithms as a software tool: CloudStrike. Several evaluations have been conducted with CloudStrike against infrastructure deployed on two major public cloud infrastructure: Amazon Web Services and Google Cloud Platform. The time performance linearly increases, proportional to increasing attack rates. Also, the analysis of vulnerabilities detected via security fault injection has been used to harden the security of cloud resources to demonstrate the effectiveness of the security information provided by CloudStrike. Therefore, we opine that our approaches are suitable for overcoming contemporary cloud security issues.
KW - cloud security
KW - security chaos engineering
KW - resilient architectures
KW - security risk assessment
Y1 - 2020
U6 - https://doi.org/10.1109/ACCESS.2020.3007338
SN - 2169-3536
VL - 8
SP - 123044
EP - 123060
PB - Institute of Electrical and Electronics EngineersĀ
CY - Piscataway
ER -
TY - JOUR
A1 - Kaitoua, Abdulrahman
A1 - Rabl, Tilmann
A1 - Markl, Volker
T1 - A distributed data exchange engine for polystores
JF - Information technology : methods and applications of informatics and information technology
JF - Information technology : Methoden und innovative Anwendungen der Informatik und Informationstechnik
N2 - There is an increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generating new information and creating services of higher quality. A central issue in working with heterogeneous sources is data migration: In order to share and process data in different engines, resource intensive and complex movements and transformations between computing engines, services, and stores are necessary.
Muses is a distributed, high-performance data migration engine that is able to interconnect distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines' instances in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall improvement of 30 % for pipelining jobs across multiple engines, even when we count the overhead of Muses in the execution time. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.
KW - distributed systems
KW - data migration
KW - data transformation
KW - big data
KW - engine
KW - data integration
Y1 - 2020
U6 - https://doi.org/10.1515/itit-2019-0037
SN - 1611-2776
SN - 2196-7032
VL - 62
IS - 3-4
SP - 145
EP - 156
PB - De Gruyter
CY - Berlin
ER -