TY - JOUR
A1 - Peng, Junjie
A1 - Liu, Danxu
A1 - Wang, Yingtao
A1 - Zeng, Ying
A1 - Cheng, Feng
A1 - Zhang, Wenqiang
T1 - Weight-based strategy for an I/O-intensive application at a cloud data center
JF - Concurrency and computation : practice & experience
N2 - Applications with different characteristics in the cloud may have different resources preferences. However, traditional resource allocation and scheduling strategies rarely take into account the characteristics of applications. Considering that an I/O-intensive application is a typical type of application and that frequent I/O accesses, especially small files randomly accessing the disk, may lead to an inefficient use of resources and reduce the quality of service (QoS) of applications, a weight allocation strategy is proposed based on the available resources that a physical server can provide as well as the characteristics of the applications. Using the weight obtained, a resource allocation and scheduling strategy is presented based on the specific application characteristics in the data center. Extensive experiments show that the strategy is correct and can guarantee a high concurrency of I/O per second (IOPS) in a cloud data center with high QoS. Additionally, the strategy can efficiently improve the utilization of the disk and resources of the data center without affecting the service quality of applications.
KW - IOPS
KW - process scheduling
KW - random I
KW - O
KW - small files
KW - weight
Y1 - 2018
U6 - https://doi.org/10.1002/cpe.4648
SN - 1532-0626
SN - 1532-0634
VL - 30
IS - 19
PB - Wiley
CY - Hoboken
ER -
TY - JOUR
A1 - Kossmann, Jan
A1 - Halfpap, Stefan
A1 - Jankrift, Marcel
A1 - Schlosser, Rainer
T1 - Magic mirror in my hand, which is the best in the land?
BT - an experimental evaluation of index selection algorithms
JF - Proceedings of the VLDB Endowment
N2 - Indexes are essential for the efficient processing of database workloads. Proposed solutions for the relevant and challenging index selection problem range from metadata-based simple heuristics, over sophisticated multi-step algorithms, to approaches that yield optimal results. The main challenges are (i) to accurately determine the effect of an index on the workload cost while considering the interaction of indexes and (ii) a large number of possible combinations resulting from workloads containing many queries and massive schemata with possibly thousands of attributes.
In this work, we describe and analyze eight index selection algorithms that are based on different concepts and compare them along different dimensions, such as solution quality, runtime, multi-column support, solution granularity, and complexity. In particular, we analyze the solutions of the algorithms for the challenging analytical Join Order, TPC-H, and TPC-DS benchmarks. Afterward, we assess strengths and weaknesses, infer insights for index selection in general and each approach individually, before we give recommendations on when to use which approach.
Y1 - 2020
U6 - https://doi.org/10.14778/3407790.3407832
SN - 2150-8097
VL - 13
IS - 11
SP - 2382
EP - 2395
PB - Association for Computing Machinery
CY - New York
ER -
TY - JOUR
A1 - Mattis, Toni
A1 - Beckmann, Tom
A1 - Rein, Patrick
A1 - Hirschfeld, Robert
T1 - First-class concepts
BT - Reified architectural knowledge beyond dominant decompositions
JF - Journal of object technology : JOT / ETH Zürich, Department of Computer Science
N2 - Ideally, programs are partitioned into independently maintainable and understandable modules. As a system grows, its architecture gradually loses the capability to accommodate new concepts in a modular way. While refactoring is expensive and not always possible, and the programming language might lack dedicated primary language constructs to express certain cross-cutting concerns, programmers are still able to explain and delineate convoluted concepts through secondary means: code comments, use of whitespace and arrangement of code, documentation, or communicating tacit knowledge.
Secondary constructs are easy to change and provide high flexibility in communicating cross-cutting concerns and other concepts among programmers. However, such secondary constructs usually have no reified representation that can be explored and manipulated as first-class entities through the programming environment.
In this exploratory work, we discuss novel ways to express a wide range of concepts, including cross-cutting concerns, patterns, and lifecycle artifacts independently of the dominant decomposition imposed by an existing architecture. We propose the representation of concepts as first-class objects inside the programming environment that retain the capability to change as easily as code comments. We explore new tools that allow programmers to view, navigate, and change programs based on conceptual perspectives. In a small case study, we demonstrate how such views can be created and how the programming experience changes from draining programmers' attention by stretching it across multiple modules toward focusing it on cohesively presented concepts. Our designs are geared toward facilitating multiple secondary perspectives on a system to co-exist in symbiosis with the original architecture, hence making it easier to explore, understand, and explain complex contexts and narratives that are hard or impossible to express using primary modularity constructs.
KW - software engineering
KW - modularity
KW - exploratory programming
KW - program
KW - comprehension
KW - remodularization
KW - architecture recovery
Y1 - 2022
U6 - https://doi.org/10.5381/jot.2022.21.2.a6
SN - 1660-1769
VL - 21
IS - 2
SP - 1
EP - 15
PB - ETH Zürich, Department of Computer Science
CY - Zürich
ER -
TY - JOUR
A1 - Koumarelas, Ioannis
A1 - Jiang, Lan
A1 - Naumann, Felix
T1 - Data preparation for duplicate detection
JF - Journal of data and information quality : (JDIQ)
N2 - Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection.
Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR.
KW - data preparation
KW - data wrangling
KW - record linkage
KW - duplicate detection
KW - similarity measures
Y1 - 2020
U6 - https://doi.org/10.1145/3377878
SN - 1936-1955
SN - 1936-1963
VL - 12
IS - 3
PB - Association for Computing Machinery
CY - New York
ER -
TY - JOUR
A1 - Kossmann, Jan
A1 - Schlosser, Rainer
T1 - Self-driving database systems
BT - a conceptual approach
JF - Distributed and parallel databases
N2 - Challenges for self-driving database systems, which tune their physical design and configuration autonomously, are manifold: Such systems have to anticipate future workloads, find robust configurations efficiently, and incorporate knowledge gained by previous actions into later decisions. We present a component-based framework for self-driving database systems that enables database integration and development of self-managing functionality with low overhead by relying on separation of concerns. By keeping the components of the framework reusable and exchangeable, experiments are simplified, which promotes further research in that area. Moreover, to optimize multiple mutually dependent features, e.g., index selection and compression configurations, we propose a linear programming (LP) based algorithm to derive an efficient tuning order automatically. Afterwards, we demonstrate the applicability and scalability of our approach with reproducible examples.
KW - database systems
KW - self-driving
KW - recursive tuning
KW - workload prediction
KW - robustness
Y1 - 2020
U6 - https://doi.org/10.1007/s10619-020-07288-w
SN - 0926-8782
SN - 1573-7578
VL - 38
IS - 4
SP - 795
EP - 817
PB - Springer
CY - Dordrecht
ER -
TY - JOUR
A1 - Schneider, Johannes
A1 - Wenig, Phillip
A1 - Papenbrock, Thorsten
T1 - Distributed detection of sequential anomalies in univariate time series
JF - The VLDB journal : the international journal on very large data bases
N2 - The automated detection of sequential anomalies in time series is an essential task for many applications, such as the monitoring of technical systems, fraud detection in high-frequency trading, or the early detection of disease symptoms. All these applications require the detection to find all sequential anomalies possibly fast on potentially very large time series. In other words, the detection needs to be effective, efficient and scalable w.r.t. the input size. Series2Graph is an effective solution based on graph embeddings that are robust against re-occurring anomalies and can discover sequential anomalies of arbitrary length and works without training data. Yet, Series2Graph is no t scalable due to its single-threaded approach; it cannot, in particular, process arbitrarily large sequences due to the memory constraints of a single machine. In this paper, we propose our distributed anomaly detection system, short DADS, which is an efficient and scalable adaptation of Series2Graph. Based on the actor programming model, DADS distributes the input time sequence, intermediate state and the computation to all processors of a cluster in a way that minimizes communication costs and synchronization barriers. Our evaluation shows that DADS is orders of magnitude faster than S2G, scales almost linearly with the number of processors in the cluster and can process much larger input sequences due to its scale-out property.
KW - Distributed programming
KW - Sequential anomaly
KW - Actor model
KW - Data mining
KW - Time series
Y1 - 2021
U6 - https://doi.org/10.1007/s00778-021-00657-6
SN - 1066-8888
SN - 0949-877X
VL - 30
IS - 4
SP - 579
EP - 602
PB - Springer
CY - Berlin
ER -
TY - BOOK
A1 - Kunze, Matthias
A1 - Weske, Mathias
T1 - Behavioural Models
BT - From Modelling Finite Automata to Analysing Business Processes
N2 - This textbook introduces the basis for modelling and analysing discrete dynamic systems, such as computer programmes, soft- and hardware systems, and business processes. The underlying concepts are introduced and concrete modelling techniques are described, such as finite automata, state machines, and Petri nets. The concepts are related to concrete application scenarios, among which business processes play a prominent role.
The book consists of three parts, the first of which addresses the foundations of behavioural modelling. After a general introduction to modelling, it introduces transition systems as a basic formalism for representing the behaviour of discrete dynamic systems. This section also discusses causality, a fundamental concept for modelling and reasoning about behaviour. In turn, Part II forms the heart of the book and is devoted to models of behaviour. It details both sequential and concurrent systems and introduces finite automata, state machines and several different types of Petri nets. One chapter is especially devoted to business process models, workflow patterns and BPMN, the industry standard for modelling business processes. Lastly, Part III investigates how the behaviour of systems can be analysed. To this end, it introduces readers to the concept of state spaces. Further chapters cover the comparison of behaviour and the formal analysis and verification of behavioural models.
The book was written for students of computer science and software engineering, as well as for programmers and system analysts interested in the behaviour of the systems they work on. It takes readers on a journey from the fundamentals of behavioural modelling to advanced techniques for modelling and analysing sequential and concurrent systems, and thus provides them a deep understanding of the concepts and techniques introduced and how they can be applied to concrete application scenarios.
Y1 - 2016
SN - 978-3-319-44958-6
PB - Springer
CY - Cham
ER -
TY - GEN
A1 - Hesse, Guenter
A1 - Matthies, Christoph
A1 - Sinzig, Werner
A1 - Uflacker, Matthias
T1 - Adding Value by Combining Business and Sensor Data
BT - an Industry 4.0 Use Case
T2 - Database Systems for Advanced Applications
N2 - Industry 4.0 and the Internet of Things are recent developments that have lead to the creation of new kinds of manufacturing data. Linking this new kind of sensor data to traditional business information is crucial for enterprises to take advantage of the data’s full potential. In this paper, we present a demo which allows experiencing this data integration, both vertically between technical and business contexts and horizontally along the value chain. The tool simulates a manufacturing company, continuously producing both business and sensor data, and supports issuing ad-hoc queries that answer specific questions related to the business. In order to adapt to different environments, users can configure sensor characteristics to their needs.
KW - Industry 4.0
KW - Internet of Things
KW - Data integration
Y1 - 2019
SN - 978-3-030-18590-9
SN - 978-3-030-18589-3
U6 - https://doi.org/10.1007/978-3-030-18590-9_80
SN - 0302-9743
SN - 1611-3349
VL - 11448
SP - 528
EP - 532
PB - Springer
CY - Cham
ER -
TY - JOUR
A1 - Schmidl, Sebastian
A1 - Papenbrock, Thorsten
T1 - Efficient distributed discovery of bidirectional order dependencies
JF - The VLDB journal
N2 - Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query optimization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets.
KW - Bidirectional order dependencies
KW - Distributed computing
KW - Actor
KW - programming
KW - Parallelization
KW - Data profiling
KW - Dependency discovery
Y1 - 2021
U6 - https://doi.org/10.1007/s00778-021-00683-4
SN - 1066-8888
SN - 0949-877X
VL - 31
IS - 1
SP - 49
EP - 74
PB - Springer
CY - Berlin ; Heidelberg ; New York
ER -
TY - JOUR
A1 - Chauhan, Ankit
A1 - Friedrich, Tobias
A1 - Rothenberger, Ralf
T1 - Greed is good for deterministic scale-free networks
JF - Algorithmica : an international journal in computer science
N2 - Large real-world networks typically follow a power-law degree distribution. To study such networks, numerous random graph models have been proposed. However, real-world networks are not drawn at random. Therefore, Brach et al. (27th symposium on discrete algorithms (SODA), pp 1306-1325, 2016) introduced two natural deterministic conditions: (1) a power-law upper bound on the degree distribution (PLB-U) and (2) power-law neighborhoods, that is, the degree distribution of neighbors of each vertex is also upper bounded by a power law (PLB-N). They showed that many real-world networks satisfy both properties and exploit them to design faster algorithms for a number of classical graph problems. We complement their work by showing that some well-studied random graph models exhibit both of the mentioned PLB properties. PLB-U and PLB-N hold with high probability for Chung-Lu Random Graphs and Geometric Inhomogeneous Random Graphs and almost surely for Hyperbolic Random Graphs. As a consequence, all results of Brach et al. also hold with high probability or almost surely for those random graph classes. In the second part we study three classical NP-hard optimization problems on PLB networks. It is known that on general graphs with maximum degree Delta, a greedy algorithm, which chooses nodes in the order of their degree, only achieves a Omega (ln Delta)-approximation forMinimum Vertex Cover and Minimum Dominating Set, and a Omega(Delta)-approximation forMaximum Independent Set. We prove that the PLB-U property with beta>2 suffices for the greedy approach to achieve a constant-factor approximation for all three problems. We also show that these problems are APX-hard even if PLB-U, PLB-N, and an additional power-law lower bound on the degree distribution hold. Hence, a PTAS cannot be expected unless P = NP. Furthermore, we prove that all three problems are in MAX SNP if the PLB-U property holds.
KW - random graphs
KW - deterministic properties
KW - power-law
KW - approximation
KW - APX-hardness
Y1 - 2020
U6 - https://doi.org/10.1007/s00453-020-00729-z
SN - 0178-4617
SN - 1432-0541
VL - 82
IS - 11
SP - 3338
EP - 3389
PB - Springer
CY - New York
ER -