004 Datenverarbeitung; Informatik
Refine
Year of publication
Document Type
- Article (40)
- Monograph/Edited Volume (39)
- Doctoral Thesis (33)
- Postprint (5)
- Conference Proceeding (2)
- Part of a Book (1)
- Other (1)
Keywords
- machine learning (8)
- Smalltalk (5)
- maschinelles Lernen (5)
- cyber-physical systems (4)
- digital education (4)
- openHPI (4)
- probabilistic timed systems (4)
- qualitative Analyse (4)
- qualitative analysis (4)
- quantitative Analyse (4)
Institute
- Hasso-Plattner-Institut für Digital Engineering GmbH (121) (remove)
The transversal hypergraph problem asks to enumerate the minimal hitting sets of a hypergraph. If the solutions have bounded size, Eiter and Gottlob [SICOMP'95] gave an algorithm running in output-polynomial time, but whose space requirement also scales with the output. We improve this to polynomial delay and space. Central to our approach is the extension problem, deciding for a set X of vertices whether it is contained in any minimal hitting set. We show that this is one of the first natural problems to be W[3]-complete. We give an algorithm for the extension problem running in time O(m(vertical bar X vertical bar+1) n) and prove a SETH-lower bound showing that this is close to optimal. We apply our enumeration method to the discovery problem of minimal unique column combinations from data profiling. Our empirical evaluation suggests that the algorithm outperforms its worst-case guarantees on hypergraphs stemming from real-world databases.
Viper
(2021)
Key-value stores (KVSs) have found wide application in modern software systems. For persistence, their data resides in slow secondary storage, which requires KVSs to employ various techniques to increase their read and write performance from and to the underlying medium. Emerging persistent memory (PMem) technologies offer data persistence at close-to-DRAM speed, making them a promising alternative to classical disk-based storage. However, simply drop-in replacing existing storage with PMem does not yield good results, as block-based access behaves differently in PMem than on disk and ignores PMem's byte addressability, layout, and unique performance characteristics. In this paper, we propose three PMem-specific access patterns and implement them in a hybrid PMem-DRAM KVS called Viper. We employ a DRAM-based hash index and a PMem-aware storage layout to utilize the random-write speed of DRAM and efficient sequential-write performance PMem. Our evaluation shows that Viper significantly outperforms existing KVSs for core KVS operations while providing full data persistence. Moreover, Viper outperforms existing PMem-only, hybrid, and disk-based KVSs by 4-18x for write workloads, while matching or surpassing their get performance.
Invention
(2023)
This entry addresses invention from five different perspectives: (i) definition of the term, (ii) mechanisms underlying invention processes, (iii) (pre-)history of human inventions, (iv) intellectual property protection vs open innovation, and (v) case studies of great inventors. Regarding the definition, an invention is the outcome of a creative process taking place within a technological milieu, which is recognized as successful in terms of its effectiveness as an original technology. In the process of invention, a technological possibility becomes realized. Inventions are distinct from either discovery or innovation. In human creative processes, seven mechanisms of invention can be observed, yielding characteristic outcomes: (1) basic inventions, (2) invention branches, (3) invention combinations, (4) invention toolkits, (5) invention exaptations, (6) invention values, and (7) game-changing inventions. The development of humanity has been strongly shaped by inventions ever since early stone tools and the conception of agriculture. An “explosion of creativity” has been associated with Homo sapiens, and inventions in all fields of human endeavor have followed suit, engendering an exponential growth of cumulative culture. This culture development emerges essentially through a reuse of previous inventions, their revision, amendment and rededication. In sociocultural terms, humans have increasingly regulated processes of invention and invention-reuse through concepts such as intellectual property, patents, open innovation and licensing methods. Finally, three case studies of great inventors are considered: Edison, Marconi, and Montessori, next to a discussion of human invention processes as collaborative endeavors.
CovRadar
(2022)
The ongoing pandemic caused by SARS-CoV-2 emphasizes the importance of genomic surveillance to understand the evolution of the virus, to monitor the viral population, and plan epidemiological responses. Detailed analysis, easy visualization and intuitive filtering of the latest viral sequences are powerful for this purpose. We present CovRadar, a tool for genomic surveillance of the SARS-CoV-2 Spike protein. CovRadar consists of an analytical pipeline and a web application that enable the analysis and visualization of hundreds of thousand sequences. First, CovRadar extracts the regions of interest using local alignment, then builds a multiple sequence alignment, infers variants and consensus and finally presents the results in an interactive app, making accessing and reporting simple, flexible and fast.
Graphs play an important role in many areas of Computer Science. In particular, our work is motivated by model-driven software development and by graph databases. For this reason, it is very important to have the means to express and to reason about the properties that a given graph may satisfy. With this aim, in this paper we present a visual logic that allows us to describe graph properties, including navigational properties, i.e., properties about the paths in a graph. The logic is equipped with a deductive tableau method that we have proved to be sound and complete.
As resources are valuable assets, organizations have to decide which resources to allocate to business process tasks in a way that the process is executed not only effectively but also efficiently. Traditional role-based resource allocation leads to effective process executions, since each task is performed by a resource that has the required skills and competencies to do so. However, the resulting allocations are typically not as efficient as they could be, since optimization techniques have yet to find their way in traditional business process management scenarios. On the other hand, operations research provides a rich set of analytical methods for supporting problem-specific decisions on resource allocation. This paper provides a novel framework for creating transparency on existing tasks and resources, supporting individualized allocations for each activity in a process, and the possibility to integrate problem-specific analytical methods of the operations research domain. To validate the framework, the paper reports on the design and prototypical implementation of a software architecture, which extends a traditional process engine with a dedicated resource management component. This component allows us to define specific resource allocation problems at design time, and it also facilitates optimized resource allocation at run time. The framework is evaluated using a real-world parcel delivery process. The evaluation shows that the quality of the allocation results increase significantly with a technique from operations research in contrast to the traditional applied rule-based approach.
In the field of Business Process Management (BPM), modeling business processes and related data is a critical issue since process activities need to manage data stored in databases. The connection between processes and data is usually handled at the implementation level, even if modeling both processes and data at the conceptual level should help designers in improving business process models and identifying requirements for implementation. Especially in data -and decision-intensive contexts, business process activities need to access data stored both in databases and data warehouses. In this paper, we complete our approach for defining a novel conceptual view that bridges process activities and data. The proposed approach allows the designer to model the connection between business processes and database models and define the operations to perform, providing interesting insights on the overall connected perspective and hints for identifying activities that are crucial for decision support.
Despite advances in machine learning-based clinical prediction models, only few of such models are actually deployed in clinical contexts. Among other reasons, this is due to a lack of validation studies. In this paper, we present and discuss the validation results of a machine learning model for the prediction of acute kidney injury in cardiac surgery patients initially developed on the MIMIC-III dataset when applied to an external cohort of an American research hospital. To help account for the performance differences observed, we utilized interpretability methods based on feature importance, which allowed experts to scrutinize model behavior both at the global and local level, making it possible to gain further insights into why it did not behave as expected on the validation cohort. The knowledge gleaned upon derivation can be potentially useful to assist model update during validation for more generalizable and simpler models. We argue that interpretability methods should be considered by practitioners as a further tool to help explain performance differences and inform model update in validation studies.
To manage tabular data files and leverage their content in a given downstream task, practitioners often design and execute complex transformation pipelines to prepare them. The complexity of such pipelines stems from different factors, including the nature of the preparation tasks, often exploratory or ad-hoc to specific datasets; the large repertory of tools, algorithms, and frameworks that practitioners need to master; and the volume, variety, and velocity of the files to be prepared. Metadata plays a fundamental role in reducing this complexity: characterizing a file assists end users in the design of data preprocessing pipelines, and furthermore paves the way for suggestion, automation, and optimization of data preparation tasks.
Previous research in the areas of data profiling, data integration, and data cleaning, has focused on extracting and characterizing metadata regarding the content of tabular data files, i.e., about the records and attributes of tables. Content metadata are useful for the latter stages of a preprocessing pipeline, e.g., error correction, duplicate detection, or value normalization, but they require a properly formed tabular input. Therefore, these metadata are not relevant for the early stages of a preparation pipeline, i.e., to correctly parse tables out of files. In this dissertation, we turn our focus to what we call the structure of a tabular data file, i.e., the set of characters within a file that do not represent data values but are required to parse and understand the content of the file. We provide three different approaches to represent file structure, an explicit representation based on context-free grammars; an implicit representation based on file-wise similarity; and a learned representation based on machine learning.
In our first contribution, we use the grammar-based representation to characterize a set of over 3000 real-world csv files and identify multiple structural issues that let files deviate from the csv standard, e.g., by having inconsistent delimiters or containing multiple tables. We leverage our learnings about real-world files and propose Pollock, a benchmark to test how well systems parse csv files that have a non-standard structure, without any previous preparation. We report on our experiments on using Pollock to evaluate the performance of 16 real-world data management systems.
Following, we characterize the structure of files implicitly, by defining a measure of structural similarity for file pairs. We design a novel algorithm to compute this measure, which is based on a graph representation of the files' content. We leverage this algorithm and propose Mondrian, a graphical system to assist users in identifying layout templates in a dataset, classes of files that have the same structure, and therefore can be prepared by applying the same preparation pipeline.
Finally, we introduce MaGRiTTE, a novel architecture that uses self-supervised learning to automatically learn structural representations of files in the form of vectorial embeddings at three different levels: cell level, row level, and file level. We experiment with the application of structural embeddings for several tasks, namely dialect detection, row classification, and data preparation efforts estimation.
Our experimental results show that structural metadata, either identified explicitly on parsing grammars, derived implicitly as file-wise similarity, or learned with the help of machine learning architectures, is fundamental to automate several tasks, to scale up preparation to large quantities of files, and to provide repeatable preparation pipelines.
ReadBouncer
(2022)
Motivation:
Nanopore sequencers allow targeted sequencing of interesting nucleotide sequences by rejecting other sequences from individual pores. This feature facilitates the enrichment of low-abundant sequences by depleting overrepresented ones in-silico. Existing tools for adaptive sampling either apply signal alignment, which cannot handle human-sized reference sequences, or apply read mapping in sequence space relying on fast graphical processing units (GPU) base callers for real-time read rejection. Using nanopore long-read mapping tools is also not optimal when mapping shorter reads as usually analyzed in adaptive sampling applications.
Results:
Here, we present a new approach for nanopore adaptive sampling that combines fast CPU and GPU base calling with read classification based on Interleaved Bloom Filters. ReadBouncer improves the potential enrichment of low abundance sequences by its high read classification sensitivity and specificity, outperforming existing tools in the field. It robustly removes even reads belonging to large reference sequences while running on commodity hardware without GPUs, making adaptive sampling accessible for in-field researchers. Readbouncer also provides a user-friendly interface and installer files for end-users without a bioinformatics background.
Based on the performance requirements of modern spatio-temporal data mining applications, in-memory database systems are often used to store and process the data. To efficiently utilize the scarce DRAM capacities, modern database systems support various tuning possibilities to reduce the memory footprint (e.g., data compression) or increase performance (e.g., additional indexes). However, the selection of cost and performance balancing configurations is challenging due to the vast number of possible setups consisting of mutually dependent individual decisions. In this paper, we introduce a novel approach to jointly optimize the compression, sorting, indexing, and tiering configuration for spatio-temporal workloads. Further, we consider horizontal data partitioning, which enables the independent application of different tuning options on a fine-grained level. We propose different linear programming (LP) models addressing cost dependencies at different levels of accuracy to compute optimized tuning configurations for a given workload and memory budgets. To yield maintainable and robust configurations, we extend our LP-based approach to incorporate reconfiguration costs as well as a worst-case optimization for potential workload scenarios. Further, we demonstrate on a real-world dataset that our models allow to significantly reduce the memory footprint with equal performance or increase the performance with equal memory size compared to existing tuning heuristics.
Advancements in computer vision techniques driven by machine learning have facilitated robust and efficient estimation of attributes such as depth, optical flow, albedo, and shading. To encapsulate all such underlying properties associated with images and videos, we evolve the concept of intrinsic images towards intrinsic attributes. Further, rapid hardware growth in the form of high-quality smartphone cameras, readily available depth sensors, mobile GPUs, or dedicated neural processing units have made image and video processing pervasive. In this thesis, we explore the synergies between the above two advancements and propose novel image and video processing techniques and systems based on them. To begin with, we investigate intrinsic image decomposition approaches and analyze how they can be implemented on mobile devices. We propose an approach that considers not only diffuse reflection but also specular reflection; it allows us to decompose an image into specularity, albedo, and shading on a resource constrained system (e.g., smartphones or tablets) using the depth data provided by the built-in depth sensors. In addition, we explore how on-device depth data can further be used to add an immersive dimension to 2D photos, e.g., showcasing parallax effects via 3D photography. In this regard, we develop a novel system for interactive 3D photo generation and stylization on mobile devices. Further, we investigate how adaptive manipulation of baseline-albedo (i.e., chromaticity) can be used for efficient visual enhancement under low-lighting conditions. The proposed technique allows for interactive editing of enhancement settings while achieving improved quality and performance. We analyze the inherent optical flow and temporal noise as intrinsic properties of a video. We further propose two new techniques for applying the above intrinsic attributes for the purpose of consistent video filtering. To this end, we investigate how to remove temporal inconsistencies perceived as flickering artifacts. One of the techniques does not require costly optical flow estimation, while both provide interactive consistency control. Using intrinsic attributes for image and video processing enables new solutions for mobile devices – a pervasive visual computing device – and will facilitate novel applications for Augmented Reality (AR), 3D photography, and video stylization. The proposed low-light enhancement techniques can also improve the accuracy of high-level computer vision tasks (e.g., face detection) under low-light conditions. Finally, our approach for consistent video filtering can extend a wide range of image-based processing for videos.
This technical report presents the results of student projects which were prepared during the lecture “Operating Systems II” offered by the “Operating Systems and Middleware” group at HPI in the Summer term of 2020. The lecture covered ad- vanced aspects of operating system implementation and architecture on topics such as Virtualization, File Systems and Input/Output Systems. In addition to attending the lecture, the participating students were encouraged to gather practical experience by completing a project on a closely related topic over the course of the semester. The results of 10 selected exceptional projects are covered in this report.
The students have completed hands-on projects on the topics of Operating System Design Concepts and Implementation, Hardware/Software Co-Design, Reverse Engineering, Quantum Computing, Static Source-Code Analysis, Operating Systems History, Application Binary Formats and more. It should be recognized that over the course of the semester all of these projects have achieved outstanding results which went far beyond the scope and the expec- tations of the lecture, and we would like to thank all participating students for their commitment and their effort in completing their respective projects, as well as their work on compiling this report.
Digital technology offers significant political, economic, and societal opportunities. At the same time, the notion of digital sovereignty has become a leitmotif in German discourse: the state’s capacity to assume its responsibilities and safeguard society’s – and individuals’ – ability to shape the digital transformation in a self-determined way. The education sector is exemplary for the challenge faced by Germany, and indeed Europe, of harnessing the benefits of digital technology while navigating concerns around sovereignty. It encompasses education as a core public good, a rapidly growing field of business, and growing pools of highly sensitive personal data. The report describes pathways to mitigating the tension between digitalization and sovereignty at three different levels – state, economy, and individual – through the lens of concrete technical projects in the education sector: the HPI Schul-Cloud (state sovereignty), the MERLOT data spaces (economic sovereignty), and the openHPI platform (individual sovereignty).
The Security Operations Center (SOC) represents a specialized unit responsible for managing security within enterprises. To aid in its responsibilities, the SOC relies heavily on a Security Information and Event Management (SIEM) system that functions as a centralized repository for all security-related data, providing a comprehensive view of the organization's security posture. Due to the ability to offer such insights, SIEMS are considered indispensable tools facilitating SOC functions, such as monitoring, threat detection, and incident response.
Despite advancements in big data architectures and analytics, most SIEMs fall short of keeping pace. Architecturally, they function merely as log search engines, lacking the support for distributed large-scale analytics. Analytically, they rely on rule-based correlation, neglecting the adoption of more advanced data science and machine learning techniques.
This thesis first proposes a blueprint for next-generation SIEM systems that emphasize distributed processing and multi-layered storage to enable data mining at a big data scale. Next, with the architectural support, it introduces two data mining approaches for advanced threat detection as part of SOC operations.
First, a novel graph mining technique that formulates threat detection within the SIEM system as a large-scale graph mining and inference problem, built on the principles of guilt-by-association and exempt-by-reputation. The approach entails the construction of a Heterogeneous Information Network (HIN) that models shared characteristics and associations among entities extracted from SIEM-related events/logs. Thereon, a novel graph-based inference algorithm is used to infer a node's maliciousness score based on its associations with other entities in the HIN. Second, an innovative outlier detection technique that imitates a SOC analyst's reasoning process to find anomalies/outliers. The approach emphasizes explainability and simplicity, achieved by combining the output of simple context-aware univariate submodels that calculate an outlier score for each entry.
Both approaches were tested in academic and real-world settings, demonstrating high performance when compared to other algorithms as well as practicality alongside a large enterprise's SIEM system.
This thesis establishes the foundation for next-generation SIEM systems that can enhance today's SOCs and facilitate the transition from human-centric to data-driven security operations.
Digitale Technologien bieten erhebliche politische, wirtschaftliche und gesellschaftliche Chancen. Zugleich ist der Begriff digitale Souveränität zu einem Leitmotiv im deutschen Diskurs über digitale Technologien geworden: das heißt, die Fähigkeit des Staates, seine Verantwortung wahrzunehmen und die Befähigung der Gesellschaft – und des Einzelnen – sicherzustellen, die digitale Transformation selbstbestimmt zu gestalten. Exemplarisch für die Herausforderung in Deutschland und Europa, die Vorteile digitaler Technologien zu nutzen und gleichzeitig Souveränitätsbedenken zu berücksichtigen, steht der Bildungssektor. Er umfasst Bildung als zentrales öffentliches Gut, ein schnell aufkommendes Geschäftsfeld und wachsende Bestände an hochsensiblen personenbezogenen Daten. Davon ausgehend beschreibt der Bericht Wege zur Entschärfung des Spannungsverhältnisses zwischen Digitalisierung und Souveränität auf drei verschiedenen Ebenen – Staat, Wirtschaft und Individuum – anhand konkreter technischer Projekte im Bildungsbereich: die HPI Schul-Cloud (staatliche Souveränität), die MERLOT-Datenräume (wirtschaftliche Souveränität) und die openHPI-Plattform (individuelle Souveränität).
In model-driven engineering, the adaptation of large software systems with dynamic structure is enabled by architectural runtime models. Such a model represents an abstract state of the system as a graph of interacting components. Every relevant change in the system is mirrored in the model and triggers an evaluation of model queries, which search the model for structural patterns that should be adapted. This thesis focuses on a type of runtime models where the expressiveness of the model and model queries is extended to capture past changes and their timing. These history-aware models and temporal queries enable more informed decision-making during adaptation, as they support the formulation of requirements on the evolution of the pattern that should be adapted. However, evaluating temporal queries during adaptation poses significant challenges. First, it implies the capability to specify and evaluate requirements on the structure, as well as the ordering and timing in which structural changes occur. Then, query answers have to reflect that the history-aware model represents the architecture of a system whose execution may be ongoing, and thus answers may depend on future changes. Finally, query evaluation needs to be adequately fast and memory-efficient despite the increasing size of the history---especially for models that are altered by numerous, rapid changes.
The thesis presents a query language and a querying approach for the specification and evaluation of temporal queries. These contributions aim to cope with the challenges of evaluating temporal queries at runtime, a prerequisite for history-aware architectural monitoring and adaptation which has not been systematically treated by prior model-based solutions. The distinguishing features of our contributions are: the specification of queries based on a temporal logic which encodes structural patterns as graphs; the provision of formally precise query answers which account for timing constraints and ongoing executions; the incremental evaluation which avoids the re-computation of query answers after each change; and the option to discard history that is no longer relevant to queries. The query evaluation searches the model for occurrences of a pattern whose evolution satisfies a temporal logic formula. Therefore, besides model-driven engineering, another related research community is runtime verification. The approach differs from prior logic-based runtime verification solutions by supporting the representation and querying of structure via graphs and graph queries, respectively, which is more efficient for queries with complex patterns. We present a prototypical implementation of the approach and measure its speed and memory consumption in monitoring and adaptation scenarios from two application domains, with executions of an increasing size. We assess scalability by a comparison to the state-of-the-art from both related research communities. The implementation yields promising results, which pave the way for sophisticated history-aware self-adaptation solutions and indicate that the approach constitutes a highly effective technique for runtime monitoring on an architectural level.
Most machine learning methods provide only point estimates when being queried to predict on new data. This is problematic when the data is corrupted by noise, e.g. from imperfect measurements, or when the queried data point is very different to the data that the machine learning model has been trained with. Probabilistic modelling in machine learning naturally equips predictions with corresponding uncertainty estimates which allows a practitioner to incorporate information about measurement noise into the modelling process and to know when not to trust the predictions. A well-understood, flexible probabilistic framework is provided by Gaussian processes that are ideal as building blocks of probabilistic models. They lend themself naturally to the problem of regression, i.e., being given a set of inputs and corresponding observations and then predicting likely observations for new unseen inputs, and can also be adapted to many more machine learning tasks. However, exactly inferring the optimal parameters of such a Gaussian process model (in a computationally tractable manner) is only possible for regression tasks in small data regimes. Otherwise, approximate inference methods are needed, the most prominent of which is variational inference.
In this dissertation we study models that are composed of Gaussian processes embedded in other models in order to make those more flexible and/or probabilistic. The first example are deep Gaussian processes which can be thought of as a small network of Gaussian processes and which can be employed for flexible regression. The second model class that we study are Gaussian process state-space models. These can be used for time-series modelling, i.e., the task of being given a stream of data ordered by time and then predicting future observations. For both model classes the state-of-the-art approaches offer a trade-off between expressive models and computational properties (e.g. speed or convergence properties) and mostly employ variational inference. Our goal is to improve inference in both models by first getting a deep understanding of the existing methods and then, based on this, to design better inference methods. We achieve this by either exploring the existing trade-offs or by providing general improvements applicable to multiple methods.
We first provide an extensive background, introducing Gaussian processes and their sparse (approximate and efficient) variants. We continue with a description of the models under consideration in this thesis, deep Gaussian processes and Gaussian process state-space models, including detailed derivations and a theoretical comparison of existing methods.
Then we start analysing deep Gaussian processes more closely: Trading off the properties (good optimisation versus expressivity) of state-of-the-art methods in this field, we propose a new variational inference based approach. We then demonstrate experimentally that our new algorithm leads to better calibrated uncertainty estimates than existing methods.
Next, we turn our attention to Gaussian process state-space models, where we closely analyse the theoretical properties of existing methods.The understanding gained in this process leads us to propose a new inference scheme for general Gaussian process state-space models that incorporates effects on multiple time scales. This method is more efficient than previous approaches for long timeseries and outperforms its comparison partners on data sets in which effects on multiple time scales (fast and slowly varying dynamics) are present.
Finally, we propose a new inference approach for Gaussian process state-space models that trades off the properties of state-of-the-art methods in this field. By combining variational inference with another approximate inference method, the Laplace approximation, we design an efficient algorithm that outperforms its comparison partners since it achieves better calibrated uncertainties.
Proceedings of the HPI Research School on Service-oriented Systems Engineering 2020 Fall Retreat
(2021)
Design and Implementation of service-oriented architectures imposes a huge number of research questions from the fields of software engineering, system analysis and modeling, adaptability, and application integration. Component orientation and web services are two approaches for design and realization of complex web-based system. Both approaches allow for dynamic application adaptation as well as integration of enterprise application.
Service-Oriented Systems Engineering represents a symbiosis of best practices in object-orientation, component-based development, distributed computing, and business process management. It provides integration of business and IT concerns.
The annual Ph.D. Retreat of the Research School provides each member the opportunity to present his/her current state of their research and to give an outline of a prospective Ph.D. thesis. Due to the interdisciplinary structure of the research school, this technical report covers a wide range of topics. These include but are not limited to: Human Computer Interaction and Computer Vision as Service; Service-oriented Geovisualization Systems; Algorithm Engineering for Service-oriented Systems; Modeling and Verification of Self-adaptive Service-oriented Systems; Tools and Methods for Software Engineering in Service-oriented Systems; Security Engineering of Service-based IT Systems; Service-oriented Information Systems; Evolutionary Transition of Enterprise Applications to Service Orientation; Operating System Abstractions for Service-oriented Computing; and Services Specification, Composition, and Enactment.
Learning the causal structures from observational data is an omnipresent challenge in data science. The amount of observational data available to Causal Structure Learning (CSL) algorithms is increasing as data is collected at high frequency from many data sources nowadays. While processing more data generally yields higher accuracy in CSL, the concomitant increase in the runtime of CSL algorithms hinders their widespread adoption in practice. CSL is a parallelizable problem. Existing parallel CSL algorithms address execution on multi-core Central Processing Units (CPUs) with dozens of compute cores. However, modern computing systems are often heterogeneous and equipped with Graphics Processing Units (GPUs) to accelerate computations. Typically, these GPUs provide several thousand compute cores for massively parallel data processing.
To shorten the runtime of CSL algorithms, we design efficient execution strategies that leverage the parallel processing power of GPUs. Particularly, we derive GPU-accelerated variants of a well-known constraint-based CSL method, the PC algorithm, as it allows choosing a statistical Conditional Independence test (CI test) appropriate to the observational data characteristics.
Our two main contributions are: (1) to reflect differences in the CI tests, we design three GPU-based variants of the PC algorithm tailored to CI tests that handle data with the following characteristics. We develop one variant for data assuming the Gaussian distribution model, one for discrete data, and another for mixed discrete-continuous data and data with non-linear relationships. Each variant is optimized for the appropriate CI test leveraging GPU hardware properties, such as shared or thread-local memory. Our GPU-accelerated variants outperform state-of-the-art parallel CPU-based algorithms by factors of up to 93.4× for data assuming the Gaussian distribution model, up to 54.3× for discrete data, up to 240× for continuous data with non-linear relationships and up to 655× for mixed discrete-continuous data. However, the proposed GPU-based variants are limited to datasets that fit into a single GPU’s memory. (2) To overcome this shortcoming, we develop approaches to scale our GPU-based variants beyond a single GPU’s memory capacity. For example, we design an out-of-core GPU variant that employs explicit memory management to process arbitrary-sized datasets. Runtime measurements on a large gene expression dataset reveal that our out-of-core GPU variant is 364 times faster than a parallel CPU-based CSL algorithm. Overall, our proposed GPU-accelerated variants speed up CSL in numerous settings to foster CSL’s adoption in practice and research.
Decubitus is one of the most relevant diseases in nursing and the most expensive to treat. It is caused by sustained pressure on tissue, so it particularly affects bed-bound patients. This work lays a foundation for pressure mattress-based decubitus prophylaxis by implementing a solution to the single-frame 2D Human Pose Estimation problem.
For this, methods of Deep Learning are employed. Two approaches are examined, a coarse-to-fine Convolutional Neural Network for direct regression of joint coordinates and a U-Net for the derivation of probability distribution heatmaps.
We conclude that training our models on a combined dataset of the publicly available Bodies at Rest and SLP data yields the best results. Furthermore, various preprocessing techniques are investigated, and a hyperparameter optimization is performed to discover an improved model architecture.
Another finding indicates that the heatmap-based approach outperforms direct regression.
This model achieves a mean per-joint position error of 9.11 cm for the Bodies at Rest data and 7.43 cm for the SLP data.
We find that it generalizes well on data from mattresses other than those seen during training but has difficulties detecting the arms correctly.
Additionally, we give a brief overview of the medical data annotation tool annoto we developed in the bachelor project and furthermore conclude that the Scrum framework and agile practices enhanced our development workflow.
RailChain
(2023)
The RailChain project designed, implemented, and experimentally evaluated a juridical recorder that is based on a distributed consensus protocol. That juridical blockchain recorder has been realized as distributed ledger on board the advanced TrainLab (ICE-TD 605 017) of Deutsche Bahn.
For the project, a consortium consisting of DB Systel, Siemens, Siemens Mobility, the Hasso Plattner Institute for Digital Engineering, Technische Universität Braunschweig, TÜV Rheinland InterTraffic, and Spherity has been formed. These partners not only concentrated competencies in railway operation, computer science, regulation, and approval, but also combined experiences from industry, research from academia, and enthusiasm from startups.
Distributed ledger technologies (DLTs) define distributed databases and express a digital protocol for transactions between business partners without the need for a trusted intermediary. The implementation of a blockchain with real-time requirements for the local network of a railway system (e.g., interlocking or train) allows to log data in the distributed system verifiably in real-time. For this, railway-specific assumptions can be leveraged to make modifications to standard blockchains protocols.
EULYNX and OCORA (Open CCS On-board Reference Architecture) are parts of a future European reference architecture for control command and signalling (CCS, Reference CCS Architecture – RCA). Both architectural concepts outline heterogeneous IT systems with components from multiple manufacturers. Such systems introduce novel challenges for the approved and safety-relevant CCS of railways which were considered neither for road-side nor for on-board systems so far. Logging implementations, such as the common juridical recorder on vehicles, can no longer be realized as a central component of a single manufacturer. All centralized approaches are in question.
The research project RailChain is funded by the mFUND program and gives practical evidence that distributed consensus protocols are a proper means to immutably (for legal purposes) store state information of many system components from multiple manufacturers. The results of RailChain have been published, prototypically implemented, and experimentally evaluated in large-scale field tests on the advanced TrainLab. At the same time, the project showed how RailChain can be integrated into the road-side and on-board architecture given by OCORA and EULYNX.
Logged data can now be analysed sooner and also their trustworthiness is being increased. This enables, e.g., auditable predictive maintenance, because it is ensured that data is authentic and unmodified at any point in time.
The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners.
The HPI Future SOC Lab provides researchers with free of charge access to a complete infrastructure of state of the art hard and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies.
This technical report presents results of research projects executed in 2018. Selected projects have presented their results on April 17th and November 14th 2017 at the Future SOC Lab Day events.
In this thesis, we investigate language learning in the formalisation of Gold [Gol67]. Here, a learner, being successively presented all information of a target language, conjectures which language it believes to be shown. Once these hypotheses converge syntactically to a correct explanation of the target language, the learning is considered successful. Fittingly, this is termed explanatory learning. To model learning strategies, we impose restrictions on the hypotheses made, for example requiring the conjectures to follow a monotonic behaviour. This way, we can study the impact a certain restriction has on learning.
Recently, the literature shifted towards map charting. Here, various seemingly unrelated restrictions are contrasted, unveiling interesting relations between them. The results are then depicted in maps. For explanatory learning, the literature already provides maps of common restrictions for various forms of data presentation.
In the case of behaviourally correct learning, where the learners are required to converge semantically instead of syntactically, the same restrictions as in explanatory learning have been investigated. However, a similarly complete picture regarding their interaction has not been presented yet.
In this thesis, we transfer the map charting approach to behaviourally correct learning. In particular, we complete the partial results from the literature for many well-studied restrictions and provide full maps for behaviourally correct learning with different types of data presentation. We also study properties of learners assessed important in the literature. We are interested whether learners are consistent, that is, whether their conjectures include the data they are built on. While learners cannot be assumed consistent in explanatory learning, the opposite is the case in behaviourally correct learning. Even further, it is known that learners following different restrictions may be assumed consistent. We contribute to the literature by showing that this is the case for all studied restrictions.
We also investigate mathematically interesting properties of learners. In particular, we are interested in whether learning under a given restriction may be done with strongly Bc-locking learners. Such learners are of particular value as they allow to apply simulation arguments when, for example, comparing two learning paradigms to each other. The literature gives a rich ground on when learners may be assumed strongly Bc-locking, which we complete for all studied restrictions.
The amount of data stored in databases and the complexity of database workloads are ever- increasing. Database management systems (DBMSs) offer many configuration options, such as index creation or unique constraints, which must be adapted to the specific instance to efficiently process large volumes of data. Currently, such database optimization is complicated, manual work performed by highly skilled database administrators (DBAs). In cloud scenarios, manual database optimization even becomes infeasible: it exceeds the abilities of the best DBAs due to the enormous number of deployed DBMS instances (some providers maintain millions of instances), missing domain knowledge resulting from data privacy requirements, and the complexity of the configuration tasks.
Therefore, we investigate how to automate the configuration of DBMSs efficiently with the help of unsupervised database optimization. While there are numerous configuration options, in this thesis, we focus on automatic index selection and the use of data dependencies, such as functional dependencies, for query optimization. Both aspects have an extensive performance impact and complement each other by approaching unsupervised database optimization from different perspectives.
Our contributions are as follows: (1) we survey automated state-of-the-art index selection algorithms regarding various criteria, e.g., their support for index interaction. We contribute an extensible platform for evaluating the performance of such algorithms with industry-standard datasets and workloads. The platform is well-received by the community and has led to follow-up research. With our platform, we derive the strengths and weaknesses of the investigated algorithms. We conclude that existing solutions often have scalability issues and cannot quickly determine (near-)optimal solutions for large problem instances. (2) To overcome these limitations, we present two new algorithms. Extend determines (near-)optimal solutions with an iterative heuristic. It identifies the best index configurations for the evaluated benchmarks. Its selection runtimes are up to 10 times lower compared with other near-optimal approaches. SWIRL is based on reinforcement learning and delivers solutions instantly. These solutions perform within 3 % of the optimal ones. Extend and SWIRL are available as open-source implementations.
(3) Our index selection efforts are complemented by a mechanism that analyzes workloads to determine data dependencies for query optimization in an unsupervised fashion. We describe and classify 58 query optimization techniques based on functional, order, and inclusion dependencies as well as on unique column combinations. The unsupervised mechanism and three optimization techniques are implemented in our open-source research DBMS Hyrise. Our approach reduces the Join Order Benchmark’s runtime by 26 % and accelerates some TPC-DS queries by up to 58 times.
Additionally, we have developed a cockpit for unsupervised database optimization that allows interactive experiments to build confidence in such automated techniques. In summary, our contributions improve the performance of DBMSs, support DBAs in their work, and enable them to contribute their time to other, less arduous tasks.
Pictures are a medium that helps make the past tangible and preserve memories. Without context, they are not able to do so. Pictures are brought to life by their associated stories. However, the older pictures become, the fewer contemporary witnesses can tell these stories.
Especially for large, analog picture archives, knowledge and memories are spread over many people. This creates several challenges: First, the pictures must be digitized to save them from decaying and make them available to the public. Since a simple listing of all the pictures is confusing, the pictures should be structured accessibly. Second, known information that makes the stories vivid needs to be added to the pictures. Users should get the opportunity to contribute their knowledge and memories. To make this usable for all interested parties, even for older, less technophile generations, the interface should be intuitive and error-tolerant.
The resulting requirements are not covered in their entirety by any existing software solution without losing the intuitive interface or the scalability of the system.
Therefore, we have developed our digital picture archive within the scope of a bachelor project in cooperation with the Bad Harzburg-Stiftung. For the implementation of this web application, we use the UI framework React in the frontend, which communicates via a GraphQL interface with the Content Management System Strapi in the backend. The use of this system enables our project partner to create an efficient process from scanning analog pictures to presenting them to visitors in an organized and annotated way. To customize the solution for both picture delivery and information contribution for our target group, we designed prototypes and evaluated them with people from Bad Harzburg. This helped us gain valuable insights into our system’s usability and future challenges as well as requirements.
Our web application is already being used daily by our project partner. During the project, we still came up with numerous ideas for additional features to further support the exchange of knowledge.
Like conventional software projects, projects in model-driven software engineering require adequate management of multiple versions of development artifacts, importantly allowing living with temporary inconsistencies. In the case of model-driven software engineering, employed versioning approaches also have to handle situations where different artifacts, that is, different models, are linked via automatic model transformations.
In this report, we propose a technique for jointly handling the transformation of multiple versions of a source model into corresponding versions of a target model, which enables the use of a more compact representation that may afford improved execution time of both the transformation and further analysis operations. Our approach is based on the well-known formalism of triple graph grammars and a previously introduced encoding of model version histories called multi-version models. In addition to showing the correctness of our approach with respect to the standard semantics of triple graph grammars, we conduct an empirical evaluation that demonstrates the potential benefit regarding execution time performance.
Modular and incremental global model management with extended generalized discrimination networks
(2023)
Complex projects developed under the model-driven engineering paradigm nowadays often involve several interrelated models, which are automatically processed via a multitude of model operations. Modular and incremental construction and execution of such networks of models and model operations are required to accommodate efficient development with potentially large-scale models. The underlying problem is also called Global Model Management.
In this report, we propose an approach to modular and incremental Global Model Management via an extension to the existing technique of Generalized Discrimination Networks (GDNs). In addition to further generalizing the notion of query operations employed in GDNs, we adapt the previously query-only mechanism to operations with side effects to integrate model transformation and model synchronization. We provide incremental algorithms for the execution of the resulting extended Generalized Discrimination Networks (eGDNs), as well as a prototypical implementation for a number of example eGDN operations.
Based on this prototypical implementation, we experiment with an application scenario from the software development domain to empirically evaluate our approach with respect to scalability and conceptually demonstrate its applicability in a typical scenario. Initial results confirm that the presented approach can indeed be employed to realize efficient Global Model Management in the considered scenario.
Here we present an exome-wide rare genetic variant association study for 30 blood biomarkers in 191,971 individuals in the UK Biobank. We compare gene- based association tests for separate functional variant categories to increase interpretability and identify 193 significant gene-biomarker associations. Genes associated with biomarkers were ~ 4.5-fold enriched for conferring Mendelian disorders. In addition to performing weighted gene-based variant collapsing tests, we design and apply variant-category-specific kernel-based tests that integrate quantitative functional variant effect predictions for mis- sense variants, splicing and the binding of RNA-binding proteins. For these tests, we present a computationally efficient combination of the likelihood- ratio and score tests that found 36% more associations than the score test alone while also controlling the type-1 error. Kernel-based tests identified 13% more associations than their gene-based collapsing counterparts and had advantages in the presence of gain of function missense variants. We introduce local collapsing by amino acid position for missense variants and use it to interpret associations and identify potential novel gain of function variants in PIEZO1. Our results show the benefits of investigating different functional mechanisms when performing rare-variant association tests, and demonstrate pervasive rare-variant contribution to biomarker variability.
Here we present an exome-wide rare genetic variant association study for 30 blood biomarkers in 191,971 individuals in the UK Biobank. We compare gene- based association tests for separate functional variant categories to increase interpretability and identify 193 significant gene-biomarker associations. Genes associated with biomarkers were ~ 4.5-fold enriched for conferring Mendelian disorders. In addition to performing weighted gene-based variant collapsing tests, we design and apply variant-category-specific kernel-based tests that integrate quantitative functional variant effect predictions for mis- sense variants, splicing and the binding of RNA-binding proteins. For these tests, we present a computationally efficient combination of the likelihood- ratio and score tests that found 36% more associations than the score test alone while also controlling the type-1 error. Kernel-based tests identified 13% more associations than their gene-based collapsing counterparts and had advantages in the presence of gain of function missense variants. We introduce local collapsing by amino acid position for missense variants and use it to interpret associations and identify potential novel gain of function variants in PIEZO1. Our results show the benefits of investigating different functional mechanisms when performing rare-variant association tests, and demonstrate pervasive rare-variant contribution to biomarker variability.
In the last two decades, process mining has developed from a niche
discipline to a significant research area with considerable impact on academia and industry. Process mining enables organisations to identify the running business processes from historical execution data. The first requirement of any process mining technique is an event log, an artifact that represents concrete business process executions in the form of sequence of events. These logs can be extracted from the organization's information systems and are used by process experts to retrieve deep insights from the organization's running processes. Considering the events pertaining to such logs, the process models can be automatically discovered and enhanced or annotated with performance-related information. Besides behavioral information, event logs contain domain specific data, albeit implicitly. However, such data are usually overlooked and, thus, not utilized to their full potential.
Within the process mining area, we address in this thesis the research gap of discovering, from event logs, the contextual information that cannot be captured by applying existing process mining techniques. Within this research gap, we identify four key problems and tackle them by looking at an event log from different angles. First, we address the problem of deriving an event log in the absence of a proper database access and domain knowledge. The second problem is related to the under-utilization of the implicit domain knowledge present in an event log that can increase the understandability of the discovered process model. Next, there is a lack of a holistic representation of the historical data manipulation at the process model level of abstraction. Last but not least, each process model presumes to be independent of other process models when discovered from an event log, thus, ignoring possible data dependencies between processes within an organization.
For each of the problems mentioned above, this thesis proposes a dedicated method. The first method provides a solution to extract an event log only from the transactions performed on the database that are stored in the form of redo logs. The second method deals with discovering the underlying data model that is implicitly embedded in the event log, thus, complementing the discovered process model with important domain knowledge information. The third method captures, on the process model level, how the data affects the running process instances. Lastly, the fourth method is about the discovery of the relations between business processes (i.e., how they exchange data) from a set of event logs and explicitly representing such complex interdependencies in a business process architecture.
All the methods introduced in this thesis are implemented as a prototype and their feasibility is proven by being applied on real-life event logs.
Learning from failure
(2022)
Regression testing is a widespread practice in today's software industry to ensure software product quality. Developers derive a set of test cases, and execute them frequently to ensure that their change did not adversely affect existing functionality. As the software product and its test suite grow, the time to feedback during regression test sessions increases, and impedes programmer productivity: developers wait longer for tests to complete, and delays in fault detection render fault removal increasingly difficult.
Test case prioritization addresses the problem of long feedback loops by reordering test cases, such that test cases of high failure probability run first, and test case failures become actionable early in the testing process. We ask, given test execution schedules reconstructed from publicly available data, to which extent can their fault detection efficiency improved, and which technique yields the most efficient test schedules with respect to APFD?
To this end, we recover regression 6200 test sessions from the build log files of Travis CI, a popular continuous integration service, and gather 62000 accompanying changelists. We evaluate the efficiency of current test schedules, and examine the prioritization results of state-of-the-art lightweight, history-based heuristics. We propose and evaluate a novel set of prioritization algorithms, which connect software changes and test failures in a matrix-like data structure.
Our studies indicate that the optimization potential is substantial, because the existing test plans score only 30% APFD. The predictive power of past test failures proves to be outstanding: simple heuristics, such as repeating tests with failures in recent sessions, result in efficiency scores of 95% APFD. The best-performing matrix-based heuristic achieves a similar score of 92.5% APFD. In contrast to prior approaches, we argue that matrix-based techniques are useful beyond the scope of effective prioritization, and enable a number of use cases involving software maintenance.
We validate our findings from continuous integration processes by extending a continuous testing tool within development environments with means of test prioritization, and pose further research questions. We think that our findings are suited to propel adoption of (continuous) testing practices, and that programmers' toolboxes should contain test prioritization as an existential productivity tool.
Multiplicative Up-Drift
(2020)
Drift analysis aims at translating the expected progress of an evolutionary algorithm (or more generally, a random process) into a probabilistic guarantee on its run time (hitting time). So far, drift arguments have been successfully employed in the rigorous analysis of evolutionary algorithms, however, only for the situation that the progress is constant or becomes weaker when approaching the target. Motivated by questions like how fast fit individuals take over a population, we analyze random processes exhibiting a (1+delta)-multiplicative growth in expectation. We prove a drift theorem translating this expected progress into a hitting time. This drift theorem gives a simple and insightful proof of the level-based theorem first proposed by Lehre (2011). Our version of this theorem has, for the first time, the best-possible near-linear dependence on 1/delta} (the previous results had an at least near-quadratic dependence), and it only requires a population size near-linear in delta (this was super-quadratic in previous results). These improvements immediately lead to stronger run time guarantees for a number of applications. We also discuss the case of large delta and show stronger results for this setting.
Cyber-physical systems often encompass complex concurrent behavior with timing constraints and probabilistic failures on demand. The analysis whether such systems with probabilistic timed behavior adhere to a given specification is essential. When the states of the system can be represented by graphs, the rule-based formalism of Probabilistic Timed Graph Transformation Systems (PTGTSs) can be used to suitably capture structure dynamics as well as probabilistic and timed behavior of the system. The model checking support for PTGTSs w.r.t. properties specified using Probabilistic Timed Computation Tree Logic (PTCTL) has been already presented. Moreover, for timed graph-based runtime monitoring, Metric Temporal Graph Logic (MTGL) has been developed for stating metric temporal properties on identified subgraphs and their structural changes over time. In this paper, we (a) extend MTGL to the Probabilistic Metric Temporal Graph Logic (PMTGL) by allowing for the specification of probabilistic properties, (b) adapt our MTGL satisfaction checking approach to PTGTSs, and (c) combine the approaches for PTCTL model checking and MTGL satisfaction checking to obtain a Bounded Model Checking (BMC) approach for PMTGL. In our evaluation, we apply an implementation of our BMC approach in AutoGraph to a running example.
Duplicate detection describes the process of finding multiple representations of the same real-world entity in the absence of a unique identifier, and has many application areas, such as customer relationship management, genealogy and social sciences, or online shopping. Due to the increasing amount of data in recent years, the problem has become even more challenging on the one hand, but has led to a renaissance in duplicate detection research on the other hand.
This thesis examines the effects and opportunities of transitive relationships on the duplicate detection process. Transitivity implies that if record pairs ⟨ri,rj⟩ and ⟨rj,rk⟩ are classified as duplicates, then also record pair ⟨ri,rk⟩ has to be a duplicate. However, this reasoning might contradict with the pairwise classification, which is usually based on the similarity of objects. An essential property of similarity, in contrast to equivalence, is that similarity is not necessarily transitive.
First, we experimentally evaluate the effect of an increasing data volume on the threshold selection to classify whether a record pair is a duplicate or non-duplicate. Our experiments show that independently of the pair selection algorithm and the used similarity measure, selecting a suitable threshold becomes more difficult with an increasing number of records due to an increased probability of adding a false duplicate to an existing cluster. Thus, the best threshold changes with the dataset size, and a good threshold for a small (possibly sampled) dataset is not necessarily a good threshold for a larger (possibly complete) dataset. As data grows over time, earlier selected thresholds are no longer a suitable choice, and the problem becomes worse for datasets with larger clusters.
Second, we present with the Duplicate Count Strategy (DCS) and its enhancement DCS++ two alternatives to the standard Sorted Neighborhood Method (SNM) for the selection of candidate record pairs. DCS adapts SNMs window size based on the number of detected duplicates and DCS++ uses transitive dependencies to save complex comparisons for finding duplicates in larger clusters. We prove that with a proper (domain- and data-independent!) threshold, DCS++ is more efficient than SNM without loss of effectiveness.
Third, we tackle the problem of contradicting pairwise classifications. Usually, the transitive closure is used for pairwise classifications to obtain a transitively closed result set. However, the transitive closure disregards negative classifications. We present three new and several existing clustering algorithms and experimentally evaluate them on various datasets and under various algorithm configurations. The results show that the commonly used transitive closure is inferior to most other clustering algorithms, especially for the precision of results. In scenarios with larger clusters, our proposed EMCC algorithm is, together with Markov Clustering, the best performing clustering approach for duplicate detection, although its runtime is longer than Markov Clustering due to the subexponential time complexity. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
Polyglot programming allows developers to use multiple programming languages within the same software project. While it is common to use more than one language in certain programming domains, developers also apply polyglot programming for other purposes such as to re-use software written in other languages. Although established approaches to polyglot programming come with significant limitations, for example, in terms of performance and tool support, developers still use them to be able to combine languages.
Polyglot virtual machines (VMs) such as GraalVM provide a new level of polyglot programming, allowing languages to directly interact with each other. This reduces the amount of glue code needed to combine languages, results in better performance, and enables tools such as debuggers to work across languages. However, only a little research has focused on novel tools that are designed to support developers in building software with polyglot VMs. One reason is that tool-building is often an expensive activity, another one is that polyglot VMs are still a moving target as their use cases and requirements are not yet well understood.
In this thesis, we present an approach that builds on existing self-sustaining programming systems such as Squeak/Smalltalk to enable exploratory programming, a practice for exploring and gathering software requirements, and re-use their extensive tool-building capabilities in the context of polyglot VMs. Based on TruffleSqueak, our implementation for the GraalVM, we further present five case studies that demonstrate how our approach helps tool developers to design and build tools for polyglot programming. We further show that TruffleSqueak can also be used by application developers to build and evolve polyglot applications at run-time and by language and runtime developers to understand the dynamic behavior of GraalVM languages and internals. Since our platform allows all these developers to apply polyglot programming, it can further help to better understand the advantages, use cases, requirements, and challenges of polyglot VMs. Moreover, we demonstrate that our approach can also be applied to other polyglot VMs and that insights gained through it are transferable to other programming systems.
We conclude that our research on tools for polyglot programming is an important step toward making polyglot VMs more approachable for developers in practice. With good tool support, we believe polyglot VMs can make it much more common for developers to take advantage of multiple languages and their ecosystems when building software.
Identity management is at the forefront of applications’ security posture. It separates the unauthorised user from the legitimate individual. Identity management models have evolved from the isolated to the centralised paradigm and identity federations. Within this advancement, the identity provider emerged as a trusted third party that holds a powerful position. Allen postulated the novel self-sovereign identity paradigm to establish a new balance. Thus, extensive research is required to comprehend its virtues and limitations. Analysing the new paradigm, initially, we investigate the blockchain-based self-sovereign identity concept structurally. Moreover, we examine trust requirements in this context by reference to patterns. These shapes comprise major entities linked by a decentralised identity provider. By comparison to the traditional models, we conclude that trust in credential management and authentication is removed. Trust-enhancing attribute aggregation based on multiple attribute providers provokes a further trust shift. Subsequently, we formalise attribute assurance trust modelling by a metaframework. It encompasses the attestation and trust network as well as the trust decision process, including the trust function, as central components. A secure attribute assurance trust model depends on the security of the trust function. The trust function should consider high trust values and several attribute authorities. Furthermore, we evaluate classification, conceptual study, practical analysis and simulation as assessment strategies of trust models. For realising trust-enhancing attribute aggregation, we propose a probabilistic approach. The method exerts the principle characteristics of correctness and validity. These values are combined for one provider and subsequently for multiple issuers. We embed this trust function in a model within the self-sovereign identity ecosystem. To practically apply the trust function and solve several challenges for the service provider that arise from adopting self-sovereign identity solutions, we conceptualise and implement an identity broker. The mediator applies a component-based architecture to abstract from a single solution. Standard identity and access management protocols build the interface for applications. We can conclude that the broker’s usage at the side of the service provider does not undermine self-sovereign principles, but fosters the advancement of the ecosystem. The identity broker is applied to sample web applications with distinct attribute requirements to showcase usefulness for authentication and attribute-based access control within a case study.
Modeling and Formal Analysis of Meta-Ecosystems with Dynamic Structure using Graph Transformation
(2022)
The dynamics of ecosystems is of crucial importance. Various model-based approaches exist to understand and analyze their internal effects. In this paper, we model the space structure dynamics and ecological dynamics of meta-ecosystems using the formal technique of Graph Transformation (short GT). We build GT models to describe how a meta-ecosystem (modeled as a graph) can evolve over time (modeled by GT rules) and to analyze these GT models with respect to qualitative properties such as the existence of structural stabilities. As a case study, we build three GT models describing the space structure dynamics and ecological dynamics of three different savanna meta-ecosystems. The first GT model considers a savanna meta-ecosystem that is limited in space to two ecosystem patches, whereas the other two GT models consider two savanna meta-ecosystems that are unlimited in the number of ecosystem patches and only differ in one GT rule describing how the space structure of the meta-ecosystem grows. In the first two GT models, the space structure dynamics and ecological dynamics of the meta-ecosystem shows two main structural stabilities: the first one based on grassland-savanna-woodland transitions and the second one based on grassland-desert transitions. The transition between these two structural stabilities is driven by high-intensity fires affecting the tree components. In the third GT model, the GT rule for savanna regeneration induces desertification and therefore a collapse of the meta-ecosystem. We believe that GT models provide a complementary avenue to that of existing approaches to rigorously study ecological phenomena.
Cyber-physical systems often encompass complex concurrent behavior with timing constraints and probabilistic failures on demand. The analysis whether such systems with probabilistic timed behavior adhere to a given specification is essential. When the states of the system can be represented by graphs, the rule-based formalism of Probabilistic Timed Graph Transformation Systems (PTGTSs) can be used to suitably capture structure dynamics as well as probabilistic and timed behavior of the system. The model checking support for PTGTSs w.r.t. properties specified using Probabilistic Timed Computation Tree Logic (PTCTL) has been already presented. Moreover, for timed graph-based runtime monitoring, Metric Temporal Graph Logic (MTGL) has been developed for stating metric temporal properties on identified subgraphs and their structural changes over time.
In this paper, we (a) extend MTGL to the Probabilistic Metric Temporal Graph Logic (PMTGL) by allowing for the specification of probabilistic properties, (b) adapt our MTGL satisfaction checking approach to PTGTSs, and (c) combine the approaches for PTCTL model checking and MTGL satisfaction checking to obtain a Bounded Model Checking (BMC) approach for PMTGL. In our evaluation, we apply an implementation of our BMC approach in AutoGraph to a running example.
The analysis of behavioral models such as Graph Transformation Systems (GTSs) is of central importance in model-driven engineering. However, GTSs often result in intractably large or even infinite state spaces and may be equipped with multiple or even infinitely many start graphs. To mitigate these problems, static analysis techniques based on finite symbolic representations of sets of states or paths thereof have been devised. We focus on the technique of k-induction for establishing invariants specified using graph conditions. To this end, k-induction generates symbolic paths backwards from a symbolic state representing a violation of a candidate invariant to gather information on how that violation could have been reached possibly obtaining contradictions to assumed invariants. However, GTSs where multiple agents regularly perform actions independently from each other cannot be analyzed using this technique as of now as the independence among backward steps may prevent the gathering of relevant knowledge altogether.
In this paper, we extend k-induction to GTSs with multiple agents thereby supporting a wide range of additional GTSs. As a running example, we consider an unbounded number of shuttles driving on a large-scale track topology, which adjust their velocity to speed limits to avoid derailing. As central contribution, we develop pruning techniques based on causality and independence among backward steps and verify that k-induction remains sound under this adaptation as well as terminates in cases where it did not terminate before.
It is estimated that data scientists spend up to 80% of the time exploring, cleaning, and transforming their data. A major reason for that expenditure is the lack of knowledge about the used data, which are often from different sources and have heterogeneous structures. As a means to describe various properties of data, metadata can help data scientists understand and prepare their data, saving time for innovative and valuable data analytics. However, metadata do not always exist: some data file formats are not capable of storing them; metadata were deleted for privacy concerns; legacy data may have been produced by systems that were not designed to store and handle meta- data. As data are being produced at an unprecedentedly fast pace and stored in diverse formats, manually creating metadata is not only impractical but also error-prone, demanding automatic approaches for metadata detection.
In this thesis, we are focused on detecting metadata in CSV files – a type of plain-text file that, similar to spreadsheets, may contain different types of content at arbitrary positions. We propose a taxonomy of metadata in CSV files and specifically address the discovery of three different metadata: line and cell type, aggregations, and primary keys and foreign keys.
Data are organized in an ad-hoc manner in CSV files, and do not follow a fixed structure, which is assumed by common data processing tools. Detecting the structure of such files is a prerequisite of extracting information from them, which can be addressed by detecting the semantic type, such as header, data, derived, or footnote, of each line or each cell. We propose the supervised- learning approach Strudel to detect the type of lines and cells. CSV files may also include aggregations. An aggregation represents the arithmetic relationship between a numeric cell and a set of other numeric cells. Our proposed AggreCol algorithm is capable of detecting aggregations of five arithmetic functions in CSV files. Note that stylistic features, such as font style and cell background color, do not exist in CSV files. Our proposed algorithms address the respective problems by using only content, contextual, and computational features.
Storing a relational table is also a common usage of CSV files. Primary keys and foreign keys are important metadata for relational databases, which are usually not present for database instances dumped as plain-text files. We propose the HoPF algorithm to holistically detect both constraints in relational databases. Our approach is capable of distinguishing true primary and foreign keys from a great amount of spurious unique column combinations and inclusion dependencies, which can be detected by state-of-the-art data profiling algorithms.
openHPI
(2022)
Anlässlich des 10-jährigen Jubiläums von openHPI informiert dieser technische Bericht über die HPI-MOOC-Plattform einschließlich ihrer Kernfunktionen, Technologie und Architektur.
In einer Einleitung wird die Plattformfamilie mit allen Partnerplattformen vorgestellt; diese belaufen sich inklusive openHPI aktuell auf neun Plattformen. In diesem Abschnitt wird außerdem gezeigt, wie openHPI als Berater und Forschungspartner in verschiedenen Projekten fungiert.
Im zweiten Kapitel werden die Funktionalitäten und gängigen Kursformate der Plattform präsentiert. Die Funktionalitäten sind in Lerner- und Admin-Funktionen unterteilt. Der Bereich Lernerfunktionen bietet detaillierte Informationen zu Leistungsnachweisen, Kursen und den Lernmaterialien, aus denen sich ein Kurs zusammensetzt: Videos, Texte und Quiz. Darüber hinaus können die Lernmaterialien durch externe Übungstools angereichert werden, die über den Standard Learning Tools Interoperability (LTI) mit der HPI MOOC-Plattform kommunizieren. Das Konzept der Peer-Assessments rundet die möglichen Lernmaterialien ab.
Der Abschnitt geht dann weiter auf das Diskussionsforum ein, das einen grundlegenden Unterschied von MOOCs im Vergleich zu traditionellen E-Learning-Angeboten darstellt. Zum Abschluss des Abschnitts folgen eine Beschreibung von Quiz-Recap, Lernzielen, mobilen Anwendungen, spielerischen Lernens und dem Helpdesk.
Der nächste Teil dieses Kapitels beschäftigt sich mit den Admin-Funktionen. Die Funktionalitätsbeschreibung beschränkt sich Neuigkeiten und Ankündigungen, Dashboards und Statistiken, Berichtsfunktionen, Forschungsoptionen mit A/B-Tests, den Kurs-Feed und das TransPipe-Tool zur Unterstützung beim Erstellen von automatischen oder manuellen Untertiteln. Die Plattform unterstützt außerdem eine Vielzahl zusätzlicher Funktionen, doch eine detaillierte Beschreibung dieser Funktionen würde den Rahmen des Berichts sprengen.
Das Kapitel geht dann auf gängige Kursformate und openHPI-Lehrveranstaltungen am HPI ein, bevor es mit einigen Best Practices für die Gestaltung und Durchführung von Kursen schließt.
Zum Abschluss des technischen Berichts gibt das letzte Kapitel eine Zusammenfassung und einen Ausblick auf die Zukunft der digitalen Bildung.
Ein besonderes Merkmal des openHPI-Projekts ist die bewusste Entscheidung, die komplette Anwendung von den physischen Netzwerkkomponenten bis zur Plattformentwicklung eigenständig zu betreiben. Bei der vorliegenden deutschen Variante handelt es sich um eine gekürzte Übersetzung des technischen Berichts 148, bei der kein Einblick in die Technologien und Architektur von openHPI gegeben wird. Interessierte Leser:innen können im technischen Bericht 148 (vollständige englische Version) detaillierte Informationen zum Rechenzentrum und den Geräten, der Cloud-Software und dem openHPI Cloud Service aber auch zu Infrastruktur-Anwendungskomponenten wie Entwicklungstools, Automatisierung, Deployment-Pipeline und Monitoring erhalten. Außerdem finden sich dort weitere Informationen über den Technologiestack und konkrete Implementierungsdetails der Plattform inklusive der serviceorientierten Ruby on Rails-Anwendung, die Kommunikation zwischen den Diensten, öffentliche APIs, sowie Designsystem und -komponenten. Der Abschnitt schließt mit einer Diskussion über die ursprüngliche Microservice-Architektur und die Migration zu einer monolithischen Anwendung.
openHPI
(2022)
On the occasion of the 10th openHPI anniversary, this technical report provides information about the HPI MOOC platform, including its core features, technology, and architecture.
In an introduction, the platform family with all partner platforms is presented; these now amount to nine platforms, including openHPI. This section introduces openHPI as an advisor and research partner in various projects.
In the second chapter, the functionalities and common course formats of the platform are presented. The functionalities are divided into learner and admin features. The learner features section provides detailed information about performance records, courses, and the learning materials of which a course is composed: videos, texts, and quizzes. In addition, the learning materials can be enriched by adding external exercise tools that communicate with the HPI MOOC platform via the Learning Tools Interoperability (LTI) standard. Furthermore, the concept of peer assessments completed the possible learning materials.
The section then proceeds with further information on the discussion forum, a fundamental concept of MOOCs compared to traditional e-learning offers. The section is concluded with a description of the quiz recap, learning objectives, mobile applications, gameful learning, and the help desk.
The next part of this chapter deals with the admin features. The described functionality is restricted to describing the news and announcements, dashboards and statistics, reporting capabilities, research options with A/B testing, the course feed, and the TransPipe tool to support the process of creating automated or manual subtitles. The platform supports a large variety of additional features, but a detailed description of these features goes beyond the scope of this report.
The chapter then elaborates on common course formats and openHPI teaching activities at the HPI. The chapter concludes with some best practices for course design and delivery.
The third chapter provides insights into the technology and architecture behind openHPI. A special characteristic of the openHPI project is the conscious decision to operate the complete application from bare metal to platform development. Hence, the chapter starts with a section about the openHPI Cloud, including detailed information about the data center and devices, the used cloud software OpenStack and Ceph, as well as the openHPI Cloud Service provided for the HPI.
Afterward, a section on the application technology stack and development tooling describes the application infrastructure components, the used automation, the deployment pipeline, and the tools used for monitoring and alerting. The chapter is concluded with detailed information about the technology stack and concrete platform implementation details. The section describes the service-oriented Ruby on Rails application, inter-service communication, and public APIs. It also provides more information on the design system and components used in the application. The section concludes with a discussion of the original microservice architecture, where we share our insights and reasoning for migrating back to a monolithic application.
The last chapter provides a summary and an outlook on the future of digital education.
Social networking sites (SNS) are a rich source of latent information about individual characteristics. Crawling and analyzing this content provides a new approach for enterprises to personalize services and put forward product recommendations. In the past few years, commercial brands made a gradual appearance on social media platforms for advertisement, customers support and public relation purposes and by now it became a necessity throughout all branches. This online identity can be represented as a brand personality that reflects how a brand is perceived by its customers. We exploited recent research in text analysis and personality detection to build an automatic brand personality prediction model on top of the (Five-Factor Model) and (Linguistic Inquiry and Word Count) features extracted from publicly available benchmarks. Predictive evaluation on brands' accounts reveals that Facebook platform provides a slight advantage over Twitter platform in offering more self-disclosure for users' to express their emotions especially their demographic and psychological traits. Results also confirm the wider perspective that the same social media account carry a quite similar and comparable personality scores over different social media platforms. For evaluating our prediction results on actual brands' accounts, we crawled the Facebook API and Twitter API respectively for 100k posts from the most valuable brands' pages in the USA and we visualize exemplars of comparison results and present suggestions for future directions.
Scrollytellings are an innovative form of web content. Combining the benefits of books, images, movies, and video games, they are a tool to tell compelling stories and provide excellent learning opportunities. Due to their multi-modality, creating high-quality scrollytellings is not an easy task. Different professions, such as content designers, graphics designers, and developers, need to collaborate to get the best out of the possibilities the scrollytelling format provides. Collaboration unlocks great potential. However, content designers cannot create scrollytellings directly and always need to consult with developers to implement their vision. This can result in misunderstandings. Often, the resulting scrollytelling will not match the designer’s vision sufficiently, causing unnecessary iterations. Our project partner Typeshift specializes in the creation of individualized scrollytellings for their clients. Examined existing solutions for authoring interactive content are not optimally suited for creating highly customized scrollytellings while still being able to manipulate all their elements programmatically. Based on their experience and expertise, we developed an editor to author scrollytellings in the lively.next live-programming environment. In this environment, a graphical user interface for content design is combined with powerful possibilities for programming behavior with the morphic system. The editor allows content designers to take on large parts of the creation process of scrollytellings on their own, such as creating the visible elements, animating content, and fine-tuning the scrollytelling. Hence, developers can focus on interactive elements such as simulations and games. Together with Typeshift, we evaluated the tool by recreating an existing scrollytelling and identified possible future enhancements. Our editor streamlines the creation process of scrollytellings. Content designers and developers can now both work on the same scrollytelling. Due to the editor inside of the lively.next environment, they can both work with a set of tools familiar to them and their traits. Thus, we mitigate unnecessary iterations and misunderstandings by enabling content designers to realize large parts of their vision of a scrollytelling on their own. Developers can add advanced and individual behavior. Thus, developers and content designers benefit from a clearer distribution of tasks while keeping the benefits of collaboration.
Viper
(2021)
Key-value stores (KVSs) have found wide application in modern software systems. For persistence, their data resides in slow secondary storage, which requires KVSs to employ various techniques to increase their read and write performance from and to the underlying medium. Emerging persistent memory (PMem) technologies offer data persistence at close-to-DRAM speed, making them a promising alternative to classical disk-based storage. However, simply drop-in replacing existing storage with PMem does not yield good results, as block-based access behaves differently in PMem than on disk and ignores PMem's byte addressability, layout, and unique performance characteristics. In this paper, we propose three PMem-specific access patterns and implement them in a hybrid PMem-DRAM KVS called Viper. We employ a DRAM-based hash index and a PMem-aware storage layout to utilize the random-write speed of DRAM and efficient sequential-write performance PMem. Our evaluation shows that Viper significantly outperforms existing KVSs for core KVS operations while providing full data persistence. Moreover, Viper outperforms existing PMem-only, hybrid, and disk-based KVSs by 4-18x for write workloads, while matching or surpassing their get performance.
Text is a ubiquitous entity in our world and daily life. We encounter it nearly everywhere in shops, on the street, or in our flats. Nowadays, more and more text is contained in digital images. These images are either taken using cameras, e.g., smartphone cameras, or taken using scanning devices such as document scanners. The sheer amount of available data, e.g., millions of images taken by Google Streetview, prohibits manual analysis and metadata extraction. Although much progress was made in the area of optical character recognition (OCR) for printed text in documents, broad areas of OCR are still not fully explored and hold many research challenges. With the mainstream usage of machine learning and especially deep learning, one of the most pressing problems is the availability and acquisition of annotated ground truth for the training of machine learning models because obtaining annotated training data using manual annotation mechanisms is time-consuming and costly. In this thesis, we address of how we can reduce the costs of acquiring ground truth annotations for the application of state-of-the-art machine learning methods to optical character recognition pipelines. To this end, we investigate how we can reduce the annotation cost by using only a fraction of the typically required ground truth annotations, e.g., for scene text recognition systems. We also investigate how we can use synthetic data to reduce the need of manual annotation work, e.g., in the area of document analysis for archival material. In the area of scene text recognition, we have developed a novel end-to-end scene text recognition system that can be trained using inexact supervision and shows competitive/state-of-the-art performance on standard benchmark datasets for scene text recognition. Our method consists of two independent neural networks, combined using spatial transformer networks. Both networks learn together to perform text localization and text recognition at the same time while only using annotations for the recognition task. We apply our model to end-to-end scene text recognition (meaning localization and recognition of words) and pure scene text recognition without any changes in the network architecture.
In the second part of this thesis, we introduce novel approaches for using and generating synthetic data to analyze handwriting in archival data. First, we propose a novel preprocessing method to determine whether a given document page contains any handwriting. We propose a novel data synthesis strategy to train a classification model and show that our data synthesis strategy is viable by evaluating the trained model on real images from an archive. Second, we introduce the new analysis task of handwriting classification. Handwriting classification entails classifying a given handwritten word image into classes such as date, word, or number. Such an analysis step allows us to select the best fitting recognition model for subsequent text recognition; it also allows us to reason about the semantic content of a given document page without the need for fine-grained text recognition and further analysis steps, such as Named Entity Recognition. We show that our proposed approaches work well when trained on synthetic data. Further, we propose a flexible metric learning approach to allow zero-shot classification of classes unseen during the network’s training. Last, we propose a novel data synthesis algorithm to train off-the-shelf pixel-wise semantic segmentation networks for documents. Our data synthesis pipeline is based on the famous Style-GAN architecture and can synthesize realistic document images with their corresponding segmentation annotation without the need for any annotated data!
Which event happened first?
(2021)
First come, first served: Critical choices between alternative actions are often made based on events external to an organization, and reacting promptly to their occurrence can be a major advantage over the competition. In Business Process Management (BPM), such deferred choices can be expressed in process models, and they are an important aspect of process engines. Blockchain-based process execution approaches are no exception to this, but are severely limited by the inherent properties of the platform: The isolated environment prevents direct access to external entities and data, and the non-continual runtime based entirely on atomic transactions impedes the monitoring and detection of events. In this paper we provide an in-depth examination of the semantics of deferred choice, and transfer them to environments such as the blockchain. We introduce and compare several oracle architectures able to satisfy certain requirements, and show that they can be implemented using state-of-the-art blockchain technology.
First come, first served: Critical choices between alternative actions are often made based on events external to an organization, and reacting promptly to their occurrence can be a major advantage over the competition. In Business Process Management (BPM), such deferred choices can be expressed in process models, and they are an important aspect of process engines. Blockchain-based process execution approaches are no exception to this, but are severely limited by the inherent properties of the platform: The isolated environment prevents direct access to external entities and data, and the non-continual runtime based entirely on atomic transactions impedes the monitoring and detection of events. In this paper we provide an in-depth examination of the semantics of deferred choice, and transfer them to environments such as the blockchain. We introduce and compare several oracle architectures able to satisfy certain requirements, and show that they can be implemented using state-of-the-art blockchain technology.
Language developers who design domain-specific languages or new language features need a way to make fast changes to language definitions. Those fast changes require immediate feedback. Also, it should be possible to parse the developed languages quickly to handle extensive sets of code.
Parsing expression grammars provides an easy to understand method for language definitions. Packrat parsing is a method to parse grammars of this kind, but this method is unable to handle left-recursion properly. Existing solutions either partially rewrite left-recursive rules and partly forbid them, or use complex extensions to packrat parsing that are hard to understand and cost-intensive. We investigated methods to make parsing as fast as possible, using easy to follow algorithms while not losing the ability to make fast changes to grammars.
We focused our efforts on two approaches.
One is to start from an existing technique for limited left-recursion rewriting and enhance it to work for general left-recursive grammars. The second approach is to design a grammar compilation process to find left-recursion before parsing, and in this way, reduce computational costs wherever possible and generate ready to use parser classes.
Rewriting parsing expression grammars is a task that, if done in a general way, unveils a large number of cases such that any rewriting algorithm surpasses the complexity of other left-recursive parsing algorithms. Lookahead operators introduce this complexity. However, most languages have only little portions that are left-recursive and in virtually all cases, have no indirect or hidden left-recursion. This means that the distinction of left-recursive parts of grammars from components that are non-left-recursive holds great improvement potential for existing parsers.
In this report, we list all the required steps for grammar rewriting to handle left-recursion, including grammar analysis, grammar rewriting itself, and syntax tree restructuring. Also, we describe the implementation of a parsing expression grammar framework in Squeak/Smalltalk and the possible interactions with the already existing parser Ohm/S. We quantitatively benchmarked this framework directing our focus on parsing time and the ability to use it in a live programming context. Compared with Ohm, we achieved massive parsing time improvements while preserving the ability to use our parser it as a live programming tool.
The work is essential because, for one, we outlined the difficulties and complexity that come with grammar rewriting. Also, we removed the existing limitations that came with left-recursion by eliminating them before parsing.