004 Datenverarbeitung; Informatik
Refine
Has Fulltext
- yes (582) (remove)
Year of publication
Document Type
- Article (176)
- Monograph/Edited Volume (165)
- Doctoral Thesis (137)
- Postprint (50)
- Conference Proceeding (38)
- Master's Thesis (10)
- Preprint (3)
- Bachelor Thesis (1)
- Habilitation Thesis (1)
- Moving Images (1)
Language
- English (417)
- German (163)
- Multiple languages (2)
Keywords
- Informatik (21)
- machine learning (17)
- Didaktik (15)
- Hochschuldidaktik (14)
- Ausbildung (13)
- Cloud Computing (13)
- cloud computing (13)
- answer set programming (11)
- maschinelles Lernen (11)
- Forschungsprojekte (10)
Institute
- Institut für Informatik und Computational Science (207)
- Hasso-Plattner-Institut für Digital Engineering gGmbH (168)
- Hasso-Plattner-Institut für Digital Engineering GmbH (110)
- Extern (62)
- Mathematisch-Naturwissenschaftliche Fakultät (24)
- Institut für Mathematik (10)
- Digital Engineering Fakultät (8)
- Wirtschaftswissenschaften (8)
- Institut für Umweltwissenschaften und Geographie (5)
- Institut für Physik und Astronomie (4)
To manage tabular data files and leverage their content in a given downstream task, practitioners often design and execute complex transformation pipelines to prepare them. The complexity of such pipelines stems from different factors, including the nature of the preparation tasks, often exploratory or ad-hoc to specific datasets; the large repertory of tools, algorithms, and frameworks that practitioners need to master; and the volume, variety, and velocity of the files to be prepared. Metadata plays a fundamental role in reducing this complexity: characterizing a file assists end users in the design of data preprocessing pipelines, and furthermore paves the way for suggestion, automation, and optimization of data preparation tasks.
Previous research in the areas of data profiling, data integration, and data cleaning, has focused on extracting and characterizing metadata regarding the content of tabular data files, i.e., about the records and attributes of tables. Content metadata are useful for the latter stages of a preprocessing pipeline, e.g., error correction, duplicate detection, or value normalization, but they require a properly formed tabular input. Therefore, these metadata are not relevant for the early stages of a preparation pipeline, i.e., to correctly parse tables out of files. In this dissertation, we turn our focus to what we call the structure of a tabular data file, i.e., the set of characters within a file that do not represent data values but are required to parse and understand the content of the file. We provide three different approaches to represent file structure, an explicit representation based on context-free grammars; an implicit representation based on file-wise similarity; and a learned representation based on machine learning.
In our first contribution, we use the grammar-based representation to characterize a set of over 3000 real-world csv files and identify multiple structural issues that let files deviate from the csv standard, e.g., by having inconsistent delimiters or containing multiple tables. We leverage our learnings about real-world files and propose Pollock, a benchmark to test how well systems parse csv files that have a non-standard structure, without any previous preparation. We report on our experiments on using Pollock to evaluate the performance of 16 real-world data management systems.
Following, we characterize the structure of files implicitly, by defining a measure of structural similarity for file pairs. We design a novel algorithm to compute this measure, which is based on a graph representation of the files' content. We leverage this algorithm and propose Mondrian, a graphical system to assist users in identifying layout templates in a dataset, classes of files that have the same structure, and therefore can be prepared by applying the same preparation pipeline.
Finally, we introduce MaGRiTTE, a novel architecture that uses self-supervised learning to automatically learn structural representations of files in the form of vectorial embeddings at three different levels: cell level, row level, and file level. We experiment with the application of structural embeddings for several tasks, namely dialect detection, row classification, and data preparation efforts estimation.
Our experimental results show that structural metadata, either identified explicitly on parsing grammars, derived implicitly as file-wise similarity, or learned with the help of machine learning architectures, is fundamental to automate several tasks, to scale up preparation to large quantities of files, and to provide repeatable preparation pipelines.
Column-oriented database systems can efficiently process transactional and analytical queries on a single node. However, increasing or peak analytical loads can quickly saturate single-node database systems. Then, a common scale-out option is using a database cluster with a single primary node for transaction processing and read-only replicas. Using (the naive) full replication, queries are distributed among nodes independently of the accessed data. This approach is relatively expensive because all nodes must store all data and apply all data modifications caused by inserts, deletes, or updates.
In contrast to full replication, partial replication is a more cost-efficient implementation: Instead of duplicating all data to all replica nodes, partial replicas store only a subset of the data while being able to process a large workload share. Besides lower storage costs, partial replicas enable (i) better scaling because replicas must potentially synchronize only subsets of the data modifications and thus have more capacity for read-only queries and (ii) better elasticity because replicas have to load less data and can be set up faster. However, splitting the overall workload evenly among the replica nodes while optimizing the data allocation is a challenging assignment problem.
The calculation of optimized data allocations in a partially replicated database cluster can be modeled using integer linear programming (ILP). ILP is a common approach for solving assignment problems, also in the context of database systems. Because ILP is not scalable, existing approaches (also for calculating partial allocations) often fall back to simple (e.g., greedy) heuristics for larger problem instances. Simple heuristics may work well but can lose optimization potential.
In this thesis, we present optimal and ILP-based heuristic programming models for calculating data fragment allocations for partially replicated database clusters. Using ILP, we are flexible to extend our models to (i) consider data modifications and reallocations and (ii) increase the robustness of allocations to compensate for node failures and workload uncertainty. We evaluate our approaches for TPC-H, TPC-DS, and a real-world accounting workload and compare the results to state-of-the-art allocation approaches. Our evaluations show significant improvements for varied allocation’s properties: Compared to existing approaches, we can, for example, (i) almost halve the amount of allocated data, (ii) improve the throughput in case of node failures and workload uncertainty while using even less memory, (iii) halve the costs of data modifications, and (iv) reallocate less than 90% of data when adding a node to the cluster. Importantly, we can calculate the corresponding ILP-based heuristic solutions within a few seconds. Finally, we demonstrate that the ideas of our ILP-based heuristics are also applicable to the index selection problem.
The wide distribution of location-acquisition technologies means that large volumes of spatio-temporal data are continuously being accumulated. Positioning systems such as GPS enable the tracking of various moving objects' trajectories, which are usually represented by a chronologically ordered sequence of observed locations. The analysis of movement patterns based on detailed positional information creates opportunities for applications that can improve business decisions and processes in a broad spectrum of industries (e.g., transportation, traffic control, or medicine). Due to the large data volumes generated in these applications, the cost-efficient storage of spatio-temporal data is desirable, especially when in-memory database systems are used to achieve interactive performance requirements.
To efficiently utilize the available DRAM capacities, modern database systems support various tuning possibilities to reduce the memory footprint (e.g., data compression) or increase performance (e.g., additional indexes structures). By considering horizontal data partitioning, we can independently apply different tuning options on a fine-grained level. However, the selection of cost and performance-balancing configurations is challenging, due to the vast number of possible setups consisting of mutually dependent individual decisions.
In this thesis, we introduce multiple approaches to improve spatio-temporal data management by automatically optimizing diverse tuning options for the application-specific access patterns and data characteristics. Our contributions are as follows:
(1) We introduce a novel approach to determine fine-grained table configurations for spatio-temporal workloads. Our linear programming (LP) approach jointly optimizes the (i) data compression, (ii) ordering, (iii) indexing, and (iv) tiering. We propose different models which address cost dependencies at different levels of accuracy to compute optimized tuning configurations for a given workload, memory budgets, and data characteristics. To yield maintainable and robust configurations, we further extend our LP-based approach to incorporate reconfiguration costs as well as optimizations for multiple potential workload scenarios.
(2) To optimize the storage layout of timestamps in columnar databases, we present a heuristic approach for the workload-driven combined selection of a data layout and compression scheme. By considering attribute decomposition strategies, we are able to apply application-specific optimizations that reduce the memory footprint and improve performance.
(3) We introduce an approach that leverages past trajectory data to improve the dispatch processes of transportation network companies. Based on location probabilities, we developed risk-averse dispatch strategies that reduce critical delays.
(4) Finally, we used the use case of a transportation network company to evaluate our database optimizations on a real-world dataset. We demonstrate that workload-driven fine-grained optimizations allow us to reduce the memory footprint (up to 71% by equal performance) or increase the performance (up to 90% by equal memory size) compared to established rule-based heuristics.
Individually, our contributions provide novel approaches to the current challenges in spatio-temporal data mining and database research. Combining them allows in-memory databases to store and process spatio-temporal data more cost-efficiently.
Um in der Schule bereits frühzeitig ein Verständnis für informatische Prozesse zu vermitteln wurde das neue Informatikfach Digitale Welt für die Klassenstufe 5 konzipiert mit der bundesweit einmaligen Verbindung von Informatik mit anwendungsbezogenen und gesellschaftlich relevanten Bezügen zur Ökologie und Ökonomie. Der Technische Report gibt eine Handreichung zur Einführung des neuen Faches.
Deep learning has seen widespread application in many domains, mainly for its ability to learn data representations from raw input data. Nevertheless, its success has so far been coupled with the availability of large annotated (labelled) datasets. This is a requirement that is difficult to fulfil in several domains, such as in medical imaging. Annotation costs form a barrier in extending deep learning to clinically-relevant use cases. The labels associated with medical images are scarce, since the generation of expert annotations of multimodal patient data at scale is non-trivial, expensive, and time-consuming. This substantiates the need for algorithms that learn from the increasing amounts of unlabeled data. Self-supervised representation learning algorithms offer a pertinent solution, as they allow solving real-world (downstream) deep learning tasks with fewer annotations. Self-supervised approaches leverage unlabeled samples to acquire generic features about different concepts, enabling annotation-efficient downstream task solving subsequently.
Nevertheless, medical images present multiple unique and inherent challenges for existing self-supervised learning approaches, which we seek to address in this thesis: (i) medical images are multimodal, and their multiple modalities are heterogeneous in nature and imbalanced in quantities, e.g. MRI and CT; (ii) medical scans are multi-dimensional, often in 3D instead of 2D; (iii) disease patterns in medical scans are numerous and their incidence exhibits a long-tail distribution, so it is oftentimes essential to fuse knowledge from different data modalities, e.g. genomics or clinical data, to capture disease traits more comprehensively; (iv) Medical scans usually exhibit more uniform color density distributions, e.g. in dental X-Rays, than natural images. Our proposed self-supervised methods meet these challenges, besides significantly reducing the amounts of required annotations.
We evaluate our self-supervised methods on a wide array of medical imaging applications and tasks. Our experimental results demonstrate the obtained gains in both annotation-efficiency and performance; our proposed methods outperform many approaches from related literature. Additionally, in case of fusion with genetic modalities, our methods also allow for cross-modal interpretability. In this thesis, not only we show that self-supervised learning is capable of mitigating manual annotation costs, but also our proposed solutions demonstrate how to better utilize it in the medical imaging domain. Progress in self-supervised learning has the potential to extend deep learning algorithms application to clinical scenarios.
Knowledge about causal structures is crucial for decision support in various domains. For example, in discrete manufacturing, identifying the root causes of failures and quality deviations that interrupt the highly automated production process requires causal structural knowledge. However, in practice, root cause analysis is usually built upon individual expert knowledge about associative relationships. But, "correlation does not imply causation", and misinterpreting associations often leads to incorrect conclusions. Recent developments in methods for causal discovery from observational data have opened the opportunity for a data-driven examination. Despite its potential for data-driven decision support, omnipresent challenges impede causal discovery in real-world scenarios. In this thesis, we make a threefold contribution to improving causal discovery in practice.
(1) The growing interest in causal discovery has led to a broad spectrum of methods with specific assumptions on the data and various implementations. Hence, application in practice requires careful consideration of existing methods, which becomes laborious when dealing with various parameters, assumptions, and implementations in different programming languages. Additionally, evaluation is challenging due to the lack of ground truth in practice and limited benchmark data that reflect real-world data characteristics.
To address these issues, we present a platform-independent modular pipeline for causal discovery and a ground truth framework for synthetic data generation that provides comprehensive evaluation opportunities, e.g., to examine the accuracy of causal discovery methods in case of inappropriate assumptions.
(2) Applying constraint-based methods for causal discovery requires selecting a conditional independence (CI) test, which is particularly challenging in mixed discrete-continuous data omnipresent in many real-world scenarios. In this context, inappropriate assumptions on the data or the commonly applied discretization of continuous variables reduce the accuracy of CI decisions, leading to incorrect causal structures.
Therefore, we contribute a non-parametric CI test leveraging k-nearest neighbors methods and prove its statistical validity and power in mixed discrete-continuous data, as well as the asymptotic consistency when used in constraint-based causal discovery. An extensive evaluation of synthetic and real-world data shows that the proposed CI test outperforms state-of-the-art approaches in the accuracy of CI testing and causal discovery, particularly in settings with low sample sizes.
(3) To show the applicability and opportunities of causal discovery in practice, we examine our contributions in real-world discrete manufacturing use cases. For example, we showcase how causal structural knowledge helps to understand unforeseen production downtimes or adds decision support in case of failures and quality deviations in automotive body shop assembly lines.
HPI Future SOC Lab
(2024)
The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners.
The HPI Future SOC Lab provides researchers with free of charge access to a complete infrastructure of state of the art hard and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies.
This technical report presents results of research projects executed in 2020. Selected projects have presented their results on April 21st and November 10th 2020 at the Future SOC Lab Day events.
The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners.
The HPI Future SOC Lab provides researchers with free of charge access to a complete infrastructure of state of the art hard and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies.
This technical report presents results of research projects executed in 2019. Selected projects have presented their results on April 9th and November 12th 2019 at the Future SOC Lab Day events.
Data preparation stands as a cornerstone in the landscape of data science workflows, commanding a significant portion—approximately 80%—of a data scientist's time. The extensive time consumption in data preparation is primarily attributed to the intricate challenge faced by data scientists in devising tailored solutions for downstream tasks. This complexity is further magnified by the inadequate availability of metadata, the often ad-hoc nature of preparation tasks, and the necessity for data scientists to grapple with a diverse range of sophisticated tools, each presenting its unique intricacies and demands for proficiency.
Previous research in data management has traditionally concentrated on preparing the content within columns and rows of a relational table, addressing tasks, such as string disambiguation, date standardization, or numeric value normalization, commonly referred to as data cleaning. This focus assumes a perfectly structured input table. Consequently, the mentioned data cleaning tasks can be effectively applied only after the table has been successfully loaded into the respective data cleaning environment, typically in the later stages of the data processing pipeline.
While current data cleaning tools are well-suited for relational tables, extensive data repositories frequently contain data stored in plain text files, such as CSV files, due to their adaptable standard. Consequently, these files often exhibit tables with a flexible layout of rows and columns, lacking a relational structure. This flexibility often results in data being distributed across cells in arbitrary positions, typically guided by user-specified formatting guidelines.
Effectively extracting and leveraging these tables in subsequent processing stages necessitates accurate parsing. This thesis emphasizes what we define as the “structure” of a data file—the fundamental characters within a file essential for parsing and comprehending its content. Concentrating on the initial stages of the data preprocessing pipeline, this thesis addresses two crucial aspects: comprehending the structural layout of a table within a raw data file and automatically identifying and rectifying any structural issues that might hinder its parsing. Although these issues may not directly impact the table's content, they pose significant challenges in parsing the table within the file.
Our initial contribution comprises an extensive survey of commercially available data preparation tools. This survey thoroughly examines their distinct features, the lacking features, and the necessity for preliminary data processing despite these tools. The primary goal is to elucidate the current state-of-the-art in data preparation systems while identifying areas for enhancement. Furthermore, the survey explores the encountered challenges in data preprocessing, emphasizing opportunities for future research and improvement.
Next, we propose a novel data preparation pipeline designed for detecting and correcting structural errors. The aim of this pipeline is to assist users at the initial preprocessing stage by ensuring the correct loading of their data into their preferred systems. Our approach begins by introducing SURAGH, an unsupervised system that utilizes a pattern-based method to identify dominant patterns within a file, independent of external information, such as data types, row structures, or schemata. By identifying deviations from the dominant pattern, it detects ill-formed rows. Subsequently, our structure correction system, TASHEEH, gathers the identified ill-formed rows along with dominant patterns and employs a novel pattern transformation algebra to automatically rectify errors. Our pipeline serves as an end-to-end solution, transforming a structurally broken CSV file into a well-formatted one, usually suitable for seamless loading.
Finally, we introduce MORPHER, a user-friendly GUI integrating the functionalities of both SURAGH and TASHEEH. This interface empowers users to access the pipeline's features through visual elements. Our extensive experiments demonstrate the effectiveness of our data preparation systems, requiring no user involvement. Both SURAGH and TASHEEH outperform existing state-of-the-art methods significantly in both precision and recall.
Organizations are investing billions on innovation and agility initiatives to stay competitive in their increasingly uncertain business environments. Design Thinking, an innovation approach based on human-centered exploration, ideation and experimentation, has gained increasing popularity. The market for Design Thinking, including software products and general services, is projected to reach 2.500 million $ (US-Dollar) by 2028. A dispersed set of positive outcomes have been attributed to Design Thinking. However, there is no clear understanding of what exactly comprises the impact of Design Thinking and how it is created. To support a billion-dollar market, it is essential to understand the value Design Thinking is bringing to organizations not only to justify large investments, but to continuously improve the approach and its application.
Following a qualitative research approach combined with results from a systematic literature review, the results presented in this dissertation offer a structured understanding of Design Thinking impact. The results are structured along two main perspectives of impact: the individual and the organizational perspective. First, insights from qualitative data analysis demonstrate that measuring and assessing the impact of Design Thinking is currently one central challenge for Design Thinking practitioners in organizations. Second, the interview data revealed several effects Design Thinking has on individuals, demonstrating how Design Thinking can impact boundary management behaviors and enable employees to craft their jobs more actively.
Contributing to innovation management research, the work presented in this dissertation systematically explains the Design Thinking impact, allowing other researchers to both locate and integrate their work better. The results of this research advance the theoretical rigor of Design Thinking impact research, offering multiple theoretical underpinnings explaining the variety of Design Thinking impact. Furthermore, this dissertation contains three specific propositions on how Design Thinking creates an impact: Design Thinking creates an impact through integration, enablement, and engagement. Integration refers to how Design Thinking enables organizations through effectively combining things, such as for example fostering balance between exploitation and exploration activities. Through Engagement, Design Thinking impacts organizations involving users and other relevant stakeholders in their work. Moreover, Design Thinking creates impact through Enablement, making it possible for individuals to enact a specific behavior or experience certain states.
By synthesizing multiple theoretical streams into these three overarching themes, the results of this research can help bridge disciplinary boundaries, for example between business, psychology and design, and enhance future collaborative research. Practitioners benefit from the results as multiple desirable outcomes are detailed in this thesis, such as successful individual job crafting behaviors, which can be expected from practicing Design Thinking. This allows practitioners to enact more evidence-based decision-making concerning Design Thinking implementation. Overall, considering multiple levels of impact as well as a broad range of theoretical underpinnings are paramount to understanding and fostering Design Thinking impact.
Defining the metaverse
(2023)
The term Metaverse is emerging as a result of the late push by multinational technology conglomerates and a recent surge of interest in Web 3.0, Blockchain, NFT, and Cryptocurrencies. From a scientific point of view, there is no definite consensus on what the Metaverse will be like. This paper collects, analyzes, and synthesizes scientific definitions and the accompanying major characteristics of the Metaverse using the methodology of a Systematic Literature Review (SLR). Two revised definitions for the Metaverse are presented, both condensing the key attributes, where the first one is rather simplistic holistic describing “a three-dimensional online environment in which users represented by avatars interact with each other in virtual spaces decoupled from the real physical world”. In contrast, the second definition is specified in a more detailed manner in the paper and further discussed. These comprehensive definitions offer specialized and general scholars an application within and beyond the scientific context of the system science, information system science, computer science, and business informatics, by also introducing open research challenges. Furthermore, an outlook on the social, economic, and technical implications is given, and the preconditions that are necessary for a successful implementation are discussed.
In the last two decades, process mining has developed from a niche
discipline to a significant research area with considerable impact on academia and industry. Process mining enables organisations to identify the running business processes from historical execution data. The first requirement of any process mining technique is an event log, an artifact that represents concrete business process executions in the form of sequence of events. These logs can be extracted from the organization's information systems and are used by process experts to retrieve deep insights from the organization's running processes. Considering the events pertaining to such logs, the process models can be automatically discovered and enhanced or annotated with performance-related information. Besides behavioral information, event logs contain domain specific data, albeit implicitly. However, such data are usually overlooked and, thus, not utilized to their full potential.
Within the process mining area, we address in this thesis the research gap of discovering, from event logs, the contextual information that cannot be captured by applying existing process mining techniques. Within this research gap, we identify four key problems and tackle them by looking at an event log from different angles. First, we address the problem of deriving an event log in the absence of a proper database access and domain knowledge. The second problem is related to the under-utilization of the implicit domain knowledge present in an event log that can increase the understandability of the discovered process model. Next, there is a lack of a holistic representation of the historical data manipulation at the process model level of abstraction. Last but not least, each process model presumes to be independent of other process models when discovered from an event log, thus, ignoring possible data dependencies between processes within an organization.
For each of the problems mentioned above, this thesis proposes a dedicated method. The first method provides a solution to extract an event log only from the transactions performed on the database that are stored in the form of redo logs. The second method deals with discovering the underlying data model that is implicitly embedded in the event log, thus, complementing the discovered process model with important domain knowledge information. The third method captures, on the process model level, how the data affects the running process instances. Lastly, the fourth method is about the discovery of the relations between business processes (i.e., how they exchange data) from a set of event logs and explicitly representing such complex interdependencies in a business process architecture.
All the methods introduced in this thesis are implemented as a prototype and their feasibility is proven by being applied on real-life event logs.
In model-driven engineering, the adaptation of large software systems with dynamic structure is enabled by architectural runtime models. Such a model represents an abstract state of the system as a graph of interacting components. Every relevant change in the system is mirrored in the model and triggers an evaluation of model queries, which search the model for structural patterns that should be adapted. This thesis focuses on a type of runtime models where the expressiveness of the model and model queries is extended to capture past changes and their timing. These history-aware models and temporal queries enable more informed decision-making during adaptation, as they support the formulation of requirements on the evolution of the pattern that should be adapted. However, evaluating temporal queries during adaptation poses significant challenges. First, it implies the capability to specify and evaluate requirements on the structure, as well as the ordering and timing in which structural changes occur. Then, query answers have to reflect that the history-aware model represents the architecture of a system whose execution may be ongoing, and thus answers may depend on future changes. Finally, query evaluation needs to be adequately fast and memory-efficient despite the increasing size of the history---especially for models that are altered by numerous, rapid changes.
The thesis presents a query language and a querying approach for the specification and evaluation of temporal queries. These contributions aim to cope with the challenges of evaluating temporal queries at runtime, a prerequisite for history-aware architectural monitoring and adaptation which has not been systematically treated by prior model-based solutions. The distinguishing features of our contributions are: the specification of queries based on a temporal logic which encodes structural patterns as graphs; the provision of formally precise query answers which account for timing constraints and ongoing executions; the incremental evaluation which avoids the re-computation of query answers after each change; and the option to discard history that is no longer relevant to queries. The query evaluation searches the model for occurrences of a pattern whose evolution satisfies a temporal logic formula. Therefore, besides model-driven engineering, another related research community is runtime verification. The approach differs from prior logic-based runtime verification solutions by supporting the representation and querying of structure via graphs and graph queries, respectively, which is more efficient for queries with complex patterns. We present a prototypical implementation of the approach and measure its speed and memory consumption in monitoring and adaptation scenarios from two application domains, with executions of an increasing size. We assess scalability by a comparison to the state-of-the-art from both related research communities. The implementation yields promising results, which pave the way for sophisticated history-aware self-adaptation solutions and indicate that the approach constitutes a highly effective technique for runtime monitoring on an architectural level.
Most machine learning methods provide only point estimates when being queried to predict on new data. This is problematic when the data is corrupted by noise, e.g. from imperfect measurements, or when the queried data point is very different to the data that the machine learning model has been trained with. Probabilistic modelling in machine learning naturally equips predictions with corresponding uncertainty estimates which allows a practitioner to incorporate information about measurement noise into the modelling process and to know when not to trust the predictions. A well-understood, flexible probabilistic framework is provided by Gaussian processes that are ideal as building blocks of probabilistic models. They lend themself naturally to the problem of regression, i.e., being given a set of inputs and corresponding observations and then predicting likely observations for new unseen inputs, and can also be adapted to many more machine learning tasks. However, exactly inferring the optimal parameters of such a Gaussian process model (in a computationally tractable manner) is only possible for regression tasks in small data regimes. Otherwise, approximate inference methods are needed, the most prominent of which is variational inference.
In this dissertation we study models that are composed of Gaussian processes embedded in other models in order to make those more flexible and/or probabilistic. The first example are deep Gaussian processes which can be thought of as a small network of Gaussian processes and which can be employed for flexible regression. The second model class that we study are Gaussian process state-space models. These can be used for time-series modelling, i.e., the task of being given a stream of data ordered by time and then predicting future observations. For both model classes the state-of-the-art approaches offer a trade-off between expressive models and computational properties (e.g. speed or convergence properties) and mostly employ variational inference. Our goal is to improve inference in both models by first getting a deep understanding of the existing methods and then, based on this, to design better inference methods. We achieve this by either exploring the existing trade-offs or by providing general improvements applicable to multiple methods.
We first provide an extensive background, introducing Gaussian processes and their sparse (approximate and efficient) variants. We continue with a description of the models under consideration in this thesis, deep Gaussian processes and Gaussian process state-space models, including detailed derivations and a theoretical comparison of existing methods.
Then we start analysing deep Gaussian processes more closely: Trading off the properties (good optimisation versus expressivity) of state-of-the-art methods in this field, we propose a new variational inference based approach. We then demonstrate experimentally that our new algorithm leads to better calibrated uncertainty estimates than existing methods.
Next, we turn our attention to Gaussian process state-space models, where we closely analyse the theoretical properties of existing methods.The understanding gained in this process leads us to propose a new inference scheme for general Gaussian process state-space models that incorporates effects on multiple time scales. This method is more efficient than previous approaches for long timeseries and outperforms its comparison partners on data sets in which effects on multiple time scales (fast and slowly varying dynamics) are present.
Finally, we propose a new inference approach for Gaussian process state-space models that trades off the properties of state-of-the-art methods in this field. By combining variational inference with another approximate inference method, the Laplace approximation, we design an efficient algorithm that outperforms its comparison partners since it achieves better calibrated uncertainties.
Today, point clouds are among the most important categories of spatial data, as they constitute digital 3D models of the as-is reality that can be created at unprecedented speed and precision. However, their unique properties, i.e., lack of structure, order, or connectivity information, necessitate specialized data structures and algorithms to leverage their full precision. In particular, this holds true for the interactive visualization of point clouds, which requires to balance hardware limitations regarding GPU memory and bandwidth against a naturally high susceptibility to visual artifacts.
This thesis focuses on concepts, techniques, and implementations of robust, scalable, and portable 3D visualization systems for massive point clouds. To that end, a number of rendering, visualization, and interaction techniques are introduced, that extend several basic strategies to decouple rendering efforts and data management: First, a novel visualization technique that facilitates context-aware filtering, highlighting, and interaction within point cloud depictions. Second, hardware-specific optimization techniques that improve rendering performance and image quality in an increasingly diversified hardware landscape. Third, natural and artificial locomotion techniques for nausea-free exploration in the context of state-of-the-art virtual reality devices. Fourth, a framework for web-based rendering that enables collaborative exploration of point clouds across device ecosystems and facilitates the integration into established workflows and software systems.
In cooperation with partners from industry and academia, the practicability and robustness of the presented techniques are showcased via several case studies using representative application scenarios and point cloud data sets. In summary, the work shows that the interactive visualization of point clouds can be implemented by a multi-tier software architecture with a number of domain-independent, generic system components that rely on optimization strategies specific to large point clouds. It demonstrates the feasibility of interactive, scalable point cloud visualization as a key component for distributed IT solutions that operate with spatial digital twins, providing arguments in favor of using point clouds as a universal type of spatial base data usable directly for visualization purposes.
The amount of data stored in databases and the complexity of database workloads are ever- increasing. Database management systems (DBMSs) offer many configuration options, such as index creation or unique constraints, which must be adapted to the specific instance to efficiently process large volumes of data. Currently, such database optimization is complicated, manual work performed by highly skilled database administrators (DBAs). In cloud scenarios, manual database optimization even becomes infeasible: it exceeds the abilities of the best DBAs due to the enormous number of deployed DBMS instances (some providers maintain millions of instances), missing domain knowledge resulting from data privacy requirements, and the complexity of the configuration tasks.
Therefore, we investigate how to automate the configuration of DBMSs efficiently with the help of unsupervised database optimization. While there are numerous configuration options, in this thesis, we focus on automatic index selection and the use of data dependencies, such as functional dependencies, for query optimization. Both aspects have an extensive performance impact and complement each other by approaching unsupervised database optimization from different perspectives.
Our contributions are as follows: (1) we survey automated state-of-the-art index selection algorithms regarding various criteria, e.g., their support for index interaction. We contribute an extensible platform for evaluating the performance of such algorithms with industry-standard datasets and workloads. The platform is well-received by the community and has led to follow-up research. With our platform, we derive the strengths and weaknesses of the investigated algorithms. We conclude that existing solutions often have scalability issues and cannot quickly determine (near-)optimal solutions for large problem instances. (2) To overcome these limitations, we present two new algorithms. Extend determines (near-)optimal solutions with an iterative heuristic. It identifies the best index configurations for the evaluated benchmarks. Its selection runtimes are up to 10 times lower compared with other near-optimal approaches. SWIRL is based on reinforcement learning and delivers solutions instantly. These solutions perform within 3 % of the optimal ones. Extend and SWIRL are available as open-source implementations.
(3) Our index selection efforts are complemented by a mechanism that analyzes workloads to determine data dependencies for query optimization in an unsupervised fashion. We describe and classify 58 query optimization techniques based on functional, order, and inclusion dependencies as well as on unique column combinations. The unsupervised mechanism and three optimization techniques are implemented in our open-source research DBMS Hyrise. Our approach reduces the Join Order Benchmark’s runtime by 26 % and accelerates some TPC-DS queries by up to 58 times.
Additionally, we have developed a cockpit for unsupervised database optimization that allows interactive experiments to build confidence in such automated techniques. In summary, our contributions improve the performance of DBMSs, support DBAs in their work, and enable them to contribute their time to other, less arduous tasks.
In this bachelor’s thesis I implement the automatic theorem prover nanoCoP-Ω. This system is the result of porting arithmetic and equality handling procedures first introduced in the automatic theorem prover with arithmetic leanCoP-Ω into the similar system nanoCoP 2.0. To understand these procedures, I first introduce the mathematical background to both automatic theorem proving and arithmetic expressions. I present the predecessor projects leanCoP, nanoCoP and leanCoP-Ω, out of which nanCoP-Ω was developed. This is followed by an extensive description of the concepts the non-clausal connection calculus needed to be extended by, to allow for proving arithmetic expressions and equalities, as well as of their implementation into nanoCoP-Ω. An extensive comparison between both the runtimes and the number of solved problems of the systems nanoCoP-Ω and leanCoP-Ω was made. I come to the conclusion, that nanoCoP-Ω is considerably faster than leanCoP-Ω for small problems, though less well suited for larger problems. Additionally, I was able to construct a non-theorem that nanoCoP-Ω generates a false proof for. I discuss how this pressing issue could be resolved, as well as some possible optimizations and expansions of the system.
Residential segregation is a widespread phenomenon that can be observed in almost every major city. In these urban areas, residents with different ethnical or socioeconomic backgrounds tend to form homogeneous clusters. In Schelling’s classical segregation model two types of agents are placed on a grid. An agent is content with its location if the fraction of its neighbors, which have the same type as the agent, is at least 𝜏, for some 0 < 𝜏 ≤ 1. Discontent agents simply swap their location with a randomly chosen other discontent agent or jump to a random empty location. The model gives a coherent explanation of how clusters can form even if all agents are tolerant, i.e., if they agree to live in mixed neighborhoods. For segregation to occur, all it needs is a slight bias towards agents preferring similar neighbors.
Although the model is well studied, previous research focused on a random process point of view. However, it is more realistic to assume instead that the agents strategically choose where to live. We close this gap by introducing and analyzing game-theoretic models of Schelling segregation, where rational agents strategically choose their locations.
As the first step, we introduce and analyze a generalized game-theoretic model that allows more than two agent types and more general underlying graphs modeling the residential area. We introduce different versions of Swap and Jump Schelling Games. Swap Schelling Games assume that every vertex of the underlying graph serving as a residential area is occupied by an agent and pairs of discontent agents can swap their locations, i.e., their occupied vertices, to increase their utility. In contrast, for the Jump Schelling Game, we assume that there exist empty vertices in the graph and agents can jump to these vacant vertices if this increases their utility. We show that the number of agent types as well as the structure of underlying graph heavily influence the dynamic properties and the tractability of finding an optimal strategy profile.
As a second step, we significantly deepen these investigations for the swap version with 𝜏 = 1 by studying the influence of the underlying topology modeling the residential area on the existence of equilibria, the Price of Anarchy, and the dynamic properties. Moreover, we restrict the movement of agents locally. As a main takeaway, we find that both aspects influence the existence and the quality of stable states.
Furthermore, also for the swap model, we follow sociological surveys and study, asking the same core game-theoretic questions, non-monotone singlepeaked utility functions instead of monotone ones, i.e., utility functions that are not monotone in the fraction of same-type neighbors. Our results clearly show that moving from monotone to non-monotone utilities yields novel structural properties and different results in terms of existence and quality of stable states.
In the last part, we introduce an agent-based saturated open-city variant, the Flip Schelling Process, in which agents, based on the predominant type in their neighborhood, decide whether to change their types. We provide a general framework for analyzing the influence of the underlying topology on residential segregation and investigate the probability that an edge is monochrome, i.e., that both incident vertices have the same type, on random geometric and Erdős–Rényi graphs. For random geometric graphs, we prove the existence of a constant c > 0 such that the expected fraction of monochrome edges after the Flip Schelling Process is at least 1/2 + c. For Erdős–Rényi graphs, we show the expected fraction of monochrome edges after the Flip Schelling Process is at most 1/2 + o(1).
In this thesis, we investigate language learning in the formalisation of Gold [Gol67]. Here, a learner, being successively presented all information of a target language, conjectures which language it believes to be shown. Once these hypotheses converge syntactically to a correct explanation of the target language, the learning is considered successful. Fittingly, this is termed explanatory learning. To model learning strategies, we impose restrictions on the hypotheses made, for example requiring the conjectures to follow a monotonic behaviour. This way, we can study the impact a certain restriction has on learning.
Recently, the literature shifted towards map charting. Here, various seemingly unrelated restrictions are contrasted, unveiling interesting relations between them. The results are then depicted in maps. For explanatory learning, the literature already provides maps of common restrictions for various forms of data presentation.
In the case of behaviourally correct learning, where the learners are required to converge semantically instead of syntactically, the same restrictions as in explanatory learning have been investigated. However, a similarly complete picture regarding their interaction has not been presented yet.
In this thesis, we transfer the map charting approach to behaviourally correct learning. In particular, we complete the partial results from the literature for many well-studied restrictions and provide full maps for behaviourally correct learning with different types of data presentation. We also study properties of learners assessed important in the literature. We are interested whether learners are consistent, that is, whether their conjectures include the data they are built on. While learners cannot be assumed consistent in explanatory learning, the opposite is the case in behaviourally correct learning. Even further, it is known that learners following different restrictions may be assumed consistent. We contribute to the literature by showing that this is the case for all studied restrictions.
We also investigate mathematically interesting properties of learners. In particular, we are interested in whether learning under a given restriction may be done with strongly Bc-locking learners. Such learners are of particular value as they allow to apply simulation arguments when, for example, comparing two learning paradigms to each other. The literature gives a rich ground on when learners may be assumed strongly Bc-locking, which we complete for all studied restrictions.
Learning the causal structures from observational data is an omnipresent challenge in data science. The amount of observational data available to Causal Structure Learning (CSL) algorithms is increasing as data is collected at high frequency from many data sources nowadays. While processing more data generally yields higher accuracy in CSL, the concomitant increase in the runtime of CSL algorithms hinders their widespread adoption in practice. CSL is a parallelizable problem. Existing parallel CSL algorithms address execution on multi-core Central Processing Units (CPUs) with dozens of compute cores. However, modern computing systems are often heterogeneous and equipped with Graphics Processing Units (GPUs) to accelerate computations. Typically, these GPUs provide several thousand compute cores for massively parallel data processing.
To shorten the runtime of CSL algorithms, we design efficient execution strategies that leverage the parallel processing power of GPUs. Particularly, we derive GPU-accelerated variants of a well-known constraint-based CSL method, the PC algorithm, as it allows choosing a statistical Conditional Independence test (CI test) appropriate to the observational data characteristics.
Our two main contributions are: (1) to reflect differences in the CI tests, we design three GPU-based variants of the PC algorithm tailored to CI tests that handle data with the following characteristics. We develop one variant for data assuming the Gaussian distribution model, one for discrete data, and another for mixed discrete-continuous data and data with non-linear relationships. Each variant is optimized for the appropriate CI test leveraging GPU hardware properties, such as shared or thread-local memory. Our GPU-accelerated variants outperform state-of-the-art parallel CPU-based algorithms by factors of up to 93.4× for data assuming the Gaussian distribution model, up to 54.3× for discrete data, up to 240× for continuous data with non-linear relationships and up to 655× for mixed discrete-continuous data. However, the proposed GPU-based variants are limited to datasets that fit into a single GPU’s memory. (2) To overcome this shortcoming, we develop approaches to scale our GPU-based variants beyond a single GPU’s memory capacity. For example, we design an out-of-core GPU variant that employs explicit memory management to process arbitrary-sized datasets. Runtime measurements on a large gene expression dataset reveal that our out-of-core GPU variant is 364 times faster than a parallel CPU-based CSL algorithm. Overall, our proposed GPU-accelerated variants speed up CSL in numerous settings to foster CSL’s adoption in practice and research.