004 Data processing; Computer science
We present a general approach to planning with incomplete information in Answer Set Programming (ASP). More precisely, we consider the problems of conformant and conditional planning with sensing actions and assumptions. We represent planning problems using a simple formalism where logic programs describe the transition function between states, the initial states and the goal states. For solving planning problems, we use Quantified Answer Set Programming (QASP), an extension of ASP with existential and universal quantifiers over atoms that is analogous to Quantified Boolean Formulas (QBFs). We define the language of quantified logic programs and use it to represent the solutions to different variants of conformant and conditional planning. On the practical side, we present a translation-based QASP solver that converts quantified logic programs into QBFs and then executes a QBF solver, and we experimentally evaluate the approach on conformant and conditional planning benchmarks.
Business processes are often specified in descriptive or normative models. Both types of models should adhere to internal and external regulations, such as company guidelines or laws. Employing compliance checking techniques, it is possible to verify process models against rules. While traditionally compliance checking focuses on well-structured processes, we address case management scenarios. In case management, knowledge workers drive multi-variant and adaptive processes. Our contribution is based on the fragment-based case management approach, which splits a process into a set of fragments. The fragments are synchronized through shared data but can, otherwise, be dynamically instantiated and executed. We formalize case models using Petri nets. We demonstrate the formalization for design-time and run-time compliance checking and present a proof-of-concept implementation. The application of the implemented compliance checking approach to a use case exemplifies its effectiveness while designing a case model. The empirical evaluation on a set of case models for measuring the performance of the approach shows that rules can often be checked in less than a second.
I can see it in your eyes
(2021)
Over the past years, extensive research has been dedicated to developing robust platforms and data-driven dialog models to support long-term human-robot interactions. However, little is known about how people's perception of robots and engagement with them develop over time and how these can be accurately assessed through implicit and continuous measurement techniques. In this paper, we explore this by involving participants in three interaction sessions with multiple days of zero exposure in between. Each session consists of a joint task with a robot as well as two short social chats with it before and after the task. We measure participants' gaze patterns with a wearable eye-tracker and gauge their perception of the robot and engagement with it and the joint task using questionnaires. Results disclose that aversion of gaze in a social chat is an indicator of a robot's uncanniness and that the more people gaze at the robot in a joint task, the worse they perform. In contrast with most HRI literature, our results show that gaze toward an object of shared attention, rather than gaze toward a robotic partner, is the most meaningful predictor of engagement in a joint task. Furthermore, the analyses of gaze patterns in repeated interactions disclose that people's mutual gaze in a social chat develops congruently with their perceptions of the robot over time. These are key findings for the HRI community as they entail that gaze behavior can be used as an implicit measure of people's perception of robots in a social chat and of their engagement and task performance in a joint task.
A simplified run time analysis of the univariate marginal distribution algorithm on LeadingOnes
(2021)
With elementary means, we prove a stronger run time guarantee for the univariate marginal distribution algorithm (UMDA) optimizing the LEADINGONES benchmark function in the desirable regime with low genetic drift. If the population size is at least quasilinear, then, with high probability, the UMDA samples the optimum in a number of iterations that is linear in the problem size divided by the logarithm of the UMDA's selection rate. This improves over the previous guarantee, obtained by Dang and Lehre (2015) via the deep level-based population method, both in terms of the run time and by demonstrating further run time gains from small selection rates. Under similar assumptions, we prove a lower bound that matches our upper bound up to constant factors.
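The setting analyzed in the abstract above can be illustrated with a minimal sketch of the UMDA on LeadingOnes. This is not the authors' code: the parameter names `lam`, `mu`, `max_iters` and `seed`, and the frequency borders 1/n and 1 - 1/n, are assumptions chosen for illustration. The selection rate mentioned in the abstract corresponds to `mu / lam` here.

```python
import random


def leading_ones(x):
    """Count the consecutive 1-bits at the start of the bit list x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def umda(n, lam, mu, max_iters=10_000, seed=0):
    """Run a basic UMDA on LeadingOnes.

    Returns the iteration in which the all-ones string was first sampled,
    or None if it was not sampled within max_iters iterations.
    """
    rng = random.Random(seed)
    lower, upper = 1.0 / n, 1.0 - 1.0 / n  # assumed frequency borders
    p = [0.5] * n  # one marginal frequency per bit position
    for iteration in range(1, max_iters + 1):
        # Sample lam individuals independently from the product distribution.
        population = [
            [1 if rng.random() < p[i] else 0 for i in range(n)]
            for _ in range(lam)
        ]
        population.sort(key=leading_ones, reverse=True)
        if leading_ones(population[0]) == n:
            return iteration
        # Keep the mu best individuals and re-estimate the marginals.
        selected = population[:mu]
        for i in range(n):
            frequency = sum(individual[i] for individual in selected) / mu
            p[i] = min(upper, max(lower, frequency))
    return None
```

A call such as `umda(n=10, lam=100, mu=20)` returns an iteration count or `None`. Varying `mu` at fixed `lam` is one way to observe empirically how the selection rate influences the run time.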
This technical report presents the results of student projects which were prepared during the lecture “Operating Systems II” offered by the “Operating Systems and Middleware” group at HPI in the summer term of 2020. The lecture covered advanced aspects of operating system implementation and architecture on topics such as Virtualization, File Systems and Input/Output Systems. In addition to attending the lecture, the participating students were encouraged to gather practical experience by completing a project on a closely related topic over the course of the semester. The results of 10 selected exceptional projects are covered in this report.
The students have completed hands-on projects on the topics of Operating System Design Concepts and Implementation, Hardware/Software Co-Design, Reverse Engineering, Quantum Computing, Static Source-Code Analysis, Operating Systems History, Application Binary Formats and more. It should be recognized that over the course of the semester all of these projects have achieved outstanding results which went far beyond the scope and the expectations of the lecture, and we would like to thank all participating students for their commitment and their effort in completing their respective projects, as well as their work on compiling this report.
Ethical issues surrounding modern computing technologies play an increasingly important role in the public debate. Yet, ethics still appears either not at all or only to a very small extent in computer science degree programs. This paper provides an argument for the value of ethics beyond a pure responsibility perspective and describes the positive value of ethical debate for future computer scientists. It also provides a systematic analysis of the module handbooks of 67 German universities and shows that there is indeed a lack of ethics in computer science education. Finally, we present a principled design of a compulsory course for undergraduate students.
Digital technology offers significant political, economic, and societal opportunities. At the same time, the notion of digital sovereignty has become a leitmotif in German discourse: the state’s capacity to assume its responsibilities and safeguard society’s – and individuals’ – ability to shape the digital transformation in a self-determined way. The education sector is exemplary for the challenge faced by Germany, and indeed Europe, of harnessing the benefits of digital technology while navigating concerns around sovereignty. It encompasses education as a core public good, a rapidly growing field of business, and growing pools of highly sensitive personal data. The report describes pathways to mitigating the tension between digitalization and sovereignty at three different levels – state, economy, and individual – through the lens of concrete technical projects in the education sector: the HPI Schul-Cloud (state sovereignty), the MERLOT data spaces (economic sovereignty), and the openHPI platform (individual sovereignty).
A degree course in IT and business administration solely for women (FIW) has been offered since 2009 at the HTW Berlin – University of Applied Sciences. This contribution discusses student motivations for enrolling in such a women only degree course and gives details of our experience over recent years. In particular, the approach to attracting new female students is described and the composition of the intake is discussed. It is shown that the women-only setting together with other factors can attract a new clientele for computer science.
Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.
We consider the subset selection problem for a function $f$ with a constraint bound $B$ that changes over time. Within the area of submodular optimization, various greedy approaches are commonly used. For dynamic environments we observe that the adaptive variants of these greedy approaches are not able to maintain their approximation quality. Investigating the recently introduced POMC Pareto optimization approach, we show that this algorithm efficiently computes a $\phi = (\alpha_f/2)(1 - 1/e^{\alpha_f})$-approximation, where $\alpha_f$ is the submodularity ratio of $f$, for each possible constraint bound $b \le B$. Furthermore, we show that POMC is able to adapt its set of solutions quickly in the case that $B$ increases. Our experimental investigations for influence maximization in social networks show the advantage of POMC over generalized greedy algorithms. We also consider EAMC, a new evolutionary algorithm with a polynomial expected time guarantee to maintain the $\phi$ approximation ratio, and NSGA-II with two different population sizes as an advanced multi-objective optimization algorithm, to demonstrate their challenges in optimizing the maximum coverage problem. Our empirical analysis shows that, within the same number of evaluations, POMC is able to perform as well as NSGA-II under a linear constraint, while EAMC performs significantly worse than all considered algorithms in most cases.
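The Pareto optimization idea behind POMC can be sketched for maximum coverage under a linear cost constraint. This is a simplified illustration, not the variant studied in the paper: over-budget solutions are simply discarded, and the names `pomc`, `best_for_bound`, `iterations` and `seed` are hypothetical. The population keeps one solution per non-dominated (coverage, cost) pair, which is why a single run provides a solution for every bound b up to the budget.

```python
import random


def coverage(sets, x):
    """Number of distinct elements covered by the sets selected in x."""
    covered = set()
    for chosen, s in zip(x, sets):
        if chosen:
            covered |= s
    return len(covered)


def cost(costs, x):
    """Total cost of the sets selected in x."""
    return sum(c for chosen, c in zip(x, costs) if chosen)


def dominates(a, b):
    """True if a = (coverage, cost) is at least as good as b in both objectives."""
    return a[0] >= b[0] and a[1] <= b[1]


def pomc(sets, costs, budget, iterations=2000, seed=0):
    """Simplified Pareto optimization: maximize coverage, minimize cost."""
    rng = random.Random(seed)
    n = len(sets)
    population = {tuple([0] * n): (0, 0)}  # solution -> (coverage, cost)
    for _ in range(iterations):
        parent = rng.choice(sorted(population))
        # Standard bit-wise mutation: flip each bit with probability 1/n.
        child = tuple(b ^ 1 if rng.random() < 1.0 / n else b for b in parent)
        child_cost = cost(costs, child)
        if child_cost > budget:
            continue  # simplification: infeasible offspring are dropped
        child_value = (coverage(sets, child), child_cost)
        # Reject the child if some member is strictly better.
        if any(dominates(v, child_value) and v != child_value
               for v in population.values()):
            continue
        # Remove members the child is at least as good as, then add it.
        population = {x: v for x, v in population.items()
                      if not dominates(child_value, v)}
        population[child] = child_value
    return population


def best_for_bound(population, b):
    """Best coverage among stored solutions with cost at most b."""
    return max(v[0] for v in population.values() if v[1] <= b)
```

After a run with `budget=B`, calling `best_for_bound(population, b)` for any `b <= B` reads off the best stored solution for that smaller bound without rerunning the algorithm.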
The Security Operations Center (SOC) represents a specialized unit responsible for managing security within enterprises. To aid in its responsibilities, the SOC relies heavily on a Security Information and Event Management (SIEM) system that functions as a centralized repository for all security-related data, providing a comprehensive view of the organization's security posture. Due to the ability to offer such insights, SIEMs are considered indispensable tools facilitating SOC functions, such as monitoring, threat detection, and incident response.
Despite advancements in big data architectures and analytics, most SIEMs fall short of keeping pace. Architecturally, they function merely as log search engines, lacking the support for distributed large-scale analytics. Analytically, they rely on rule-based correlation, neglecting the adoption of more advanced data science and machine learning techniques.
This thesis first proposes a blueprint for next-generation SIEM systems that emphasize distributed processing and multi-layered storage to enable data mining at a big data scale. Next, with the architectural support, it introduces two data mining approaches for advanced threat detection as part of SOC operations.
First, a novel graph mining technique that formulates threat detection within the SIEM system as a large-scale graph mining and inference problem, built on the principles of guilt-by-association and exempt-by-reputation. The approach entails the construction of a Heterogeneous Information Network (HIN) that models shared characteristics and associations among entities extracted from SIEM-related events/logs. Thereon, a novel graph-based inference algorithm is used to infer a node's maliciousness score based on its associations with other entities in the HIN. Second, an innovative outlier detection technique that imitates a SOC analyst's reasoning process to find anomalies/outliers. The approach emphasizes explainability and simplicity, achieved by combining the output of simple context-aware univariate submodels that calculate an outlier score for each entry.
Both approaches were tested in academic and real-world settings, demonstrating high performance when compared to other algorithms as well as practicality alongside a large enterprise's SIEM system.
This thesis establishes the foundation for next-generation SIEM systems that can enhance today's SOCs and facilitate the transition from human-centric to data-driven security operations.
Diversity is a term that is broadly used and challenging for informatics research, development and education. Diversity concerns may relate to unequal participation, knowledge and methodology, curricula, institutional planning etc. For a lot of these areas, measures, guidelines and best practices on diversity awareness exist. A systemic, sustainable impact of diversity measures on informatics is still largely missing. In this paper I explore what working with diversity and gender concepts in informatics entails, what the main challenges are and provide thoughts for improvement. The paper includes definitions of diversity and intersectionality, reflections on the disciplinary basis of informatics and practical implications of integrating diversity in informatics research and development. In the final part, two concepts from the social sciences and the humanities, the notion of “third space”/hybridity and the notion of “feminist ethics of care”, serve as a lens to foster more sustainable ways of working with diversity in informatics.
Recently, initial conflicts were introduced in the framework of M-adhesive categories as an important optimization of critical pairs. In particular, they represent a proper subset such that each conflict is represented in a minimal context by a unique initial one. The theory of critical pairs has been extended in the framework of M-adhesive categories to rules with nested application conditions (ACs), restricting the applicability of a rule and generalizing the well-known negative application conditions. A notion of initial conflicts for rules with ACs does not exist yet.
In this paper, on the one hand, we extend the theory of initial conflicts in the framework of M-adhesive categories to transformation rules with ACs. They represent a proper subset again of critical pairs for rules with ACs, and represent each conflict in a minimal context uniquely. They are moreover symbolic because we can show that in general no finite and complete set of conflicts for rules with ACs exists. On the other hand, we show that critical pairs are minimally M-complete, whereas initial conflicts are minimally complete. Finally, we introduce important special cases of rules with ACs for which we can obtain finite, minimally (M-)complete sets of conflicts.
In this bachelor’s thesis I implement the automatic theorem prover nanoCoP-Ω. This system is the result of porting arithmetic and equality handling procedures first introduced in the automatic theorem prover with arithmetic leanCoP-Ω into the similar system nanoCoP 2.0. To understand these procedures, I first introduce the mathematical background to both automatic theorem proving and arithmetic expressions. I present the predecessor projects leanCoP, nanoCoP and leanCoP-Ω, out of which nanoCoP-Ω was developed. This is followed by an extensive description of the concepts by which the non-clausal connection calculus needed to be extended to allow for proving arithmetic expressions and equalities, as well as of their implementation in nanoCoP-Ω. An extensive comparison between both the runtimes and the number of solved problems of the systems nanoCoP-Ω and leanCoP-Ω was made. I conclude that nanoCoP-Ω is considerably faster than leanCoP-Ω for small problems, though less well suited for larger problems. Additionally, I was able to construct a non-theorem for which nanoCoP-Ω generates a false proof. I discuss how this pressing issue could be resolved, as well as some possible optimizations and expansions of the system.
Bitcoin is gaining traction as an alternative store of value. Its market capitalization transcends all other cryptocurrencies in the market. But its high monetary value also makes it an attractive target to cyber criminal actors. Hacking campaigns usually target an ecosystem's weakest points. In Bitcoin, the exchange platforms are one of them. Each exchange breach is a threat not only to direct victims, but to the credibility of Bitcoin's entire ecosystem. Based on an extensive analysis of 36 breaches of Bitcoin exchanges, we show the attack patterns used to exploit Bitcoin exchange platforms using an industry standard for reporting intelligence on cyber security breaches. Based on this we are able to provide an overview of the most common attack vectors, showing that all except three hacks were possible due to relatively lax security. We show that while the security regimen of Bitcoin exchanges is subpar compared to other financial service providers, the use of stolen credentials, which does not require any hacking, is decreasing. We also show that the amount of BTC taken during a breach is decreasing, as is the number of exchanges that terminate after being breached. Furthermore, we show that the overall security posture has improved, but still has major flaws. To discover adversarial methods post-breach, we have analyzed two cases of BTC laundering. Through this analysis we provide insight into how exchange platforms with lax cyber security even further increase the intermediary risk introduced by them into the Bitcoin ecosystem.
According to the personalization principle, addressing learners by means of a personalized compared to a nonpersonalized message can foster learning. Interestingly, though, a recent study found that the personalization principle can invert for aversive contents. The present study investigated whether the negative effect of a personalized message for an aversive content can be compensated when learners are in a happy mood. It was hypothesized that the negative effect of a personalized compared to a nonpersonalized message would only be observable for participants in a sad mood, while for participants in a happy mood a personalized message should be beneficial. A 2 x 2 between-subject design with mood (happy vs. sad) and personalization (personalized vs. nonpersonalized message) was used (N = 125 university students). Mood was experimentally varied prior to learning. Learning outcomes were measured by a retention and a transfer test. Results were essentially in line with the assumption: For participants in the sad mood condition, a negative effect of a personalized message was observable for retention and transfer. For participants in the happy mood condition, a positive effect of a personalized message was observable for retention, but no effect for transfer. Note that the manipulation check measure for the mood induction procedure did not detect differences between conditions; this may be due to a shortcoming of the measure used (as indicated by an additional evaluation study). The study emphasizes the importance of considering the inherent emotional content of a topic, such as its aversive nature, since the emotional content of a topic can be a boundary condition for design principles in multimedia learning. The study also highlights the complex interplay of externally induced and inherently arising emotions.
Process mining (PM) has established itself in recent years as a main method for visualizing and analyzing processes. However, the identification of knowledge has not been addressed adequately because PM aims solely at data-driven discovering, monitoring, and improving real-world processes from event logs available in various information systems. The following paper, therefore, outlines a novel systematic analysis view on tools for data-driven and machine learning (ML)-based identification of knowledge-intensive target processes. To support the effectiveness of the identification process, the main contributions of this study are (1) to design a procedure for a systematic review and analysis for the selection of relevant dimensions, (2) to identify different categories of dimensions as evaluation metrics to select source systems, algorithms, and tools for PM and ML as well as include them in a multi-dimensional grid box model, (3) to select and assess the most relevant dimensions of the model, (4) to identify and assess source systems, algorithms, and tools in order to find evidence for the selected dimensions, and (5) to assess the relevance and applicability of the conceptualization and design procedure for tool selection in data-driven and ML-based process mining research.
Image feature detection is a key task in computer vision. Scale Invariant Feature Transform (SIFT) is a prevalent and well known algorithm for robust feature detection. However, it is computationally demanding and software implementations are not applicable for real-time performance. In this paper, a versatile and pipelined hardware implementation is proposed, that is capable of computing keypoints and rotation invariant descriptors on-chip. All computations are performed in single precision floating-point format which makes it possible to implement the original algorithm with little alteration. Various rotation resolutions and filter kernel sizes are supported for images of any resolution up to ultra-high definition. For full high definition images, 84 fps can be processed. Ultra high definition images can be processed at 21 fps.
Perfectionism is a personality disposition characterized by setting extremely high performance-standards coupled with critical self-evaluations. Often conceived as positive, perfectionism can yield not only beneficial but also deleterious outcomes ranging from anxiety to burnout. In this proposal, we set out to investigate the role of the technology and, particularly, social media in individuals’ strivings for perfection. We lay down theoretical bases for the possibility that social media plays a role in the development of perfectionism. To empirically test the hypothesized relationship, we propose a comprehensive study design based on the experience sampling method. Lastly, we provide an overview of the planned analysis and future steps.
Coming back for more
(2022)
Recent spikes in social networking site (SNS) usage times have launched investigations into reasons for excessive SNS usage. Extending research on social factors (i.e., fear of missing out), this study considers the News Feed setup. More specifically, we suggest that the order of the News Feed (chronological vs. algorithmically assembled posts) affects usage behaviors. Against the background of the variable reward schedule, this study hypothesizes that the different orders exert serendipity differently. Serendipity, termed as unexpected lucky encounters with information, resembles variable rewards. Studies have evidenced a relation between variable rewards and excessive behaviors. Similarly, we hypothesize that order-induced serendipitous encounters affect SNS usage times and explore this link in a two-wave survey with an experimental setup (users using either chronological or algorithmic News Feeds). While theoretically extending explanations for increased SNS usage times by considering the News Feed order, practically the study will offer recommendations for relevant stakeholders.
Active use of social networking sites (SNSs) has long been assumed to benefit users' well-being. However, this established hypothesis is increasingly being challenged, with scholars criticizing its lack of empirical support and the imprecise conceptualization of active use. Nevertheless, with considerable heterogeneity among existing studies on the hypothesis and causal evidence still limited, a final verdict on its robustness is still pending. To contribute to this ongoing debate, we conducted a week-long randomized control trial with N = 381 adult Instagram users recruited via Prolific. Specifically, we tested how active SNS use, operationalized as picture postings on Instagram, affects different dimensions of well-being. The results depicted a positive effect on users' positive affect but null findings for other well-being outcomes. The findings broadly align with the recent criticism against the active use hypothesis and support the call for a more nuanced view on the impact of SNSs.
Lay Summary: Active use of social networking sites (SNSs) has long been assumed to benefit users' well-being. However, this established assumption is increasingly being challenged, with scholars criticizing its lack of empirical support and the imprecise conceptualization of active use. Nevertheless, with great diversity among conducted studies on the hypothesis and a lack of causal evidence, a final verdict on its viability is still pending. To contribute to this ongoing debate, we conducted a week-long experimental investigation with 381 adult Instagram users. Specifically, we tested how posting pictures on Instagram affects different aspects of well-being. The results of this study depicted a positive effect of posting Instagram pictures on users' experienced positive emotions but no effects on other aspects of well-being. The findings broadly align with the recent criticism against the active use hypothesis and support the call for a more nuanced view on the impact of SNSs on users.
We introduce a new measure of descriptional complexity on finite automata, called the number of active states. Roughly speaking, the number of active states of an automaton A on input w counts the number of different states visited during the most economic computation of the automaton A for the word w. This concept generalizes to finite automata and regular languages in a straightforward way. We show that the number of active states of both finite automata and regular languages is computable, even with respect to nondeterministic finite automata. We further compare the number of active states to related measures for regular languages. In particular, we show incomparability to the radius of regular languages and that the difference between the number of active states and the total number of states needed in finite automata for a regular language can be of exponential order.
Advances in Web 2.0 technologies have led to the widespread assimilation of electronic commerce platforms as an innovative shopping method and an alternative to traditional shopping. However, due to pro-technology bias, scholars focus more on adopting technology, and slightly less attention has been given to the impact of electronic word of mouth (eWOM) on customers’ intention to use social commerce. This study addresses the gap by examining the intention through exploring the effect of eWOM on males’ and females’ intentions and identifying the mediation of perceived crowding. To this end, we adopted a dual-stage multi-group structural equation modeling and artificial neural network (SEM-ANN) approach. We successfully extended the eWOM concept by integrating negative and positive factors and perceived crowding. The results reveal the causal and non-compensatory relationships between the constructs. The variables supported by the SEM analysis are adopted as the ANN model’s input neurons. According to the natural significance obtained from the ANN approach, males’ intentions to accept social commerce are related mainly to helping the company, followed by core functionalities. In contrast, females are highly influenced by technical aspects and mishandling. The ANN model predicts customers’ intentions to use social commerce with an accuracy of 97%. We discuss the theoretical and practical implications of increasing customers’ intention toward social commerce channels among consumers based on our findings.
User Experience (UX) describes the holistic experience of a user before, during, and after interaction with a platform, product, or service. UX adds value and attraction to their sole functionality and is therefore highly relevant for firms. The increased interest in UX has produced a vast amount of scholarly research since 1983. The research field is, therefore, complex and scattered. Conducting a bibliometric analysis, we aim at structuring the field quantitatively and rather abstractly. We employed citation analyses, co-citation analyses, and content analyses to evaluate productivity and impact of extant research. We suggest that future research should focus more on business and management related topics.
Terminology is a critical instrument for each researcher. Different terminologies for the same research object may arise in different research communities. Because of this inconsistency, many synergistic effects are lost. Theories and models will be more understandable and reusable if a common terminology is applied. This paper examines the terminological (in)consistency in the research field of job-shop scheduling by means of a literature review. There is an enormous variety in the choice of terms and mathematical notation for the same concept. The comparability, reusability and combinability of scheduling methods are unnecessarily hampered by the arbitrary use of homonyms and synonyms. The acceptance in the community of the variables and notation forms used is shown by means of a compliance quotient. This is proven by the evaluation of 240 scientific publications on planning methods.
In model-driven engineering, the adaptation of large software systems with dynamic structure is enabled by architectural runtime models. Such a model represents an abstract state of the system as a graph of interacting components. Every relevant change in the system is mirrored in the model and triggers an evaluation of model queries, which search the model for structural patterns that should be adapted. This thesis focuses on a type of runtime models where the expressiveness of the model and model queries is extended to capture past changes and their timing. These history-aware models and temporal queries enable more informed decision-making during adaptation, as they support the formulation of requirements on the evolution of the pattern that should be adapted. However, evaluating temporal queries during adaptation poses significant challenges. First, it implies the capability to specify and evaluate requirements on the structure, as well as the ordering and timing in which structural changes occur. Then, query answers have to reflect that the history-aware model represents the architecture of a system whose execution may be ongoing, and thus answers may depend on future changes. Finally, query evaluation needs to be adequately fast and memory-efficient despite the increasing size of the history, especially for models that are altered by numerous, rapid changes.
The thesis presents a query language and a querying approach for the specification and evaluation of temporal queries. These contributions aim to cope with the challenges of evaluating temporal queries at runtime, a prerequisite for history-aware architectural monitoring and adaptation which has not been systematically treated by prior model-based solutions. The distinguishing features of our contributions are: the specification of queries based on a temporal logic which encodes structural patterns as graphs; the provision of formally precise query answers which account for timing constraints and ongoing executions; the incremental evaluation which avoids the re-computation of query answers after each change; and the option to discard history that is no longer relevant to queries. The query evaluation searches the model for occurrences of a pattern whose evolution satisfies a temporal logic formula. Therefore, besides model-driven engineering, another related research community is runtime verification. The approach differs from prior logic-based runtime verification solutions by supporting the representation and querying of structure via graphs and graph queries, respectively, which is more efficient for queries with complex patterns. We present a prototypical implementation of the approach and measure its speed and memory consumption in monitoring and adaptation scenarios from two application domains, with executions of an increasing size. We assess scalability by a comparison to the state-of-the-art from both related research communities. The implementation yields promising results, which pave the way for sophisticated history-aware self-adaptation solutions and indicate that the approach constitutes a highly effective technique for runtime monitoring on an architectural level.
Most machine learning methods provide only point estimates when queried to predict on new data. This is problematic when the data is corrupted by noise, e.g. from imperfect measurements, or when the queried data point is very different from the data that the machine learning model has been trained with. Probabilistic modelling in machine learning naturally equips predictions with corresponding uncertainty estimates, which allows a practitioner to incorporate information about measurement noise into the modelling process and to know when not to trust the predictions. A well-understood, flexible probabilistic framework is provided by Gaussian processes, which are ideal as building blocks of probabilistic models. They lend themselves naturally to the problem of regression, i.e., being given a set of inputs and corresponding observations and then predicting likely observations for new unseen inputs, and can also be adapted to many more machine learning tasks. However, exactly inferring the optimal parameters of such a Gaussian process model (in a computationally tractable manner) is only possible for regression tasks in small data regimes. Otherwise, approximate inference methods are needed, the most prominent of which is variational inference.
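The regression setting described above can be illustrated with a minimal sketch of exact Gaussian process prediction for scalar inputs. It is not taken from the thesis: the squared-exponential kernel, the fixed hyperparameters, the zero prior mean, and the helper names (`rbf`, `solve`, `gp_predict`) are all assumptions chosen for illustration, and the plain linear solve is only suitable for the small data regimes mentioned in the text.

```python
import math


def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel for scalar inputs (an assumed choice)."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_predict(xs, ys, x_new, noise=1e-2):
    """Posterior mean and latent-function variance of a zero-mean GP at x_new."""
    n = len(xs)
    # Kernel matrix of the training inputs plus observation noise on the diagonal.
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(x, x_new) for x in xs]
    alpha = solve(K, ys)       # (K + noise * I)^-1 y
    weights = solve(K, k_star)  # (K + noise * I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, weights))
    return mean, var
```

Calling `gp_predict([-1.0, 0.0, 1.0], [1.0, 0.0, 1.0], 0.5)` returns a mean together with a variance; far away from the training inputs the variance returns to the prior variance, which is the "know when not to trust the predictions" behaviour the paragraph refers to.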
In this dissertation we study models that are composed of Gaussian processes embedded in other models in order to make those more flexible and/or probabilistic. The first model class that we study is that of deep Gaussian processes, which can be thought of as a small network of Gaussian processes and which can be employed for flexible regression. The second model class that we study is that of Gaussian process state-space models. These can be used for time-series modelling, i.e., the task of being given a stream of data ordered by time and then predicting future observations. For both model classes the state-of-the-art approaches offer a trade-off between expressive models and computational properties (e.g. speed or convergence properties) and mostly employ variational inference. Our goal is to improve inference in both models by first gaining a deep understanding of the existing methods and then, based on this, designing better inference methods. We achieve this by either exploring the existing trade-offs or by providing general improvements applicable to multiple methods.
We first provide an extensive background, introducing Gaussian processes and their sparse (approximate and efficient) variants. We continue with a description of the models under consideration in this thesis, deep Gaussian processes and Gaussian process state-space models, including detailed derivations and a theoretical comparison of existing methods.
Then we start analysing deep Gaussian processes more closely: Trading off the properties (good optimisation versus expressivity) of state-of-the-art methods in this field, we propose a new variational inference based approach. We then demonstrate experimentally that our new algorithm leads to better calibrated uncertainty estimates than existing methods.
Next, we turn our attention to Gaussian process state-space models, where we closely analyse the theoretical properties of existing methods. The understanding gained in this process leads us to propose a new inference scheme for general Gaussian process state-space models that incorporates effects on multiple time scales. This method is more efficient than previous approaches for long time series and outperforms its comparison partners on data sets in which effects on multiple time scales (fast and slowly varying dynamics) are present.
Finally, we propose a new inference approach for Gaussian process state-space models that trades off the properties of state-of-the-art methods in this field. By combining variational inference with another approximate inference method, the Laplace approximation, we design an efficient algorithm that outperforms its comparison partners since it achieves better calibrated uncertainties.
Residential segregation is a widespread phenomenon that can be observed in almost every major city. In these urban areas, residents with different ethnic or socioeconomic backgrounds tend to form homogeneous clusters. In Schelling’s classical segregation model, two types of agents are placed on a grid. An agent is content with its location if the fraction of its neighbors that have the same type as the agent is at least 𝜏, for some 0 < 𝜏 ≤ 1. Discontent agents simply swap their location with a randomly chosen other discontent agent or jump to a random empty location. The model gives a coherent explanation of how clusters can form even if all agents are tolerant, i.e., if they agree to live in mixed neighborhoods. For segregation to occur, all it needs is a slight bias towards agents preferring similar neighbors.
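The contentment rule of the classical model can be sketched in a few lines. The following is an illustrative sketch only: the 8-cell neighborhood, the use of `None` for empty cells, and the convention that an agent without occupied neighbors counts as content are assumptions, since the text does not fix these details.

```python
def same_type_fraction(grid, row, col):
    """Fraction of occupied neighbors (8-neighborhood) sharing the agent's type."""
    agent = grid[row][col]
    same = total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the agent's own cell
            r, c = row + dr, col + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                other = grid[r][c]
                if other is not None:  # None marks an empty cell
                    total += 1
                    same += other == agent
    # Assumption: an agent with no occupied neighbors is treated as content.
    return same / total if total else 1.0


def is_content(grid, row, col, tau):
    """An agent is content if its same-type fraction is at least tau."""
    return same_type_fraction(grid, row, col) >= tau
```

For a grid such as `[["A", "A", "B"], ["A", "A", "B"], ["B", "B", "B"]]`, the center agent has three same-type neighbors out of eight, so it is content for 𝜏 = 0.3 but discontent for 𝜏 = 0.5.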
Although the model is well studied, previous research focused on a random process point of view. However, it is more realistic to assume instead that the agents strategically choose where to live. We close this gap by introducing and analyzing game-theoretic models of Schelling segregation, where rational agents strategically choose their locations.
As the first step, we introduce and analyze a generalized game-theoretic model that allows more than two agent types and more general underlying graphs modeling the residential area. We introduce different versions of Swap and Jump Schelling Games. Swap Schelling Games assume that every vertex of the underlying graph serving as a residential area is occupied by an agent and pairs of discontent agents can swap their locations, i.e., their occupied vertices, to increase their utility. In contrast, for the Jump Schelling Game, we assume that there exist empty vertices in the graph and agents can jump to these vacant vertices if this increases their utility. We show that the number of agent types as well as the structure of the underlying graph heavily influence the dynamic properties and the tractability of finding an optimal strategy profile.
As a second step, we significantly deepen these investigations for the swap version with 𝜏 = 1 by studying the influence of the underlying topology modeling the residential area on the existence of equilibria, the Price of Anarchy, and the dynamic properties. Moreover, we restrict the movement of agents locally. As a main takeaway, we find that both aspects influence the existence and the quality of stable states.
Furthermore, also for the swap model, we follow sociological surveys and study non-monotone single-peaked utility functions instead of monotone ones, i.e., utility functions that are not monotone in the fraction of same-type neighbors, asking the same core game-theoretic questions. Our results clearly show that moving from monotone to non-monotone utilities yields novel structural properties and different results in terms of existence and quality of stable states.
In the last part, we introduce an agent-based saturated open-city variant, the Flip Schelling Process, in which agents, based on the predominant type in their neighborhood, decide whether to change their types. We provide a general framework for analyzing the influence of the underlying topology on residential segregation and investigate the probability that an edge is monochrome, i.e., that both incident vertices have the same type, on random geometric and Erdős–Rényi graphs. For random geometric graphs, we prove the existence of a constant c > 0 such that the expected fraction of monochrome edges after the Flip Schelling Process is at least 1/2 + c. For Erdős–Rényi graphs, we show the expected fraction of monochrome edges after the Flip Schelling Process is at most 1/2 + o(1).
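The monochrome-edge measure used above is straightforward to compute. The sketch below also includes one synchronous majority update as a stand-in for a flip step; this update rule (closed neighborhood, ties keep the current type, two types encoded as 0 and 1) is an assumption for illustration and is not claimed to be the exact definition of the Flip Schelling Process.

```python
def monochrome_fraction(edges, types):
    """Fraction of edges whose two endpoints have the same type."""
    if not edges:
        return 0.0
    return sum(1 for u, v in edges if types[u] == types[v]) / len(edges)


def majority_flip_step(n, edges, types):
    """One synchronous update: each vertex adopts the majority type of its
    closed neighborhood (itself plus its neighbors); ties keep the old type.
    This rule is a hypothetical simplification, not the thesis's definition."""
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    new_types = []
    for v in range(n):
        closed = [types[v]] + [types[u] for u in neighbors[v]]
        ones = sum(closed)
        zeros = len(closed) - ones
        if ones > zeros:
            new_types.append(1)
        elif zeros > ones:
            new_types.append(0)
        else:
            new_types.append(types[v])
    return new_types
```

On the path 0-1-2 with types `[0, 1, 0]`, no edge is monochrome; after one update under the assumed rule the middle vertex flips and both edges become monochrome.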
Today, point clouds are among the most important categories of spatial data, as they constitute digital 3D models of the as-is reality that can be created at unprecedented speed and precision. However, their unique properties, i.e., lack of structure, order, or connectivity information, necessitate specialized data structures and algorithms to leverage their full precision. In particular, this holds true for the interactive visualization of point clouds, which requires balancing hardware limitations regarding GPU memory and bandwidth against a naturally high susceptibility to visual artifacts.
This thesis focuses on concepts, techniques, and implementations of robust, scalable, and portable 3D visualization systems for massive point clouds. To that end, a number of rendering, visualization, and interaction techniques are introduced that extend several basic strategies to decouple rendering efforts and data management: First, a novel visualization technique that facilitates context-aware filtering, highlighting, and interaction within point cloud depictions. Second, hardware-specific optimization techniques that improve rendering performance and image quality in an increasingly diversified hardware landscape. Third, natural and artificial locomotion techniques for nausea-free exploration in the context of state-of-the-art virtual reality devices. Fourth, a framework for web-based rendering that enables collaborative exploration of point clouds across device ecosystems and facilitates the integration into established workflows and software systems.
In cooperation with partners from industry and academia, the practicability and robustness of the presented techniques are showcased via several case studies using representative application scenarios and point cloud data sets. In summary, the work shows that the interactive visualization of point clouds can be implemented by a multi-tier software architecture with a number of domain-independent, generic system components that rely on optimization strategies specific to large point clouds. It demonstrates the feasibility of interactive, scalable point cloud visualization as a key component for distributed IT solutions that operate with spatial digital twins, providing arguments in favor of using point clouds as a universal type of spatial base data usable directly for visualization purposes.
Motivation:
Constraint-based modeling approaches allow the estimation of maximal in vivo enzyme catalytic rates that can serve as proxies for enzyme turnover numbers. Yet, genome-scale flux profiling remains a challenge in deploying these approaches to catalogue proxies for enzyme catalytic rates across organisms.
Results:
Here, we formulate a constraint-based approach, termed NIDLE-flux, to estimate fluxes at a genome-scale level by using the principle of efficient usage of expressed enzymes. Using proteomics data from Escherichia coli, we show that the fluxes estimated by NIDLE-flux and the existing approaches are in excellent qualitative agreement (Pearson correlation > 0.9). We also find that the maximal in vivo catalytic rates estimated by NIDLE-flux exhibit a Pearson correlation of 0.74 with in vitro enzyme turnover numbers. However, NIDLE-flux results in a 1.4-fold increase in the size of the estimated maximal in vivo catalytic rates in comparison to the contenders. Integration of the maximum in vivo catalytic rates with publicly available proteomics and metabolomics data provides a better match to fluxes estimated by NIDLE-flux. Therefore, NIDLE-flux facilitates more effective usage of proteomics data to estimate proxies for kcatomes.
A core operator of evolutionary algorithms (EAs) is mutation. Recently, much attention has been devoted to the study of mutation operators with dynamic and non-uniform mutation rates. Following up on this area of work, we propose a new mutation operator and analyze its performance on the (1 + 1) Evolutionary Algorithm (EA). Our analyses show that this mutation operator competes with pre-existing ones when used by the (1 + 1) EA on classes of problems for which results on the other mutation operators are available. We show that the (1 + 1) EA using our mutation operator finds a (1/3)-approximation on any non-negative submodular function in polynomial time. We also consider the problem of maximizing a symmetric submodular function under a single matroid constraint and show that the (1 + 1) EA using our operator finds a (1/3)-approximation within polynomial time. This performance matches that of combinatorial local search algorithms specifically designed to solve these problems and outperforms them with constant probability. Finally, we evaluate the performance of the (1 + 1) EA using our operator experimentally by considering two applications: (a) the maximum directed cut problem on real-world graphs of different origins, with up to 6.6 million vertices and 56 million edges, and (b) the symmetric mutual information problem using a four-month air pollution data set. In comparison with uniform mutation and a recently proposed dynamic scheme, our operator comes out on top on these instances.
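For orientation, the (1 + 1) EA itself can be sketched on the maximum directed cut problem. The sketch below uses the standard uniform bit-flip mutation with rate 1/n, i.e., the baseline the abstract compares against, not the newly proposed operator, whose definition is not given here. The function names, the all-zeros start, and the acceptance of equally good offspring are illustrative assumptions.

```python
import random


def directed_cut(edges, x):
    """Number of directed edges (u, v) leaving the selected set: x[u]=1, x[v]=0."""
    return sum(1 for u, v in edges if x[u] == 1 and x[v] == 0)


def one_plus_one_ea(fitness, n, iterations, rng):
    """(1 + 1) EA with standard uniform bit-flip mutation (rate 1/n).

    Keeps a single parent; the offspring replaces it if it is at least as fit.
    """
    x = [0] * n  # assumed starting point: the empty set
    fx = fitness(x)
    for _ in range(iterations):
        # Flip each bit independently with probability 1/n.
        y = [1 - bit if rng.random() < 1.0 / n else bit for bit in x]
        fy = fitness(y)
        if fy >= fx:
            x, fx = y, fy
    return x, fx
```

A small usage example: with `edges = [(0, 1), (0, 2), (1, 2), (2, 3)]`, calling `one_plus_one_ea(lambda x: directed_cut(edges, x), 4, 1000, random.Random(0))` returns a bit vector and its cut value. Because selection is elitist, the returned value is never worse than that of the starting point.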
Analysis of protrusion dynamics in amoeboid cell motility by means of regularized contour flows
(2021)
Amoeboid cell motility is essential for a wide range of biological processes including wound healing, embryonic morphogenesis, and cancer metastasis. It relies on complex dynamical patterns of cell shape changes that pose long-standing challenges to mathematical modeling and raise a need for automated and reproducible approaches to extract quantitative morphological features from image sequences. Here, we introduce a theoretical framework and a computational method for obtaining smooth representations of the spatiotemporal contour dynamics from stacks of segmented microscopy images. Based on a Gaussian process regression we propose a one-parameter family of regularized contour flows that allows us to continuously track reference points (virtual markers) between successive cell contours. We use this approach to define a coordinate system on the moving cell boundary and to represent different local geometric quantities in this frame of reference. In particular, we introduce the local marker dispersion as a measure to identify localized membrane expansions and provide a fully automated way to extract the properties of such expansions, including their area and growth time. The methods are available as an open-source software package called AmoePy, a Python-based toolbox for analyzing amoeboid cell motility (based on time-lapse microscopy data), including a graphical user interface and detailed documentation. Due to the mathematical rigor of our framework, we envision it to be of use for the development of novel cell motility models. We mainly use experimental data of the social amoeba Dictyostelium discoideum to illustrate and validate our approach.
Author summary
Amoeboid motion is a crawling-like cell migration that plays a key role in multiple biological processes such as wound healing and cancer metastasis. This type of cell motility results from expanding and simultaneously contracting parts of the cell membrane.
From fluorescence images, we obtain a sequence of points, representing the cell membrane, for each time step. By using regression analysis on these sequences, we derive smooth representations, so-called contours, of the membrane. Since the number of measurements is discrete and often limited, the question is raised of how to link consecutive contours with each other. In this work, we present a novel mathematical framework in which these links are described by regularized flows allowing a certain degree of concentration or stretching of neighboring reference points on the same contour. This stretching rate, the so-called local dispersion, is used to identify expansions and contractions of the cell membrane providing a fully automated way of extracting properties of these cell shape changes. We applied our methods to time-lapse microscopy data of the social amoeba Dictyostelium discoideum.
Proceedings of the HPI Research School on Service-oriented Systems Engineering 2020 Fall Retreat
(2021)
Design and implementation of service-oriented architectures pose a huge number of research questions from the fields of software engineering, system analysis and modeling, adaptability, and application integration. Component orientation and web services are two approaches for the design and realization of complex web-based systems. Both approaches allow for dynamic application adaptation as well as integration of enterprise applications.
Service-Oriented Systems Engineering represents a symbiosis of best practices in object-orientation, component-based development, distributed computing, and business process management. It provides integration of business and IT concerns.
The annual Ph.D. Retreat of the Research School provides each member the opportunity to present the current state of their research and to give an outline of a prospective Ph.D. thesis. Due to the interdisciplinary structure of the research school, this technical report covers a wide range of topics. These include but are not limited to: Human Computer Interaction and Computer Vision as Service; Service-oriented Geovisualization Systems; Algorithm Engineering for Service-oriented Systems; Modeling and Verification of Self-adaptive Service-oriented Systems; Tools and Methods for Software Engineering in Service-oriented Systems; Security Engineering of Service-based IT Systems; Service-oriented Information Systems; Evolutionary Transition of Enterprise Applications to Service Orientation; Operating System Abstractions for Service-oriented Computing; and Services Specification, Composition, and Enactment.
Intrinsic decomposition refers to the problem of estimating scene characteristics, such as albedo and shading, when one view or multiple views of a scene are provided. The inverse problem setting, where multiple unknowns are solved given a single known pixel value, is highly under-constrained. When provided with correlated image and depth data, which nowadays are easy to acquire with high-end smartphones by utilizing their depth sensors, intrinsic scene decomposition can be facilitated using depth-based priors. In this work, we present a system for intrinsic decomposition of RGB-D images on smartphones and the algorithmic as well as design choices therein. Unlike state-of-the-art methods that assume only diffuse reflectance, we consider both diffuse and specular pixels. For this purpose, we present a novel specularity extraction algorithm based on a multi-scale intensity decomposition and chroma inpainting. The diffuse component is then further decomposed into albedo and shading components. We use an inertial proximal algorithm for non-convex optimization (iPiano) to ensure albedo sparsity. Our GPU-based visual processing is implemented on iOS via the Metal API and enables interactive performance on an iPhone 11 Pro. A qualitative evaluation shows that we are able to obtain high-quality outputs. Furthermore, our proposed approach for specularity removal outperforms state-of-the-art approaches for real-world images, while our albedo and shading layer decomposition is faster than prior work at a comparable output quality. Manifold applications such as recoloring, retexturing, relighting, appearance editing, and stylization are shown, each using the intrinsic layers obtained with our method and/or the corresponding depth data.
Learning the causal structures from observational data is an omnipresent challenge in data science. The amount of observational data available to Causal Structure Learning (CSL) algorithms is increasing as data is collected at high frequency from many data sources nowadays. While processing more data generally yields higher accuracy in CSL, the concomitant increase in the runtime of CSL algorithms hinders their widespread adoption in practice. CSL is a parallelizable problem. Existing parallel CSL algorithms address execution on multi-core Central Processing Units (CPUs) with dozens of compute cores. However, modern computing systems are often heterogeneous and equipped with Graphics Processing Units (GPUs) to accelerate computations. Typically, these GPUs provide several thousand compute cores for massively parallel data processing.
To shorten the runtime of CSL algorithms, we design efficient execution strategies that leverage the parallel processing power of GPUs. Particularly, we derive GPU-accelerated variants of a well-known constraint-based CSL method, the PC algorithm, as it allows choosing a statistical Conditional Independence test (CI test) appropriate to the observational data characteristics.
Our two main contributions are: (1) to reflect differences in the CI tests, we design three GPU-based variants of the PC algorithm tailored to CI tests that handle data with the following characteristics. We develop one variant for data assuming the Gaussian distribution model, one for discrete data, and another for mixed discrete-continuous data and data with non-linear relationships. Each variant is optimized for the appropriate CI test leveraging GPU hardware properties, such as shared or thread-local memory. Our GPU-accelerated variants outperform state-of-the-art parallel CPU-based algorithms by factors of up to 93.4× for data assuming the Gaussian distribution model, up to 54.3× for discrete data, up to 240× for continuous data with non-linear relationships and up to 655× for mixed discrete-continuous data. However, the proposed GPU-based variants are limited to datasets that fit into a single GPU’s memory. (2) To overcome this shortcoming, we develop approaches to scale our GPU-based variants beyond a single GPU’s memory capacity. For example, we design an out-of-core GPU variant that employs explicit memory management to process arbitrary-sized datasets. Runtime measurements on a large gene expression dataset reveal that our out-of-core GPU variant is 364 times faster than a parallel CPU-based CSL algorithm. Overall, our proposed GPU-accelerated variants speed up CSL in numerous settings to foster CSL’s adoption in practice and research.
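To make the role of the CI test concrete for the Gaussian case, the following sketch shows a Fisher z-test on a partial correlation with a single conditioning variable, a test commonly paired with the PC algorithm for Gaussian data. It is a CPU-only illustration and makes no claim about the GPU variants described above; the function names and the restriction to one conditioning variable are assumptions.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y given a single variable Z,
    computed from the three pairwise correlation coefficients."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def fisher_z_pvalue(r, n_samples, cond_size):
    """Two-sided p-value of Fisher's z-test for a (partial) correlation r,
    given n_samples observations and cond_size conditioning variables."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    statistic = math.sqrt(n_samples - cond_size - 3) * abs(z)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(statistic / math.sqrt(2))


def is_independent(r_xy, r_xz, r_yz, n_samples, alpha=0.05):
    """Accept conditional independence of X and Y given Z at level alpha."""
    r = partial_correlation(r_xy, r_xz, r_yz)
    return fisher_z_pvalue(r, n_samples, cond_size=1) > alpha
```

In a PC-style skeleton search, a test of this kind is evaluated for many variable pairs and conditioning sets, and these evaluations are independent of one another, which is what makes the problem amenable to parallel execution.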
We study the concept of reversibility in connection with parallel communicating systems of finite automata (PCFA in short). We define the notion of reversibility in the case of PCFA (also covering the non-deterministic case) and discuss the relationship of the reversibility of the systems and the reversibility of its components. We show that a system can be reversible with non-reversible components, and the other way around, the reversibility of the components does not necessarily imply the reversibility of the system as a whole. We also investigate the computational power of deterministic centralized reversible PCFA. We show that these very simple types of PCFA (returning or non-returning) can recognize regular languages which cannot be accepted by reversible (deterministic) finite automata, and that they can even accept languages that are not context-free. We also separate the deterministic and non-deterministic variants in the case of systems with non-returning communication. We show that there are languages accepted by non-deterministic centralized PCFA, which cannot be recognized by any deterministic variant of the same type.
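As background for the single-automaton case, one common notion of reversibility for a deterministic finite automaton requires that no two distinct states lead to the same state under the same input symbol, so that every step can be undone uniquely. The sketch below checks only this property for a transition table; it does not model PCFA or communication, some definitions in the literature add conditions on initial and final states, and the dictionary encoding is an assumption.

```python
def is_reversible(transitions):
    """Check backward determinism of a DFA transition table.

    `transitions` maps (state, symbol) to the successor state. The table is
    reversible in the sense used here if, for every symbol, each state has
    at most one predecessor.
    """
    seen = set()
    for (state, symbol), target in transitions.items():
        if (target, symbol) in seen:
            return False  # two states reach `target` on the same symbol
        seen.add((target, symbol))
    return True
```

For example, the two-state table `{(0, "a"): 1, (1, "a"): 0}` passes the check, while `{(0, "a"): 1, (1, "a"): 1}` does not, because state 1 has two predecessors under the symbol `a`.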
Decubitus is one of the most relevant diseases in nursing and the most expensive to treat. It is caused by sustained pressure on tissue, so it particularly affects bed-bound patients. This work lays a foundation for pressure mattress-based decubitus prophylaxis by implementing a solution to the single-frame 2D Human Pose Estimation problem.
For this, methods of Deep Learning are employed. Two approaches are examined, a coarse-to-fine Convolutional Neural Network for direct regression of joint coordinates and a U-Net for the derivation of probability distribution heatmaps.
We conclude that training our models on a combined dataset of the publicly available Bodies at Rest and SLP data yields the best results. Furthermore, various preprocessing techniques are investigated, and a hyperparameter optimization is performed to discover an improved model architecture.
Another finding indicates that the heatmap-based approach outperforms direct regression.
This model achieves a mean per-joint position error of 9.11 cm for the Bodies at Rest data and 7.43 cm for the SLP data.
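The reported metric can be stated compactly. The sketch below computes the mean per-joint position error for 2D joint coordinates as the average Euclidean distance between predicted and ground-truth joints; the function name and the tuple-based input format are assumptions for illustration, and the result is in whatever unit the coordinates use.

```python
import math


def mean_per_joint_position_error(predicted, ground_truth):
    """Average Euclidean distance between corresponding 2D joints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty joint lists of equal length")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)
```

For instance, predictions `[(0, 0), (3, 4)]` against ground truth `[(0, 0), (0, 0)]` give per-joint errors of 0 and 5, hence a mean error of 2.5.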
We find that it generalizes well on data from mattresses other than those seen during training but has difficulties detecting the arms correctly.
Additionally, we give a brief overview of the medical data annotation tool annoto we developed in the bachelor project and furthermore conclude that the Scrum framework and agile practices enhanced our development workflow.
The reconstruction of cone-beam computed tomography data using filtered back-projection algorithms unavoidably results in severe artefacts. We describe how the Direct Iterative Reconstruction of Computed Tomography Trajectories (DIRECTT) algorithm can be combined with a model of the artefacts for the reconstruction of such data. The implementation of DIRECTT results in reconstructed volumes of superior quality compared to the conventional algorithms.
We introduce a logic-based incremental approach to graph repair, generating a sound and complete (upon termination) overview of least-changing graph repairs from which a user may select a graph repair based on non-formalized further requirements. This incremental approach features delta preservation, as it allows restricting the generation of graph repairs to delta-preserving graph repairs, which do not revert the additions and deletions of the most recent consistency-violating graph update. We specify consistency of graphs using the logic of nested graph conditions, which is equivalent to first-order logic on graphs. Technically, the incremental approach encodes if and how the graph under repair satisfies a graph condition using the novel data structure of satisfaction trees, which are adapted incrementally according to the graph updates applied. In addition to the incremental approach, we also present two state-based graph repair algorithms, which restore consistency of a graph independent of the most recent graph update and which generate additional graph repairs using a global perspective on the graph under repair. We evaluate the developed algorithms using our prototypical implementation in the tool AutoGraph and illustrate our incremental approach using a case study from the graph database domain.
RailChain
(2023)
The RailChain project designed, implemented, and experimentally evaluated a juridical recorder that is based on a distributed consensus protocol. That juridical blockchain recorder has been realized as distributed ledger on board the advanced TrainLab (ICE-TD 605 017) of Deutsche Bahn.
For the project, a consortium consisting of DB Systel, Siemens, Siemens Mobility, the Hasso Plattner Institute for Digital Engineering, Technische Universität Braunschweig, TÜV Rheinland InterTraffic, and Spherity has been formed. These partners not only concentrated competencies in railway operation, computer science, regulation, and approval, but also combined experiences from industry, research from academia, and enthusiasm from startups.
Distributed ledger technologies (DLTs) define distributed databases and express a digital protocol for transactions between business partners without the need for a trusted intermediary. The implementation of a blockchain with real-time requirements for the local network of a railway system (e.g., interlocking or train) allows data to be logged in the distributed system verifiably and in real time. For this, railway-specific assumptions can be leveraged to make modifications to standard blockchain protocols.
EULYNX and OCORA (Open CCS On-board Reference Architecture) are parts of a future European reference architecture for control command and signalling (CCS, Reference CCS Architecture – RCA). Both architectural concepts outline heterogeneous IT systems with components from multiple manufacturers. Such systems introduce novel challenges for the approved and safety-relevant CCS of railways which were considered neither for road-side nor for on-board systems so far. Logging implementations, such as the common juridical recorder on vehicles, can no longer be realized as a central component of a single manufacturer. All centralized approaches are in question.
The research project RailChain is funded by the mFUND program and gives practical evidence that distributed consensus protocols are a proper means to immutably (for legal purposes) store state information of many system components from multiple manufacturers. The results of RailChain have been published, prototypically implemented, and experimentally evaluated in large-scale field tests on the advanced TrainLab. At the same time, the project showed how RailChain can be integrated into the road-side and on-board architecture given by OCORA and EULYNX.
Logged data can now be analysed sooner, and their trustworthiness is increased. This enables, e.g., auditable predictive maintenance, because it is ensured that data is authentic and unmodified at any point in time.
The “HPI Future SOC Lab” is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners.
The HPI Future SOC Lab provides researchers with free-of-charge access to a complete infrastructure of state-of-the-art hardware and software. This infrastructure includes components that might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB of main memory. The offerings address researchers particularly from, but not limited to, the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies.
This technical report presents results of research projects executed in 2018. Selected projects have presented their results on April 17th and November 14th 2017 at the Future SOC Lab Day events.
In this thesis, we investigate language learning in the formalisation of Gold [Gol67]. Here, a learner, being successively presented all information of a target language, conjectures which language it believes to be shown. Once these hypotheses converge syntactically to a correct explanation of the target language, the learning is considered successful. Fittingly, this is termed explanatory learning. To model learning strategies, we impose restrictions on the hypotheses made, for example requiring the conjectures to follow a monotonic behaviour. This way, we can study the impact a certain restriction has on learning.
Recently, the literature shifted towards map charting. Here, various seemingly unrelated restrictions are contrasted, unveiling interesting relations between them. The results are then depicted in maps. For explanatory learning, the literature already provides maps of common restrictions for various forms of data presentation.
In the case of behaviourally correct learning, where the learners are required to converge semantically instead of syntactically, the same restrictions as in explanatory learning have been investigated. However, a similarly complete picture regarding their interaction has not been presented yet.
In this thesis, we transfer the map charting approach to behaviourally correct learning. In particular, we complete the partial results from the literature for many well-studied restrictions and provide full maps for behaviourally correct learning with different types of data presentation. We also study properties of learners deemed important in the literature. We are interested in whether learners are consistent, that is, whether their conjectures include the data they are built on. While learners cannot be assumed consistent in explanatory learning, the opposite is the case in behaviourally correct learning. Furthermore, it is known that learners following different restrictions may be assumed consistent. We contribute to the literature by showing that this is the case for all studied restrictions.
We also investigate mathematically interesting properties of learners. In particular, we are interested in whether learning under a given restriction may be done with strongly Bc-locking learners. Such learners are of particular value as they allow simulation arguments to be applied when, for example, comparing two learning paradigms to each other. The literature offers a rich body of results on when learners may be assumed strongly Bc-locking, which we complete for all studied restrictions.
Empirical investigations on the uncanny valley have almost solely focused on the analysis of people's noninteractive perception of a robot at first sight. Recent studies suggest, however, that these uncanny first impressions may be significantly altered over an interaction. What has yet to be discovered is whether certain interaction patterns can lead to a faster decline in uncanny feelings. In this paper, we present a study in which participants with limited expertise in Computer Science played a collaborative geography game with a Furhat robot. During the game, Furhat displayed one of two personalities, which corresponded to two different interaction strategies. The robot was either optimistic and encouraging, or impatient and provocative. We performed the study in a science museum and recruited participants among the visitors. Our findings suggest that a robot that is rated high on agreeableness, emotional stability, and conscientiousness can indeed weaken uncanny feelings. This study has important implications for human-robot interaction design as it further highlights that a first impression, merely based on a robot's appearance, is not indicative of the affinity people might develop towards it throughout an interaction. We thus argue that future work should emphasize investigations on exact interaction patterns that can help to overcome uncanny feelings.
The amount of data stored in databases and the complexity of database workloads are ever-increasing. Database management systems (DBMSs) offer many configuration options, such as index creation or unique constraints, which must be adapted to the specific instance to efficiently process large volumes of data. Currently, such database optimization is complicated, manual work performed by highly skilled database administrators (DBAs). In cloud scenarios, manual database optimization even becomes infeasible: it exceeds the abilities of the best DBAs due to the enormous number of deployed DBMS instances (some providers maintain millions of instances), missing domain knowledge resulting from data privacy requirements, and the complexity of the configuration tasks.
Therefore, we investigate how to automate the configuration of DBMSs efficiently with the help of unsupervised database optimization. While there are numerous configuration options, in this thesis, we focus on automatic index selection and the use of data dependencies, such as functional dependencies, for query optimization. Both aspects have an extensive performance impact and complement each other by approaching unsupervised database optimization from different perspectives.
Our contributions are as follows: (1) we survey automated state-of-the-art index selection algorithms regarding various criteria, e.g., their support for index interaction. We contribute an extensible platform for evaluating the performance of such algorithms with industry-standard datasets and workloads. The platform is well-received by the community and has led to follow-up research. With our platform, we derive the strengths and weaknesses of the investigated algorithms. We conclude that existing solutions often have scalability issues and cannot quickly determine (near-)optimal solutions for large problem instances. (2) To overcome these limitations, we present two new algorithms. Extend determines (near-)optimal solutions with an iterative heuristic. It identifies the best index configurations for the evaluated benchmarks. Its selection runtimes are up to 10 times lower compared with other near-optimal approaches. SWIRL is based on reinforcement learning and delivers solutions instantly. These solutions perform within 3 % of the optimal ones. Extend and SWIRL are available as open-source implementations.
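To make the idea of an iterative index selection heuristic more concrete, the following is a minimal sketch of a generic greedy procedure that repeatedly adds the candidate index with the best cost reduction per unit of storage. It is not the Extend algorithm itself; the function name `greedy_index_selection`, the `workload_cost` callback, and the storage `budget` are assumptions introduced purely for illustration.

```python
def greedy_index_selection(candidates, sizes, workload_cost, budget):
    """Greedily pick indexes by benefit-per-storage ratio (illustrative only).

    candidates:    iterable of hashable index identifiers (hypothetical)
    sizes:         dict mapping each candidate to its positive storage size
    workload_cost: callable taking a frozenset of indexes and returning the
                   estimated workload cost (hypothetical cost model)
    budget:        maximum total storage that the chosen indexes may use
    """
    chosen = set()
    used = 0
    current = workload_cost(frozenset(chosen))
    while True:
        best, best_ratio, best_cost = None, 0.0, current
        for idx in candidates:
            # Skip indexes that are already chosen or do not fit the budget.
            if idx in chosen or used + sizes[idx] > budget:
                continue
            cost = workload_cost(frozenset(chosen | {idx}))
            ratio = (current - cost) / sizes[idx]
            if ratio > best_ratio:
                best, best_ratio, best_cost = idx, ratio, cost
        if best is None:
            # No remaining candidate both fits and reduces the cost.
            return chosen, current
        chosen.add(best)
        used += sizes[best]
        current = best_cost
```

In practice, the cost callback would typically be backed by a query optimizer's what-if estimates rather than a closed-form function; that choice is left open here.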
(3) Our index selection efforts are complemented by a mechanism that analyzes workloads to determine data dependencies for query optimization in an unsupervised fashion. We describe and classify 58 query optimization techniques based on functional, order, and inclusion dependencies as well as on unique column combinations. The unsupervised mechanism and three optimization techniques are implemented in our open-source research DBMS Hyrise. Our approach reduces the Join Order Benchmark’s runtime by 26 % and accelerates some TPC-DS queries by up to 58 times.
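As a small illustration of what validating a data dependency can look like, the sketch below checks whether a functional dependency holds on a set of rows. It is a simplified stand-in, not the mechanism implemented in Hyrise; the function name `holds_functional_dependency` and the dictionary-based row format are assumptions made for this example.

```python
def holds_functional_dependency(rows, determinant, dependent):
    """Return True if the determinant columns functionally determine the
    dependent columns in the given rows (illustrative check only).

    rows:        iterable of dicts mapping column names to values
    determinant: list of column names on the left-hand side
    dependent:   list of column names on the right-hand side
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # The first value seen for a key is recorded; any later row with the
        # same key but a different value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True
```

If such a dependency is known to hold, an optimizer could, for example, drop the dependent columns from a GROUP BY clause that already contains the determinant columns.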
Additionally, we have developed a cockpit for unsupervised database optimization that allows interactive experiments to build confidence in such automated techniques. In summary, our contributions improve the performance of DBMSs, support DBAs in their work, and enable them to contribute their time to other, less arduous tasks.
Pictures are a medium that helps make the past tangible and preserve memories. Without context, they are not able to do so. Pictures are brought to life by their associated stories. However, the older pictures become, the fewer contemporary witnesses can tell these stories.
Especially for large, analog picture archives, knowledge and memories are spread over many people. This creates several challenges: First, the pictures must be digitized to save them from decaying and make them available to the public. Since a simple listing of all the pictures is confusing, the pictures should be structured accessibly. Second, known information that makes the stories vivid needs to be added to the pictures. Users should get the opportunity to contribute their knowledge and memories. To make this usable for all interested parties, even for older, less technophile generations, the interface should be intuitive and error-tolerant.
The resulting requirements are not covered in their entirety by any existing software solution without losing the intuitive interface or the scalability of the system.
Therefore, we have developed our digital picture archive within the scope of a bachelor project in cooperation with the Bad Harzburg-Stiftung. For the implementation of this web application, we use the UI framework React in the frontend, which communicates via a GraphQL interface with the Content Management System Strapi in the backend. The use of this system enables our project partner to create an efficient process from scanning analog pictures to presenting them to visitors in an organized and annotated way. To customize the solution for both picture delivery and information contribution for our target group, we designed prototypes and evaluated them with people from Bad Harzburg. This helped us gain valuable insights into our system’s usability and future challenges as well as requirements.
Our web application is already being used daily by our project partner. During the project, we still came up with numerous ideas for additional features to further support the exchange of knowledge.
Accurately solving classification problems is arguably the most relevant machine learning task today. Binary classification, which separates only two classes, is algorithmically simpler but has fewer potential applications, as many real-world problems are multi-class. Conversely, separating only a subset of classes simplifies the classification task. Even though existing multi-class machine learning algorithms are very flexible regarding the number of classes, they assume that the target set Y is fixed and cannot be restricted once the training is finished. On the other hand, existing state-of-the-art production environments are becoming increasingly interconnected with the advance of Industry 4.0 and related technologies such that additional information can simplify the respective classification problems. In light of this, the main aim of this thesis is to introduce dynamic classification that generalizes multi-class classification such that the target class set can be restricted arbitrarily to a non-empty class subset M of Y at any time between two consecutive predictions.
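One simple way to picture the restriction of the target set Y to a subset M is to renormalize a classifier's posterior estimates over the allowed classes. The sketch below shows only this naive baseline and is not the evidence-theoretic model developed in the thesis; the function names `restrict_posterior` and `predict_restricted` are hypothetical.

```python
def restrict_posterior(probs, allowed):
    """Renormalize class probabilities over a non-empty allowed subset M.

    probs:   dict mapping every class in Y to its estimated probability
    allowed: non-empty collection of classes (the subset M of Y)
    """
    allowed = set(allowed)
    if not allowed:
        raise ValueError("the allowed class subset must be non-empty")
    unknown = allowed - set(probs)
    if unknown:
        raise ValueError(f"classes not in the target set: {sorted(unknown)}")
    mass = sum(probs[c] for c in allowed)
    if mass == 0:
        # Assumption: fall back to a uniform distribution when the model
        # assigns no probability mass to any allowed class.
        return {c: 1.0 / len(allowed) for c in allowed}
    return {c: probs[c] / mass for c in allowed}


def predict_restricted(probs, allowed):
    """Return the most probable class among the allowed ones."""
    restricted = restrict_posterior(probs, allowed)
    return max(restricted, key=restricted.get)
```

Because M is passed in at prediction time, it can change between two consecutive predictions without retraining the underlying model.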
This task is solved by a combination of two algorithmic approaches. The first is classifier calibration, which transforms predictions into posterior probability estimates that are intended to be well calibrated. The analysis provided focuses on monotonic calibration and in particular corrects wrong statements that appeared in the literature. It also reveals that bin-based evaluation metrics, which became popular in recent years, are unjustified and should not be used at all. Next, the validity of Platt scaling, which is the most relevant parametric calibration approach, is analyzed in depth. In particular, its optimality for classifier predictions distributed according to four different families of probability distributions as well as its equivalence with Beta calibration up to a sigmoidal preprocessing are proven. For non-monotonic calibration, extended variants based on kernel density estimation and the ensemble method EKDE are introduced. Finally, the calibration techniques are evaluated using a simulation study with complete information as well as on a selection of 46 real-world data sets.
Building on this, classifier calibration is applied as part of decomposition-based classification that aims to reduce multi-class problems to simpler (usually binary) prediction tasks. For the involved fusing step performed at prediction time, a new approach based on evidence theory is presented that uses classifier calibration to model mass functions. This allows decomposition-based classification to be analyzed against a strictly formal background and closed-form equations for the overall combinations to be proven. Furthermore, the same formalism leads to a consistent integration of dynamic class information, yielding a theoretically justified and computationally tractable dynamic classification model. The insights gained from this modeling are combined with pairwise coupling, which is one of the most relevant reduction-based classification approaches, such that all individual predictions are combined with a weight. This not only generalizes existing works on pairwise coupling but also enables the integration of dynamic class information.
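For readers unfamiliar with Platt scaling, the following minimal sketch fits a logistic function of the raw classifier score by plain gradient descent on the log loss. It illustrates the general technique only and is not taken from the thesis; the parameter names `a` and `b`, the learning rate, and the step count are assumptions chosen for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit p(y=1 | s) = sigmoid(a * s + b) by minimizing the mean log loss.

    scores: raw classifier outputs (floats)
    labels: binary targets (0 or 1)
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated probability estimate."""
    return sigmoid(a * score + b)
```

The fitted mapping is monotonic in the score, which is why Platt scaling counts as a monotonic calibration method; the parameters should be fitted on held-out data rather than on the classifier's training set.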
Lastly, a thorough empirical study is performed that compares all newly introduced approaches to existing state-of-the-art techniques. For this, evaluation metrics for dynamic classification are introduced that depend on corresponding sampling strategies. Thereafter, these are applied during a three-part evaluation. First, support vector machines and random forests are applied on 26 data sets from the UCI Machine Learning Repository. Second, two state-of-the-art deep neural networks are evaluated on five benchmark data sets from a relatively recent reference work. Here, computationally feasible strategies to apply the presented algorithms in combination with large-scale models are particularly relevant because a naive application is computationally intractable. Finally, reference data from a real-world process allowing the inclusion of dynamic class information are collected and evaluated. The results show that in combination with support vector machines and random forests, pairwise coupling approaches yield the best results, while in combination with deep neural networks, differences between the different approaches are mostly small to negligible. Most importantly, all results empirically confirm that dynamic classification succeeds in improving the respective prediction accuracies. Therefore, it is crucial to pass dynamic class information in respective applications, which requires an appropriate digital infrastructure.
Teaching and learning as well as administrative processes are still experiencing intensive changes with the rise of artificial intelligence (AI) technologies and their diverse application opportunities in the context of higher education. Accordingly, scientific interest in the topic in general, and in specific focal points, has risen as well. However, there is no structured overview of AI in teaching and administration processes in higher education institutions that allows major research topics and trends to be identified, concretizes peculiarities, and develops recommendations for further action. To close this gap, this study seeks to systematize the current scientific discourse on AI in teaching and administration in higher education institutions. This study identified (1) an imbalance in research on AI in educational and administrative contexts, (2) an imbalance in disciplines and a lack of interdisciplinary research, (3) inequalities in cross-national research activities, and (4) neglected research topics and paths. In this way, the study contributes a comparative analysis of AI usage in administration and in teaching and learning processes, a systematization of the state of research, and an identification of research gaps as well as further research paths on AI in higher education institutions.
Defining the metaverse
(2023)
The term Metaverse is emerging as a result of the late push by multinational technology conglomerates and a recent surge of interest in Web 3.0, Blockchain, NFT, and Cryptocurrencies. From a scientific point of view, there is no definite consensus on what the Metaverse will be like. This paper collects, analyzes, and synthesizes scientific definitions and the accompanying major characteristics of the Metaverse using the methodology of a Systematic Literature Review (SLR). Two revised definitions for the Metaverse are presented, both condensing the key attributes. The first one is rather simplistic and holistic, describing “a three-dimensional online environment in which users represented by avatars interact with each other in virtual spaces decoupled from the real physical world”. In contrast, the second definition is specified in a more detailed manner in the paper and further discussed. These comprehensive definitions offer specialized and general scholars an application within and beyond the scientific context of systems science, information systems science, computer science, and business informatics, while also introducing open research challenges. Furthermore, an outlook on the social, economic, and technical implications is given, and the preconditions that are necessary for a successful implementation are discussed.