004 Datenverarbeitung; Informatik
Refine
Year of publication
Document Type
- Article (332)
- Monograph/Edited Volume (165)
- Doctoral Thesis (157)
- Conference Proceeding (52)
- Postprint (50)
- Master's Thesis (10)
- Other (7)
- Preprint (3)
- Part of a Book (2)
- Bachelor Thesis (1)
Language
- English (587)
- German (192)
- Multiple languages (2)
Keywords
- Informatik (21)
- machine learning (17)
- Didaktik (15)
- Hochschuldidaktik (14)
- Ausbildung (13)
- answer set programming (13)
- Cloud Computing (11)
- cloud computing (11)
- Hasso-Plattner-Institut (10)
- Hasso Plattner Institute (9)
Institute
- Institut für Informatik und Computational Science (270)
- Hasso-Plattner-Institut für Digital Engineering gGmbH (215)
- Hasso-Plattner-Institut für Digital Engineering GmbH (127)
- Extern (65)
- Fachgruppe Betriebswirtschaftslehre (27)
- Mathematisch-Naturwissenschaftliche Fakultät (24)
- Wirtschaftswissenschaften (19)
- Institut für Mathematik (16)
- Bürgerliches Recht (12)
- Digital Engineering Fakultät (8)
The past three decades of policy process studies have seen the emergence of a clear intellectual lineage with regard to complexity. Implicitly or explicitly, scholars have employed complexity theory to examine the intricate dynamics of collective action in political contexts. However, the methodological counterparts to complexity theory, such as computational methods, are rarely used and, even if they are, they are often detached from established policy process theory. Building on a critical review of the application of complexity theory to policy process studies, we present and implement a baseline model of policy processes using the logic of coevolving networks. Our model suggests that an actor's influence depends on their environment and on exogenous events facilitating dialogue and consensus-building. Our results validate previous opinion dynamics models and generate novel patterns. Our discussion provides ground for further research and outlines the path for the field to achieve a computational turn.
PC2P
(2021)
Motivation:
Prediction of protein complexes from protein-protein interaction (PPI) networks is an important problem in systems biology, as they control different cellular functions. The existing solutions employ algorithms for network community detection that identify dense subgraphs in PPI networks. However, gold standards in yeast and human indicate that protein complexes can also induce sparse subgraphs, introducing further challenges in protein complex prediction.
Results:
To address this issue, we formalize protein complexes as biclique spanned subgraphs, which include both sparse and dense subgraphs. We then cast the problem of protein complex prediction as a network partitioning into biclique spanned subgraphs with removal of minimum number of edges, called coherent partition. Since finding a coherent partition is a computationally intractable problem, we devise a parameter-free greedy approximation algorithm, termed Protein Complexes from Coherent Partition (PC2P), based on key properties of biclique spanned subgraphs. Through comparison with nine contenders, we demonstrate that PC2P: (i) successfully identifies modular structure in networks, as a prerequisite for protein complex prediction, (ii) outperforms the existing solutions with respect to a composite score of five performance measures on 75% and 100% of the analyzed PPI networks and gold standards in yeast and human, respectively, and (iii,iv) does not compromise GO semantic similarity and enrichment score of the predicted protein complexes. Therefore, our study demonstrates that clustering of networks in terms of biclique spanned subgraphs is a promising framework for detection of complexes in PPI networks.
In recent years, many efforts have been made to apply image processing techniques for plant leaf identification. However, categorizing leaf images at the cultivar/variety level, because of the very low inter-class variability, is still a challenging task. In this research, we propose an automatic discriminative method based on convolutional neural networks (CNNs) for classifying 12 different cultivars of common beans that belong to three various species. We show that employing advanced loss functions, such as Additive Angular Margin Loss and Large Margin Cosine Loss, instead of the standard softmax loss function for the classification can yield better discrimination between classes and thereby mitigate the problem of low inter-class variability. The method was evaluated by classifying species (level I), cultivars from the same species (level II), and cultivars from different species (level III), based on images from the leaf foreside and backside. The results indicate that the performance of the classification algorithm on the leaf backside image dataset is superior. The maximum mean classification accuracies of 95.86, 91.37 and 86.87% were obtained at the levels I, II and III, respectively. The proposed method outperforms the previous relevant works and provides a reliable approach for plant cultivars identification.
High annotation costs are a substantial bottleneck in applying deep learning architectures to clinically relevant use cases, substantiating the need for algorithms to learn from unlabeled data.
In this work, we propose employing self-supervised methods. To that end, we trained with three self-supervised algorithms on a large corpus of unlabeled dental images, which contained 38K bitewing radiographs (BWRs). We then applied the learned neural network representations on tooth-level dental caries classification, for which we utilized labels extracted from electronic health records (EHRs). Finally, a holdout test-set was established, which consisted of 343 BWRs and was annotated by three dental professionals and approved by a senior dentist.
This test-set was used to evaluate the fine-tuned caries classification models. Our experimental results demonstrate the obtained gains by pretraining models using self-supervised algorithms. These include improved caries classification performance (6 p.p. increase in sensitivity) and, most importantly, improved label-efficiency.
In other words, the resulting models can be fine-tuned using few labels (annotations).
Our results show that using as few as 18 annotations can produce >= 45% sensitivity, which is comparable to human-level diagnostic performance.
This study shows that self-supervision can provide gains in medical image analysis, particularly when obtaining labels is costly and expensive.
Knowledge about causal structures is crucial for decision support in various domains. For example, in discrete manufacturing, identifying the root causes of failures and quality deviations that interrupt the highly automated production process requires causal structural knowledge. However, in practice, root cause analysis is usually built upon individual expert knowledge about associative relationships. But, "correlation does not imply causation", and misinterpreting associations often leads to incorrect conclusions. Recent developments in methods for causal discovery from observational data have opened the opportunity for a data-driven examination. Despite its potential for data-driven decision support, omnipresent challenges impede causal discovery in real-world scenarios. In this thesis, we make a threefold contribution to improving causal discovery in practice.
(1) The growing interest in causal discovery has led to a broad spectrum of methods with specific assumptions on the data and various implementations. Hence, application in practice requires careful consideration of existing methods, which becomes laborious when dealing with various parameters, assumptions, and implementations in different programming languages. Additionally, evaluation is challenging due to the lack of ground truth in practice and limited benchmark data that reflect real-world data characteristics.
To address these issues, we present a platform-independent modular pipeline for causal discovery and a ground truth framework for synthetic data generation that provides comprehensive evaluation opportunities, e.g., to examine the accuracy of causal discovery methods in case of inappropriate assumptions.
(2) Applying constraint-based methods for causal discovery requires selecting a conditional independence (CI) test, which is particularly challenging in mixed discrete-continuous data omnipresent in many real-world scenarios. In this context, inappropriate assumptions on the data or the commonly applied discretization of continuous variables reduce the accuracy of CI decisions, leading to incorrect causal structures.
Therefore, we contribute a non-parametric CI test leveraging k-nearest neighbors methods and prove its statistical validity and power in mixed discrete-continuous data, as well as the asymptotic consistency when used in constraint-based causal discovery. An extensive evaluation of synthetic and real-world data shows that the proposed CI test outperforms state-of-the-art approaches in the accuracy of CI testing and causal discovery, particularly in settings with low sample sizes.
(3) To show the applicability and opportunities of causal discovery in practice, we examine our contributions in real-world discrete manufacturing use cases. For example, we showcase how causal structural knowledge helps to understand unforeseen production downtimes or adds decision support in case of failures and quality deviations in automotive body shop assembly lines.
We study the classical, two-sided stable marriage problem under pairwise preferences. In the most general setting, agents are allowed to express their preferences as comparisons of any two of their edges, and they also have the right to declare a draw or even withdraw from such a comparison. This freedom is then gradually restricted as we specify six stages of orderedness in the preferences, ending with the classical case of strictly ordered lists. We study all cases occurring when combining the three known notions of stability-weak, strong, and super-stability-under the assumption that each side of the bipartite market obtains one of the six degrees of orderedness. By designing three polynomial algorithms and two NP-completeness proofs, we determine the complexity of all cases not yet known and thus give an exact boundary in terms of preference structure between tractable and intractable cases.
Our input is a complete graph G on n vertices where each vertex has a strict ranking of all other vertices in G. The goal is to construct a matching in G that is popular. A matching M is popular if M does not lose a head-to-head election against any matching M ': here each vertex casts a vote for the matching in {M,M '} in which it gets a better assignment. Popular matchings need not exist in the given instance G and the popular matching problem is to decide whether one exists or not. The popular matching problem in G is easy to solve for odd n. Surprisingly, the problem becomes NP-complete for even n, as we show here. This is one of the few graph theoretic problems efficiently solvable when n has one parity and NP-complete when n has the other parity.
Local laws on urban policy, i.e., ordinances directly affect our daily life in various ways (health, business etc.), yet in practice, for many citizens they remain impervious and complex. This article focuses on an approach to make urban policy more accessible and comprehensible to the general public and to government officials, while also addressing pertinent social media postings. Due to the intricacies of the natural language, ranging from complex legalese in ordinances to informal lingo in tweets, it is practical to harness human judgment here. To this end, we mine ordinances and tweets via reasoning based on commonsense knowledge so as to better account for pragmatics and semantics in the text. Ours is pioneering work in ordinance mining, and thus there is no prior labeled training data available for learning. This gap is filled by commonsense knowledge, a prudent choice in situations involving a lack of adequate training data. The ordinance mining can be beneficial to the public in fathoming policies and to officials in assessing policy effectiveness based on public reactions. This work contributes to smart governance, leveraging transparency in governing processes via public involvement. We focus significantly on ordinances contributing to smart cities, hence an important goal is to assess how well an urban region heads towards a smart city as per its policies mapping with smart city characteristics, and the corresponding public satisfaction.
VLDB 2021
(2021)
The 47th International Conference on Very Large Databases (VLDB'21) was held on August 16-20, 2021 as a hybrid conference. It attracted 180 in-person attendees in Copenhagen and 840 remote attendees. In this paper, we describe our key decisions as general chairs and program committee chairs and share the lessons we learned.
Modern data analysis tasks often involve control flow statements, such as the iterations in PageRank and K-means. To achieve scalability, developers usually implement these tasks in distributed dataflow systems, such as Spark and Flink. Designers of such systems have to choose between providing imperative or functional control flow constructs to users. Imperative constructs are easier to use, but functional constructs are easier to compile to an efficient dataflow job. We propose Mitos, a system where control flow is both easy to use and efficient. Mitos relies on an intermediate representation based on the static single assignment form. This allows us to abstract away from specific control flow constructs and treat any imperative control flow uniformly both when building the dataflow job and when coordinating the distributed execution.
The devil in disguise
(2021)
Envy constitutes a serious issue on Social Networking Sites (SNSs), as this painful emotion can severely diminish individuals' well-being. With prior research mainly focusing on the affective consequences of envy in the SNS context, its behavioral consequences remain puzzling. While negative interactions among SNS users are an alarming issue, it remains unclear to which extent the harmful emotion of malicious envy contributes to these toxic dynamics. This study constitutes a first step in understanding malicious envy’s causal impact on negative interactions within the SNS sphere. Within an online experiment, we experimentally induce malicious envy and measure its immediate impact on users’ negative behavior towards other users. Our findings show that malicious envy seems to be an essential factor fueling negativity among SNS users and further illustrate that this effect is especially pronounced when users are provided an objective factor to mask their envy and justify their norm-violating negative behavior.
The metaverse is envisioned as a virtual shared space facilitated by emerging technologies such as virtual reality (VR), augmented reality (AR), the Internet of Things (IoT), 5G, artificial intelligence (AI), big data, spatial computing, and digital twins (Allam et al., 2022; Dwivedi et al., 2022; Ravenscraft, 2022; Wiles, 2022). While still a nascent concept, the metaverse has the potential to “transform the physical world, as well as transport or extend physical activities to a virtual world” (Wiles, 2022). Big data technologies will also be essential in managing the enormous amounts of data created in the metaverse (Sun et al., 2022). Metaverse technologies can offer the public sector a host of benefits, such as simplified information exchange, stronger communication with citizens, better access to public services, or benefiting from a new virtual economy. Implementations are underway in several cities around the world (Geraghty et al., 2022). In this paper, we analyze metaverse opportunities for the public sector and explore their application in the context of Germany’s Federal Employment Agency. Based on an analysis of academic literature and practical examples, we create a capability map for potential metaverse business capabilities for different areas of the public sector (broadly defined). These include education (virtual training and simulation, digital campuses that offer not just online instruction but a holistic university campus experience, etc.), tourism (virtual travel to remote locations and museums, virtual festival participation, etc.), health (employee training – as for emergency situations, virtual simulations for patient treatment – for example, for depression or anxiety, etc.), military (virtual training to experience operational scenarios without being exposed to a real-world threats, practice strategic decision-making, or gain technical knowledge for operating and repairing equipment, etc.), administrative services (document processing, virtual consultations for citizens, etc.), judiciary (AI decision-making aids, virtual proceedings, etc.), public safety (virtual training for procedural issues, special operations, or unusual situations, etc.), emergency management (training for natural disasters, etc.), and city planning (visualization of future development projects and interactive feedback, traffic management, attraction gamification, etc.), among others. We further identify several metaverse application areas for Germany's Federal Employment Agency. These applications can help it realize the goals of the German government for digital transformation that enables faster, more effective, and innovative government services. They include training of employees, training of customers, and career coaching for customers. These applications can be implemented using interactive learning games with AI agents, virtual representations of the organizational spaces, and avatars interacting with each other in these spaces. Metaverse applications will both use big data (to design the virtual environments) and generate big data (from virtual interactions). Issues related to data availability, quality, storage, processing (and related computing power requirements), interoperability, sharing, privacy and security will need to be addressed in these emerging metaverse applications (Sun et al., 2022). Special attention is needed to understand the potential for power inequities (wealth inequity, algorithmic bias, digital exclusion) due to technologies such as VR (Egliston & Carter, 2021), harmful surveillance practices (Bibri & Allam, 2022), and undesirable user behavior or negative psychological impacts (Dwivedi et al., 2022). The results of this exploratory study can inform public sector organizations of emerging metaverse opportunities and enable them to develop plans for action as more of the metaverse technologies become a reality. While the metaverse body of research is still small and research agendas are only now starting to emerge (Dwivedi et al., 2022), this study offers a building block for future development and analysis of metaverse applications.
Nowadays, production planning and control must cope with mass customization, increased fluctuations in demand, and high competition pressures. Despite prevailing market risks, planning accuracy and increased adaptability in the event of disruptions or failures must be ensured, while simultaneously optimizing key process indicators. To manage that complex task, neural networks that can process large quantities of high-dimensional data in real time have been widely adopted in recent years. Although these are already extensively deployed in production systems, a systematic review of applications and implemented agent embeddings and architectures has not yet been conducted. The main contribution of this paper is to provide researchers and practitioners with an overview of applications and applied embeddings and to motivate further research in neural agent-based production. Findings indicate that neural agents are not only deployed in diverse applications, but are also increasingly implemented in multi-agent environments or in combination with conventional methods — leveraging performances compared to benchmarks and reducing dependence on human experience. This not only implies a more sophisticated focus on distributed production resources, but also broadening the perspective from a local to a global scale. Nevertheless, future research must further increase scalability and reproducibility to guarantee a simplified transfer of results to reality.
Nowadays, production planning and control must cope with mass customization, increased fluctuations in demand, and high competition pressures. Despite prevailing market risks, planning accuracy and increased adaptability in the event of disruptions or failures must be ensured, while simultaneously optimizing key process indicators. To manage that complex task, neural networks that can process large quantities of high-dimensional data in real time have been widely adopted in recent years. Although these are already extensively deployed in production systems, a systematic review of applications and implemented agent embeddings and architectures has not yet been conducted. The main contribution of this paper is to provide researchers and practitioners with an overview of applications and applied embeddings and to motivate further research in neural agent-based production. Findings indicate that neural agents are not only deployed in diverse applications, but are also increasingly implemented in multi-agent environments or in combination with conventional methods — leveraging performances compared to benchmarks and reducing dependence on human experience. This not only implies a more sophisticated focus on distributed production resources, but also broadening the perspective from a local to a global scale. Nevertheless, future research must further increase scalability and reproducibility to guarantee a simplified transfer of results to reality.
Algorithmic management
(2022)
Algorithmic management
(2022)
Column-oriented database systems can efficiently process transactional and analytical queries on a single node. However, increasing or peak analytical loads can quickly saturate single-node database systems. Then, a common scale-out option is using a database cluster with a single primary node for transaction processing and read-only replicas. Using (the naive) full replication, queries are distributed among nodes independently of the accessed data. This approach is relatively expensive because all nodes must store all data and apply all data modifications caused by inserts, deletes, or updates.
In contrast to full replication, partial replication is a more cost-efficient implementation: Instead of duplicating all data to all replica nodes, partial replicas store only a subset of the data while being able to process a large workload share. Besides lower storage costs, partial replicas enable (i) better scaling because replicas must potentially synchronize only subsets of the data modifications and thus have more capacity for read-only queries and (ii) better elasticity because replicas have to load less data and can be set up faster. However, splitting the overall workload evenly among the replica nodes while optimizing the data allocation is a challenging assignment problem.
The calculation of optimized data allocations in a partially replicated database cluster can be modeled using integer linear programming (ILP). ILP is a common approach for solving assignment problems, also in the context of database systems. Because ILP is not scalable, existing approaches (also for calculating partial allocations) often fall back to simple (e.g., greedy) heuristics for larger problem instances. Simple heuristics may work well but can lose optimization potential.
In this thesis, we present optimal and ILP-based heuristic programming models for calculating data fragment allocations for partially replicated database clusters. Using ILP, we are flexible to extend our models to (i) consider data modifications and reallocations and (ii) increase the robustness of allocations to compensate for node failures and workload uncertainty. We evaluate our approaches for TPC-H, TPC-DS, and a real-world accounting workload and compare the results to state-of-the-art allocation approaches. Our evaluations show significant improvements for varied allocation’s properties: Compared to existing approaches, we can, for example, (i) almost halve the amount of allocated data, (ii) improve the throughput in case of node failures and workload uncertainty while using even less memory, (iii) halve the costs of data modifications, and (iv) reallocate less than 90% of data when adding a node to the cluster. Importantly, we can calculate the corresponding ILP-based heuristic solutions within a few seconds. Finally, we demonstrate that the ideas of our ILP-based heuristics are also applicable to the index selection problem.
The transversal hypergraph problem asks to enumerate the minimal hitting sets of a hypergraph. If the solutions have bounded size, Eiter and Gottlob [SICOMP'95] gave an algorithm running in output-polynomial time, but whose space requirement also scales with the output. We improve this to polynomial delay and space. Central to our approach is the extension problem, deciding for a set X of vertices whether it is contained in any minimal hitting set. We show that this is one of the first natural problems to be W[3]-complete. We give an algorithm for the extension problem running in time O(m(vertical bar X vertical bar+1) n) and prove a SETH-lower bound showing that this is close to optimal. We apply our enumeration method to the discovery problem of minimal unique column combinations from data profiling. Our empirical evaluation suggests that the algorithm outperforms its worst-case guarantees on hypergraphs stemming from real-world databases.
Viper
(2021)
Key-value stores (KVSs) have found wide application in modern software systems. For persistence, their data resides in slow secondary storage, which requires KVSs to employ various techniques to increase their read and write performance from and to the underlying medium. Emerging persistent memory (PMem) technologies offer data persistence at close-to-DRAM speed, making them a promising alternative to classical disk-based storage. However, simply drop-in replacing existing storage with PMem does not yield good results, as block-based access behaves differently in PMem than on disk and ignores PMem's byte addressability, layout, and unique performance characteristics. In this paper, we propose three PMem-specific access patterns and implement them in a hybrid PMem-DRAM KVS called Viper. We employ a DRAM-based hash index and a PMem-aware storage layout to utilize the random-write speed of DRAM and efficient sequential-write performance PMem. Our evaluation shows that Viper significantly outperforms existing KVSs for core KVS operations while providing full data persistence. Moreover, Viper outperforms existing PMem-only, hybrid, and disk-based KVSs by 4-18x for write workloads, while matching or surpassing their get performance.
Selbstbestimmtes Lernen mit Onlinekursen findet zunehmend mehr Akzeptanz in unserer Gesellschaft. Lernende können mithilfe von Onlinekursen selbst festlegen, was sie wann lernen und Kurse können durch vielfältige Adaptionen an den Lernfortschritt der Nutzer angepasst und individualisiert werden. Auf der einen Seite ist eine große Zielgruppe für diese Lernangebote vorhanden. Auf der anderen Seite sind die Erstellung von Onlinekursen, ihre Bereitstellung, Wartung und Betreuung kostenintensiv, wodurch hochwertige Angebote häufig kostenpflichtig angeboten werden müssen, um als Anbieter zumindest kostenneutral agieren zu können. In diesem Beitrag erörtern und diskutieren wir ein offenes, nachhaltiges datengetriebenes zweiseitiges Geschäftsmodell zur Verwertung geprüfter Onlinekurse und deren kostenfreie Bereitstellung für jeden Lernenden. Kern des Geschäftsmodells ist die Nutzung der dabei entstehenden Verhaltensdaten, die daraus mögliche Ableitung von Persönlichkeitsmerkmalen und Interessen und deren Nutzung im kommerziellen Kontext. Dies ist eine bei der Websuche bereits weitläufig akzeptierte Methode, welche nun auf den Lernkontext übertragen wird. Welche Möglichkeiten, Herausforderungen, aber auch Barrieren überwunden werden müssen, damit das Geschäftsmodell nachhaltig und ethisch vertretbar funktioniert, werden zwei unabhängige, jedoch synergetisch verbundene Geschäftsmodelle vorgestellt und diskutiert. Zusätzlich wurde die Akzeptanz und Erwartung der Zielgruppe für das vorgestellte Geschäftsmodell untersucht, um notwendige Kernressourcen für die Praxis abzuleiten. Die Ergebnisse der Untersuchung zeigen, dass das Geschäftsmodell von den Nutzer*innen grundlegend akzeptiert wird. 10 % der Befragten würden es bevorzugen, mit virtuellen Assistenten – anstelle mit Tutor*innen zu lernen. Zudem ist der Großteil der Nutzer*innen sich nicht darüber bewusst, dass Persönlichkeitsmerkmale anhand des Nutzerverhaltens abgeleitet werden können.