publish.UP Search

Mise-Unseen (2019)

Marwecki, Sebastian ; Wilson, Andrew D. ; Ofek, Eyal ; Franco, Mar Gonzalez ; Holz, Christian

Creating or arranging objects at runtime is needed in many virtual reality applications, but such changes are noticed when they occur inside the user's field of view. We present Mise-Unseen, a software system that applies such scene changes covertly inside the user's field of view. Mise-Unseen leverages gaze tracking to create models of user attention, intention, and spatial memory to determine if and when to inject a change. We present seven applications of Mise-Unseen to unnoticeably modify the scene within view (i) to hide that task difficulty is adapted to the user, (ii) to adapt the experience to the user's preferences, (iii) to time the use of low fidelity effects, (iv) to detect user choice for passive haptics even when lacking physical props, (v) to sustain physical locomotion despite a lack of physical space, (vi) to reduce motion sickness during virtual locomotion, and (vii) to verify user understanding during story progression. We evaluated Mise-Unseen and our applications in a user study with 15 participants and find that while gaze data indeed supports obfuscating changes inside the field of view, a change is rendered unnoticeably by using gaze in combination with common masking techniques.

Editorial (2019)

Björk, Jennie ; Hölze, Katharina

Discovering commute patterns via process mining (2019)

Yousfi, Alaaeddine ; Weske, Mathias

Ubiquitous computing has proven its relevance and efficiency in improving the user experience across a myriad of situations. It is now the ineluctable solution to keep pace with the ever-changing environments in which current systems operate. Despite the achievements of ubiquitous computing, this discipline is still overlooked in business process management. This is surprising, since many of today’s challenges, in this domain, can be addressed by methods and techniques from ubiquitous computing, for instance user context and dynamic aspects of resource locations. This paper takes a first step to integrate methods and techniques from ubiquitous computing in business process management. To do so, we propose discovering commute patterns via process mining. Through our proposition, we can deduce the users’ significant locations, routes, travel times and travel modes. This information can be a stepping-stone toward helping the business process management community embrace the latest achievements in ubiquitous computing, mainly in location-based service. To corroborate our claims, a user study was conducted. The significant places, routes, travel modes and commuting times of our test subjects were inferred with high accuracies. All in all, ubiquitous computing can enrich the processes with new capabilities that go beyond what has been established in business process management so far.

Interactive Close-Up Rendering for Detail plus Overview Visualization of 3D Digital Terrain Models (2019)

Trapp, Matthias ; Döllner, Jürgen Roland Friedrich

This paper presents an interactive rendering technique for detail+overview visualization of 3D digital terrain models using interactive close-ups. A close-up is an alternative presentation of input data varying with respect to geometrical scale, mapping, appearance, as well as Level-of-Detail (LOD) and Level-of-Abstraction (LOA) used. The presented 3D close-up approach enables in-situ comparison of multiple Regionof-Interests (ROIs) simultaneously. We describe a GPU-based rendering technique for the image-synthesis of multiple close-ups in real-time.

Semantic-driven Visualization Techniques for Interactive Exploration of 3D Indoor Models (2019)

Florio, Alessandro ; Trapp, Matthias ; Döllner, Jürgen Roland Friedrich

The availability of detailed virtual 3D building models including representations of indoor elements, allows for a wide number of applications requiring effective exploration and navigation functionality. Depending on the application context, users should be enabled to focus on specific Objects-of-Interests (OOIs) or important building elements. This requires approaches to filtering building parts as well as techniques to visualize important building objects and their relations. For it, this paper explores the application and combination of interactive rendering techniques as well as their semanticallydriven configuration in the context of 3D indoor models.

Real-time Screen-space Geometry Draping for 3D Digital Terrain Models (2019)

Trapp, Matthias ; Döllner, Jürgen Roland Friedrich

A fundamental task in 3D geovisualization and GIS applications is the visualization of vector data that can represent features such as transportation networks or land use coverage. Mapping or draping vector data represented by geometric primitives (e.g., polylines or polygons) to 3D digital elevation or 3D digital terrain models is a challenging task. We present an interactive GPU-based approach that performs geometry-based draping of vector data on per-frame basis using an image-based representation of a 3D digital elevation or terrain model only.

Kyub (2019)

Baudisch, Patrick Markus ; Silber, Arthur ; Kommana, Yannis ; Gruner, Milan ; Wall, Ludwig ; Reuss, Kevin ; Heilman, Lukas ; Kovacs, Robert ; Rechlitz, Daniel ; Roumen, Thijs

We present an interactive editing system for laser cutting called kyub. Kyub allows users to create models efficiently in 3D, which it then unfolds into the 2D plates laser cutters expect. Unlike earlier systems, such as FlatFitFab, kyub affords construction based on closed box structures, which allows users to turn very thin material, such as 4mm plywood, into objects capable of withstanding large forces, such as chairs users can actually sit on. To afford such sturdy construction, every kyub project begins with a simple finger-joint "boxel"-a structure we found to be capable of withstanding over 500kg of load. Users then extend their model by attaching additional boxels. Boxels merge automatically, resulting in larger, yet equally strong structures. While the concept of stacking boxels allows kyub to offer the strong affordance and ease of use of a voxel-based editor, boxels are not confined to a grid and readily combine with kuyb's various geometry deformation tools. In our technical evaluation, objects built with kyub withstood hundreds of kilograms of loads. In our user study, non-engineers rated the learnability of kyub 6.1/7.

Understanding Metamaterial Mechanisms (2019)

Ion, Alexandra ; Lindlbauer, David ; Herholz, Philipp ; Alexa, Marc ; Baudisch, Patrick Markus

In this paper, we establish the underlying foundations of mechanisms that are composed of cell structures-known as metamaterial mechanisms. Such metamaterial mechanisms were previously shown to implement complete mechanisms in the cell structure of a 3D printed material, without the need for assembly. However, their design is highly challenging. A mechanism consists of many cells that are interconnected and impose constraints on each other. This leads to unobvious and non-linear behavior of the mechanism, which impedes user design. In this work, we investigate the underlying topological constraints of such cell structures and their influence on the resulting mechanism. Based on these findings, we contribute a computational design tool that automatically creates a metamaterial mechanism from user-defined motion paths. This tool is only feasible because our novel abstract representation of the global constraints highly reduces the search space of possible cell arrangements.

Software Engineering for Smart Cyber-Physical Systems (2019)

Giese, Holger Burkhard

Currently, a transformation of our technical world into a networked technical world where besides the embedded systems with their interaction with the physical world the interconnection of these nodes in the cyber world becomes a reality can be observed. In parallel nowadays there is a strong trend to employ artificial intelligence techniques and in particular machine learning to make software behave smart. Often cyber-physical systems must be self-adaptive at the level of the individual systems to operate as elements in open, dynamic, and deviating overall structures and to adapt to open and dynamic contexts while being developed, operated, evolved, and governed independently. In this presentation, we will first discuss the envisioned future scenarios for cyber-physical systems with an emphasis on the synergies networking can offer and then characterize which challenges for the design, production, and operation of these systems result. We will then discuss to what extent our current capabilities, in particular concerning software engineering match these challenges and where substantial improvements for the software engineering are crucial. In today's software engineering for embedded systems models are used to plan systems upfront to maximize envisioned properties on the one hand and minimize cost on the other hand. When applying the same ideas to software for smart cyber-physical systems, it soon turned out that for these systems often somehow more subtle links between the involved models and the requirements, users, and environment exist. Self-adaptation and runtime models have been advocated as concepts to covers the demands that result from these subtler links. Lately, both trends have been brought together more thoroughly by the notion of self-aware computing systems. We will review the underlying causes, discuss some our work in this direction, and outline related open challenges and potential for future approaches to software engineering for smart cyber-physical systems.

On the tree conjecture for the network creation game (2019)

Bilò, Davide ; Lenzner, Pascal

Selfish Network Creation focuses on modeling real world networks from a game-theoretic point of view. One of the classic models by Fabrikant et al. (2003) is the network creation game, where agents correspond to nodes in a network which buy incident edges for the price of alpha per edge to minimize their total distance to all other nodes. The model is well-studied but still has intriguing open problems. The most famous conjectures state that the price of anarchy is constant for all alpha and that for alpha >= n all equilibrium networks are trees. We introduce a novel technique for analyzing stable networks for high edge-price alpha and employ it to improve on the best known bound for the latter conjecture. In particular we show that for alpha > 4n - 13 all equilibrium networks must be trees, which implies a constant price of anarchy for this range of alpha. Moreover, we also improve the constant upper bound on the price of anarchy for equilibrium trees.

Transforming pairwise duplicates to entity clusters for high-quality duplicate detection (2019)

Draisbach, Uwe ; Christen, Peter ; Naumann, Felix

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.

Concepts and techniques for web-based visualization and processing of massive 3D point clouds with semantics (2019)

Discher, Sören ; Richter, Rico ; Döllner, Jürgen Roland Friedrich

3D point cloud technology facilitates the automated and highly detailed acquisition of real-world environments such as assets, sites, and countries. We present a web-based system for the interactive exploration and inspection of arbitrary large 3D point clouds. Our approach is able to render 3D point clouds with billions of points using spatial data structures and level-of-detail representations. Point-based rendering techniques and post-processing effects are provided to enable task-specific and data-specific filtering, e.g., based on semantics. A set of interaction techniques allows users to collaboratively work with the data (e.g., measuring distances and annotating). Additional value is provided by the system’s ability to display additional, context-providing geodata alongside 3D point clouds and to integrate processing and analysis operations. We have evaluated the presented techniques and in case studies and with different data sets from aerial, mobile, and terrestrial acquisition with up to 120 billion points to show their practicality and feasibility.

SpringFit (2019)

Roumen, Thijs ; Shigeyama, Jotaro ; Rudolph, Julius Cosmo Romeo ; Grzelka, Felix ; Baudisch, Patrick

Joints are crucial to laser cutting as they allow making three-dimensional objects; mounts are crucial because they allow embedding technical components, such as motors. Unfortunately, mounts and joints tend to fail when trying to fabricate a model on a different laser cutter or from a different material. The reason for this lies in the way mounts and joints hold objects in place, which is by forcing them into slightly smaller openings. Such "press fit" mechanisms unfortunately are susceptible to the small changes in diameter that occur when switching to a machine that removes more or less material ("kerf"), as well as to changes in stiffness, as they occur when switching to a different material. We present a software tool called springFit that resolves this problem by replacing the problematic press fit-based mounts and joints with what we call cantilever-based mounts and joints. A cantilever spring is simply a long thin piece of material that pushes against the object to be held. Unlike press fits, cantilever springs are robust against variations in kerf and material; they can even handle very high variations, simply by using longer springs. SpringFit converts models in the form of 2D cutting plans by replacing all contained mounts, notch joints, finger joints, and t-joints. In our technical evaluation, we used springFit to convert 14 models downloaded from the web.

Using Hidden Markov Models for the accurate linguistic analysis of process model activity labels (2019)

Leopold, Henrik ; van der Aa, Han ; Offenberg, Jelmer ; Reijers, Hajo A.

Many process model analysis techniques rely on the accurate analysis of the natural language contents captured in the models’ activity labels. Since these labels are typically short and diverse in terms of their grammatical style, standard natural language processing tools are not suitable to analyze them. While a dedicated technique for the analysis of process model activity labels was proposed in the past, it suffers from considerable limitations. First of all, its performance varies greatly among data sets with different characteristics and it cannot handle uncommon grammatical styles. What is more, adapting the technique requires in-depth domain knowledge. We use this paper to propose a machine learning-based technique for activity label analysis that overcomes the issues associated with this rule-based state of the art. Our technique conceptualizes activity label analysis as a tagging task based on a Hidden Markov Model. By doing so, the analysis of activity labels no longer requires the manual specification of rules. An evaluation using a collection of 15,000 activity labels demonstrates that our machine learning-based technique outperforms the state of the art in all aspects.

Performance evaluation for self-healing systems (2019)

Ghahremani, Sona ; Giese, Holger

Evaluating the performance of self-adaptive systems (SAS) is challenging due to their complexity and interaction with the often highly dynamic environment. In the context of self-healing systems (SHS), employing simulators has been shown to be the most dominant means for performance evaluation. Simulating a SHS also requires realistic fault injection scenarios. We study the state of the practice for evaluating the performance of SHS by means of a systematic literature review. We present the current practice and point out that a more thorough and careful treatment in evaluating the performance of SHS is required.

Adding Value by Combining Business and Sensor Data (2019)

Hesse, Günter ; Matthies, Christoph ; Sinzig, Werner ; Uflacker, Matthias

Industry 4.0 and the Internet of Things are recent developments that have lead to the creation of new kinds of manufacturing data. Linking this new kind of sensor data to traditional business information is crucial for enterprises to take advantage of the data’s full potential. In this paper, we present a demo which allows experiencing this data integration, both vertically between technical and business contexts and horizontally along the value chain. The tool simulates a manufacturing company, continuously producing both business and sensor data, and supports issuing ad-hoc queries that answer specific questions related to the business. In order to adapt to different environments, users can configure sensor characteristics to their needs.

Virtual machine integrity verification in Crowd-Resourcing Virtual Laboratory (2019)

Sianipar, Johannes Harungguan ; Willems, Christian ; Meinel, Christoph

In cloud computing, users are able to use their own operating system (OS) image to run a virtual machine (VM) on a remote host. The virtual machine OS is started by the user using some interfaces provided by a cloud provider in public or private cloud. In peer to peer cloud, the VM is started by the host admin. After the VM is running, the user could get a remote access to the VM to install, configure, and run services. For the security reasons, the user needs to verify the integrity of the running VM, because a malicious host admin could modify the image or even replace the image with a similar image, to be able to get sensitive data from the VM. We propose an approach to verify the integrity of a running VM on a remote host, without using any specific hardware such as Trusted Platform Module (TPM). Our approach is implemented on a Linux platform where the kernel files (vmlinuz and initrd) could be replaced with new files, while the VM is running. kexec is used to reboot the VM with the new kernel files. The new kernel has secret codes that will be used to verify whether the VM was started using the new kernel files. The new kernel is used to further measuring the integrity of the running VM.

Attribute Compartmentation and Greedy UCC Discovery for High-Dimensional Data Anonymisation (2019)

Podlesny, Nikolai Jannik ; Kayem, Anne V. D. M. ; Meinel, Christoph

High-dimensional data is particularly useful for data analytics research. In the healthcare domain, for instance, high-dimensional data analytics has been used successfully for drug discovery. Yet, in order to adhere to privacy legislation, data analytics service providers must guarantee anonymity for data owners. In the context of high-dimensional data, ensuring privacy is challenging because increased data dimensionality must be matched by an exponential growth in the size of the data to avoid sparse datasets. Syntactically, anonymising sparse datasets with methods that rely of statistical significance, makes obtaining sound and reliable results, a challenge. As such, strong privacy is only achievable at the cost of high information loss, rendering the data unusable for data analytics. In this paper, we make two contributions to addressing this problem from both the privacy and information loss perspectives. First, we show that by identifying dependencies between attribute subsets we can eliminate privacy violating attributes from the anonymised dataset. Second, to minimise information loss, we employ a greedy search algorithm to determine and eliminate maximal partial unique attribute combinations. Thus, one only needs to find the minimal set of identifying attributes to prevent re-identification. Experiments on a health cloud based on the SAP HANA platform using a semi-synthetic medical history dataset comprised of 109 attributes, demonstrate the effectiveness of our approach.

Deep En-Route Filtering of Constrained Application Protocol (CoAP) Messages on 6LoWPAN Border Routers (2019)

Seidel, Felix ; Krentz, Konrad-Felix ; Meinel, Christoph

Devices on the Internet of Things (IoT) are usually battery-powered and have limited resources. Hence, energy-efficient and lightweight protocols were designed for IoT devices, such as the popular Constrained Application Protocol (CoAP). Yet, CoAP itself does not include any defenses against denial-of-sleep attacks, which are attacks that aim at depriving victim devices of entering low-power sleep modes. For example, a denial-of-sleep attack against an IoT device that runs a CoAP server is to send plenty of CoAP messages to it, thereby forcing the IoT device to expend energy for receiving and processing these CoAP messages. All current security solutions for CoAP, namely Datagram Transport Layer Security (DTLS), IPsec, and OSCORE, fail to prevent such attacks. To fill this gap, Seitz et al. proposed a method for filtering out inauthentic and replayed CoAP messages "en-route" on 6LoWPAN border routers. In this paper, we expand on Seitz et al.'s proposal in two ways. First, we revise Seitz et al.'s software architecture so that 6LoWPAN border routers can not only check the authenticity and freshness of CoAP messages, but can also perform a wide range of further checks. Second, we propose a couple of such further checks, which, as compared to Seitz et al.'s original checks, more reliably protect IoT devices that run CoAP servers from remote denial-of-sleep attacks, as well as from remote exploits. We prototyped our solution and successfully tested its compatibility with Contiki-NG's CoAP implementation.

LoANs (2019)

Bartz, Christian ; Yang, Haojin ; Bethge, Joseph ; Meinel, Christoph

Recently, deep neural networks have achieved remarkable performance on the task of object detection and recognition. The reason for this success is mainly grounded in the availability of large scale, fully annotated datasets, but the creation of such a dataset is a complicated and costly task. In this paper, we propose a novel method for weakly supervised object detection that simplifies the process of gathering data for training an object detector. We train an ensemble of two models that work together in a student-teacher fashion. Our student (localizer) is a model that learns to localize an object, the teacher (assessor) assesses the quality of the localization and provides feedback to the student. The student uses this feedback to learn how to localize objects and is thus entirely supervised by the teacher, as we are using no labels for training the localizer. In our experiments, we show that our model is very robust to noise and reaches competitive performance compared to a state-of-the-art fully supervised approach. We also show the simplicity of creating a new dataset, based on a few videos (e.g. downloaded from YouTube) and artificially generated data.

Unified Cloud Access Control Model for Cloud Storage Broker (2019)

Sukmana, Muhammad Ihsan Haikal ; Torkura, Kennedy A. ; Graupner, Hendrik ; Cheng, Feng ; Meinel, Christoph

Cloud Storage Broker (CSB) provides value-added cloud storage service for enterprise usage by leveraging multi-cloud storage architecture. However, it raises several challenges for managing resources and its access control in multiple Cloud Service Providers (CSPs) for authorized CSB stakeholders. In this paper we propose unified cloud access control model that provides the abstraction of CSP's services for centralized and automated cloud resource and access control management in multiple CSPs. Our proposal offers role-based access control for CSB stakeholders to access cloud resources by assigning necessary privileges and access control list for cloud resources and CSB stakeholders, respectively, following privilege separation concept and least privilege principle. We implement our unified model in a CSB system called CloudRAID for Business (CfB) with the evaluation result shows it provides system-and-cloud level security service for cfB and centralized resource and access control management in multiple CSPs.

Full Lecture Recording Watching Behavior, or Why Students Watch 90-Min Lectures in 5 Min (2019)

Bauer, Matthias ; Malchow, Martin ; Meinel, Christoph

Many universities record the lectures being held in their facilities to preserve knowledge and to make it available to their students and, at least for some universities and classes, to the broad public. The way with the least effort is to record the whole lecture, which in our case usually is 90 min long. This saves the labor and time of cutting and rearranging lectures scenes to provide short learning videos as known from Massive Open Online Courses (MOOCs), etc. Many lecturers fear that recording their lectures and providing them via an online platform might lead to less participation in the actual lecture. Also, many teachers fear that the lecture recordings are not used with the same focus and dedication as lectures in a lecture hall. In this work, we show that in our experience, full lectures have an average watching duration of just a few minutes and explain the reasons for that and why, in most cases, teachers do not have to worry about that.

A quantifiable trustmModel for Blockchain-based identity management (2019)

Grüner, Andreas ; Mühle, Alexander ; Gayvoronskaya, Tatiana ; Meinel, Christoph

Integrating Biological Context into the Analysis of Gene Expression Data (2019)

Perscheid, Cindy ; Uflacker, Matthias

High-throughput RNA sequencing produces large gene expression datasets whose analysis leads to a better understanding of diseases like cancer. The nature of RNA-Seq data poses challenges to its analysis in terms of its high dimensionality, noise, and complexity of the underlying biological processes. Researchers apply traditional machine learning approaches, e. g. hierarchical clustering, to analyze this data. Until it comes to validation of the results, the analysis is based on the provided data only and completely misses the biological context. However, gene expression data follows particular patterns - the underlying biological processes. In our research, we aim to integrate the available biological knowledge earlier in the analysis process. We want to adapt state-of-the-art data mining algorithms to consider the biological context in their computations and deliver meaningful results for researchers.

From face to face (2019)

Drimalla, Hanna ; Landwehr, Niels ; Hess, Ursula ; Dziobek, Isabel

Despite advances in the conceptualisation of facial mimicry, its role in the processing of social information is a matter of debate. In the present study, we investigated the relationship between mimicry and cognitive and emotional empathy. To assess mimicry, facial electromyography was recorded for 70 participants while they completed the Multifaceted Empathy Test, which presents complex context-embedded emotional expressions. As predicted, inter-individual differences in emotional and cognitive empathy were associated with the level of facial mimicry. For positive emotions, the intensity of the mimicry response scaled with the level of state emotional empathy. Mimicry was stronger for the emotional empathy task compared to the cognitive empathy task. The specific empathy condition could be successfully detected from facial muscle activity at the level of single individuals using machine learning techniques. These results support the view that mimicry occurs depending on the social context as a tool to affiliate and it is involved in cognitive as well as emotional empathy.

On integrating design thinking for human-centered requirements engineering (2019)

Hehn, Jennifer ; Mendez, Daniel ; Uebernickel, Falk ; Brenner, Walter ; Broy, Manfred

We elaborate on the possibilities and needs to integrate design thinking into requirements engineering, drawing from our research and project experiences. We suggest three approaches for tailoring and integrating design thinking and requirements engineering with complementary synergies and point at open challenges for research and practice.

An Exploratory Study to Detect Temporal Orientation Using Bluetooth's sensor (2019)

Hernandez, Netzahualcoyotl ; Demiray, Burcu ; Arnrich, Bert ; Favela, Jesus

Mobile sensing technology allows us to investigate human behaviour on a daily basis. In the study, we examined temporal orientation, which refers to the capacity of thinking or talking about personal events in the past and future. We utilise the mksense platform that allows us to use the experience-sampling method. Individual's thoughts and their relationship with smartphone's Bluetooth data is analysed to understand in which contexts people are influenced by social environments, such as the people they spend the most time with. As an exploratory study, we analyse social condition influence through a collection of Bluetooth data and survey information from participant's smartphones. Preliminary results show that people are likely to focus on past events when interacting with close-related people, and focus on future planning when interacting with strangers. Similarly, people experience present temporal orientation when accompanied by known people. We believe that these findings are linked to emotions since, in its most basic state, emotion is a state of physiological arousal combined with an appropriated cognition. In this contribution, we envision a smartphone application for automatically inferring human emotions based on user's temporal orientation by using Bluetooth sensors, we briefly elaborate on the influential factor of temporal orientation episodes and conclude with a discussion and lessons learned.

Magic mirror in my hand, which is the best in the land? (2020)

Kossmann, Jan ; Halfpap, Stefan ; Jankrift, Marcel ; Schlosser, Rainer

Indexes are essential for the efficient processing of database workloads. Proposed solutions for the relevant and challenging index selection problem range from metadata-based simple heuristics, over sophisticated multi-step algorithms, to approaches that yield optimal results. The main challenges are (i) to accurately determine the effect of an index on the workload cost while considering the interaction of indexes and (ii) a large number of possible combinations resulting from workloads containing many queries and massive schemata with possibly thousands of attributes. In this work, we describe and analyze eight index selection algorithms that are based on different concepts and compare them along different dimensions, such as solution quality, runtime, multi-column support, solution granularity, and complexity. In particular, we analyze the solutions of the algorithms for the challenging analytical Join Order, TPC-H, and TPC-DS benchmarks. Afterward, we assess strengths and weaknesses, infer insights for index selection in general and each approach individually, before we give recommendations on when to use which approach.

MDedup (2020)

Koumarelas, Ioannis ; Papenbrock, Thorsten ; Naumann, Felix

Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific, which is a problem if a new dataset needs to be cleaned. For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain or target data-specific configuration.

Risk-sensitive control of Markov decision processes (2020)

Schlosser, Rainer

In many revenue management applications risk-averse decision-making is crucial. In dynamic settings, however, it is challenging to find the right balance between maximizing expected rewards and minimizing various kinds of risk. In existing approaches utility functions, chance constraints, or (conditional) value at risk considerations are used to influence the distribution of rewards in a preferred way. Nevertheless, common techniques are not flexible enough and typically numerically complex. In our model, we exploit the fact that a distribution is characterized by its mean and higher moments. We present a multi-valued dynamic programming heuristic to compute risk-sensitive feedback policies that are able to directly control the moments of future rewards. Our approach is based on recursive formulations of higher moments and does not require an extension of the state space. Finally, we propose a self-tuning algorithm, which allows to identify feedback policies that approximate predetermined (risk-sensitive) target distributions. We illustrate the effectiveness and the flexibility of our approach for different dynamic pricing scenarios. (C) 2020 Elsevier Ltd. All rights reserved.

Data preparation for duplicate detection (2020)

Koumarelas, Ioannis ; Jiang, Lan ; Naumann, Felix

Data errors represent a major issue in most application workflows. Before any important task can take place, a certain data quality has to be guaranteed by eliminating a number of different errors that may appear in data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by applying data preparation operations under a systematic approach prior to performing duplicate detection. Our process workflow can be summarized as follows: It begins with the user providing as input a sample of the gold standard, the actual dataset, and optionally some constraints to domain-specific data preparations, such as address normalization. The preparation selection operates in two consecutive phases. First, to vastly reduce the search space of ineffective data preparations, decisions are made based on the improvement or worsening of pair similarities. Second, using the remaining data preparations an iterative leave-one-out classification process removes preparations one by one and determines the redundant preparations based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection up to 19% in AUC-PR.

Self-driving database systems (2020)

Kossmann, Jan ; Schlosser, Rainer

Challenges for self-driving database systems, which tune their physical design and configuration autonomously, are manifold: Such systems have to anticipate future workloads, find robust configurations efficiently, and incorporate knowledge gained by previous actions into later decisions. We present a component-based framework for self-driving database systems that enables database integration and development of self-managing functionality with low overhead by relying on separation of concerns. By keeping the components of the framework reusable and exchangeable, experiments are simplified, which promotes further research in that area. Moreover, to optimize multiple mutually dependent features, e.g., index selection and compression configurations, we propose a linear programming (LP) based algorithm to derive an efficient tuning order automatically. Afterwards, we demonstrate the applicability and scalability of our approach with reproducible examples.

Geospatial artificial intelligence (2020)

Döllner, Jürgen Roland Friedrich

Artificial intelligence (AI) is changing fundamentally the way how IT solutions are implemented and operated across all application domains, including the geospatial domain. This contribution outlines AI-based techniques for 3D point clouds and geospatial digital twins as generic components of geospatial AI. First, we briefly reflect on the term "AI" and outline technology developments needed to apply AI to IT solutions, seen from a software engineering perspective. Next, we characterize 3D point clouds as key category of geodata and their role for creating the basis for geospatial digital twins; we explain the feasibility of machine learning (ML) and deep learning (DL) approaches for 3D point clouds. In particular, we argue that 3D point clouds can be seen as a corpus with similar properties as natural language corpora and formulate a "Naturalness Hypothesis" for 3D point clouds. In the main part, we introduce a workflow for interpreting 3D point clouds based on ML/DL approaches that derive domain-specific and application-specific semantics for 3D point clouds without having to create explicit spatial 3D models or explicit rule sets. Finally, examples are shown how ML/DL enables us to efficiently build and maintain base data for geospatial digital twins such as virtual 3D city models, indoor models, or building information models.

More than a quarter century of creativity and innovation management (2020)

Rose, Robert ; Hölzle, Katharina ; Björk, Jennie

When this journal was founded in 1992 by Tudor Rickards and Susan Moger, there was no academic outlet available that addressed issues at the intersection of creativity and innovation. From zero to 1,163 records, from the new kid on the block to one of the leading journals in creativity and innovation management has been quite a journey, and we would like to reflect on the past 28 years and the intellectual and conceptual structure of Creativity and Innovation Management (CIM). Specifically, we highlight milestones and influential articles, identify how key journal characteristics evolved, outline the (co-)authorship structure, and finally, map the thematic landscape of CIM by means of a text-mining analysis. This study represents the first systematic and comprehensive assessment of the journal's published body of knowledge and helps to understand the journal's influence on the creativity and innovation management community. We conclude by discussing future topics and paths of the journal as well as limitations of our approach.

IMDfence (2020)

Siddiqi, Muhammad Ali ; Dörr, Christian ; Strydis, Christos

Over the past decade, focus on the security and privacy aspects of implantable medical devices (IMDs) has intensified, driven by the multitude of cybersecurity vulnerabilities found in various existing devices. However, due to their strict computational, energy and physical constraints, conventional security protocols are not directly applicable to IMDs. Custom-tailored schemes have been proposed instead which, however, fail to cover the full spectrum of security features that modern IMDs and their ecosystems so critically require. In this paper we propose IMDfence, a security protocol for IMD ecosystems that provides a comprehensive yet practical security portfolio, which includes availability, non-repudiation, access control, entity authentication, remote monitoring and system scalability. The protocol also allows emergency access that results in the graceful degradation of offered services without compromising security and patient safety. The performance of the security protocol as well as its feasibility and impact on modern IMDs are extensively analyzed and evaluated. We find that IMDfence achieves the above security requirements at a mere less than 7% increase in total IMD energy consumption, and less than 14 ms and 9 kB increase in system delay and memory footprint, respectively.

Data Preparation (2020)

Hameed, Mazhar ; Naumann, Felix

Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation. To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools and, (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.

Greed is good for deterministic scale-free networks (2020)

Chauhan, Ankit ; Friedrich, Tobias ; Rothenberger, Ralf

Large real-world networks typically follow a power-law degree distribution. To study such networks, numerous random graph models have been proposed. However, real-world networks are not drawn at random. Therefore, Brach et al. (27th symposium on discrete algorithms (SODA), pp 1306-1325, 2016) introduced two natural deterministic conditions: (1) a power-law upper bound on the degree distribution (PLB-U) and (2) power-law neighborhoods, that is, the degree distribution of neighbors of each vertex is also upper bounded by a power law (PLB-N). They showed that many real-world networks satisfy both properties and exploit them to design faster algorithms for a number of classical graph problems. We complement their work by showing that some well-studied random graph models exhibit both of the mentioned PLB properties. PLB-U and PLB-N hold with high probability for Chung-Lu Random Graphs and Geometric Inhomogeneous Random Graphs and almost surely for Hyperbolic Random Graphs. As a consequence, all results of Brach et al. also hold with high probability or almost surely for those random graph classes. In the second part we study three classical NP-hard optimization problems on PLB networks. It is known that on general graphs with maximum degree Delta, a greedy algorithm, which chooses nodes in the order of their degree, only achieves a Omega (ln Delta)-approximation forMinimum Vertex Cover and Minimum Dominating Set, and a Omega(Delta)-approximation forMaximum Independent Set. We prove that the PLB-U property with beta>2 suffices for the greedy approach to achieve a constant-factor approximation for all three problems. We also show that these problems are APX-hard even if PLB-U, PLB-N, and an additional power-law lower bound on the degree distribution hold. Hence, a PTAS cannot be expected unless P = NP. Furthermore, we prove that all three problems are in MAX SNP if the PLB-U property holds.

Hybrid search plan generation for generalized graph pattern matching (2020)

Barkowsky, Matthias ; Giese, Holger

In recent years, the increased interest in application areas such as social networks has resulted in a rising popularity of graph-based approaches for storing and processing large amounts of interconnected data. To extract useful information from the growing network structures, efficient querying techniques are required. In this paper, we propose an approach for graph pattern matching that allows a uniform handling of arbitrary constraints over the query vertices. Our technique builds on a previously introduced matching algorithm, which takes concrete host graph information into account to dynamically adapt the employed search plan during query execution. The dynamic algorithm is combined with an existing static approach for search plan generation, resulting in a hybrid technique which we further extend by a more sophisticated handling of filtering effects caused by constraint checks. We evaluate the presented concepts empirically based on an implementation for our graph pattern matching tool, the Story Diagram Interpreter, with queries and data provided by the LDBC Social Network Benchmark. Our results suggest that the hybrid technique may improve search efficiency in several cases, and rarely reduces efficiency.

Scalable relaxation techniques to solve stochastic dynamic multi-product pricing problems with substitution effects (2020)

Schlosser, Rainer

In many businesses, firms are selling different types of products, which share mutual substitution effects in demand. To compute effective pricing strategies is challenging as the sales probabilities of each of a firm's products can also be affected by the prices of potential substitutes. In this paper, we analyze stochastic dynamic multi-product pricing models for the sale of perishable goods. To circumvent the limitations of time-consuming optimal solutions for highly complex models, we propose different relaxation techniques, which allow to reduce the size of critical model components, such as the state space, the action space, or the set of potential sales events. Our heuristics are able to decrease the size of those components by forming corresponding clusters and using subsets of representative elements. Using numerical examples, we verify that our heuristics make it possible to dramatically reduce the computation time while still obtaining close-to-optimal expected profits. Further, we show that our heuristics are (i) flexible, (ii) scalable, and (iii) can be arbitrarily combined in a mutually supportive way.

Efficient discovery of matching dependencies (2020)

Schirmer, Philipp ; Papenbrock, Thorsten ; Koumarelas, Ioannis ; Naumann, Felix

Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.

Explainable AI under contract and tort law (2020)

Hacker, Philipp ; Krestel, Ralf ; Grundmann, Stefan ; Naumann, Felix

This paper shows that the law, in subtle ways, may set hitherto unrecognized incentives for the adoption of explainable machine learning applications. In doing so, we make two novel contributions. First, on the legal side, we show that to avoid liability, professional actors, such as doctors and managers, may soon be legally compelled to use explainable ML models. We argue that the importance of explainability reaches far beyond data protection law, and crucially influences questions of contractual and tort liability for the use of ML models. To this effect, we conduct two legal case studies, in medical and corporate merger applications of ML. As a second contribution, we discuss the (legally required) trade-off between accuracy and explainability and demonstrate the effect in a technical case study in the context of spam classification.

Destructiveness of lexicographic parsimony pressure and alleviation by a concatenation crossover in genetic programming (2020)

Kötzing, Timo ; Lagodzinski, Gregor J. A. ; Lengler, Johannes ; Melnichenko, Anna

For theoretical analyses there are two specifics distinguishing GP from many other areas of evolutionary computation: the variable size representations, in particular yielding a possible bloat (i.e. the growth of individuals with redundant parts); and also the role and the realization of crossover, which is particularly central in GP due to the tree-based representation. Whereas some theoretical work on GP has studied the effects of bloat, crossover had surprisingly little share in this work. We analyze a simple crossover operator in combination with randomized local search, where a preference for small solutions minimizes bloat (lexicographic parsimony pressure); we denote the resulting algorithm Concatenation Crossover GP. We consider three variants of the well-studied MAJORITY test function, adding large plateaus in different ways to the fitness landscape and thus giving a test bed for analyzing the interplay of variation operators and bloat control mechanisms in a setting with local optima. We show that the Concatenation Crossover GP can efficiently optimize these test functions, while local search cannot be efficient for all three variants independent of employing bloat control. (C) 2019 Elsevier B.V. All rights reserved.

Bridge damage (2020)

Isailović, Dušan ; Stojanovic, Vladeta ; Trapp, Matthias ; Richter, Rico ; Hajdin, Rade ; Döllner, Jürgen Roland Friedrich

Building Information Modeling (BIM) representations of bridges enriched by inspection data will add tremendous value to future Bridge Management Systems (BMSs). This paper presents an approach for point cloud-based detection of spalling damage, as well as integrating damage components into a BIM via semantic enrichment of an as-built Industry Foundation Classes (IFC) model. An approach for generating the as-built BIM, geometric reconstruction of detected damage point clusters and semantic-enrichment of the corresponding IFC model is presented. Multiview-classification is used and evaluated for the detection of spalling damage features. The semantic enrichment of as-built IFC models is based on injecting classified and reconstructed damage clusters back into the as-built IFC, thus generating an accurate as-is IFC model compliant to the BMS inspection requirements.

Preface to the special issue on the 11th International Conference on Graph Transformation (2020)

Lambers, Leen ; Weber, Jens

This special issue contains extended versions of four selected papers from the 11th International Conference on Graph Transformation (ICGT 2018). The articles cover a tool for computing core graphs via SAT/SMT solvers (graph language definition), graph transformation through graph surfing in reaction systems (a new graph transformation formalism), the essence and initiality of conflicts in M-adhesive transformation systems, and a calculus of concurrent graph-rewriting processes (theory on conflicts and parallel independence).

Mary, Hugo, and Hugo* (2020)

Thamsen, Lauritz ; Beilharz, Jossekin Jakob ; Vinh Thuy Tran ; Nedelkoski, Sasho ; Kao, Odej

Distributed data-parallel processing systems like MapReduce, Spark, and Flink are popular for analyzing large datasets using cluster resources. Resource management systems like YARN or Mesos in turn allow multiple data-parallel processing jobs to share cluster resources in temporary containers. Often, the containers do not isolate resource usage to achieve high degrees of overall resource utilization despite overprovisioning and the often fluctuating utilization of specific jobs. However, some combinations of jobs utilize resources better and interfere less with each other when running on the same shared nodes than others. This article presents an approach for improving the resource utilization and job throughput when scheduling recurring distributed data-parallel processing jobs in shared clusters. The approach is based on reinforcement learning and a measure of co-location goodness to have cluster schedulers learn over time which jobs are best executed together on shared resources. We evaluated this approach over the last years with three prototype schedulers that build on each other: Mary, Hugo, and Hugo*. For the evaluation we used exemplary Flink and Spark jobs from different application domains and clusters of commodity nodes managed by YARN. The results of these experiments show that our approach can increase resource utilization and job throughput significantly.

Lower bounds on the run time of the Univariate Marginal Distribution Algorithm on OneMax (2020)

Krejca, Martin S. ; Witt, Carsten

The Univariate Marginal Distribution Algorithm (UMDA) - a popular estimation-of-distribution algorithm - is studied from a run time perspective. On the classical OneMax benchmark function on bit strings of length n, a lower bound of Omega(lambda + mu root n + n logn), where mu and lambda are algorithm-specific parameters, on its expected run time is proved. This is the first direct lower bound on the run time of UMDA. It is stronger than the bounds that follow from general black-box complexity theory and is matched by the run time of many evolutionary algorithms. The results are obtained through advanced analyses of the stochastic change of the frequencies of bit values maintained by the algorithm, including carefully designed potential functions. These techniques may prove useful in advancing the field of run time analysis for estimation-of-distribution algorithms in general.

Hitting set enumeration with partial information for unique column combination discovery (2020)

Birnick, Johann ; Bläsius, Thomas ; Friedrich, Tobias ; Naumann, Felix ; Papenbrock, Thorsten ; Schirneck, Friedrich Martin

Unique column combinations (UCCs) are a fundamental concept in relational databases. They identify entities in the data and support various data management activities. Still, UCCs are usually not explicitly defined and need to be discovered. State-of-the-art data profiling algorithms are able to efficiently discover UCCs in moderately sized datasets, but they tend to fail on large and, in particular, on wide datasets due to run time and memory limitations. In this paper, we introduce HPIValid, a novel UCC discovery algorithm that implements a faster and more resource-saving search strategy. HPIValid models the metadata discovery as a hitting set enumeration problem in hypergraphs. In this way, it combines efficient discovery techniques from data profiling research with the most recent theoretical insights into enumeration algorithms. Our evaluation shows that HPIValid is not only orders of magnitude faster than related work, it also has a much smaller memory footprint.

On the complexity of the smallest grammar problem over fixed alphabets (2020)

Casel, Katrin ; Fernau, Henning ; Gaspers, Serge ; Gras, Benjamin ; Schmid, Markus L.

In the smallest grammar problem, we are given a word w and we want to compute a preferably small context-free grammar G for the singleton language {w} (where the size of a grammar is the sum of the sizes of its rules, and the size of a rule is measured by the length of its right side). It is known that, for unbounded alphabets, the decision variant of this problem is NP-hard and the optimisation variant does not allow a polynomial-time approximation scheme, unless P = NP. We settle the long-standing open problem whether these hardness results also hold for the more realistic case of a constant-size alphabet. More precisely, it is shown that the smallest grammar problem remains NP-complete (and its optimisation version is APX-hard), even if the alphabet is fixed and has size of at least 17. The corresponding reduction is robust in the sense that it also works for an alternative size-measure of grammars that is commonly used in the literature (i. e., a size measure also taking the number of rules into account), and it also allows to conclude that even computing the number of rules required by a smallest grammar is a hard problem. On the other hand, if the number of nonterminals (or, equivalently, the number of rules) is bounded by a constant, then the smallest grammar problem can be solved in polynomial time, which is shown by encoding it as a problem on graphs with interval structure. However, treating the number of rules as a parameter (in terms of parameterised complexity) yields W[1]-hardness. Furthermore, we present an O(3(vertical bar w vertical bar)) exact exponential-time algorithm, based on dynamic programming. These three main questions are also investigated for 1-level grammars, i. e., grammars for which only the start rule contains nonterminals on the right side; thus, investigating the impact of the "hierarchical depth" of grammars on the complexity of the smallest grammar problem. In this regard, we obtain for 1-level grammars similar, but slightly stronger results.

Human centered AI design for clinical monitoring and data management (2020)

Adnan, Hassan Sami ; Matthews, Sam ; Hackl, M. ; Das, P. P. ; Manaswini, Manisha ; Gadamsetti, S. ; Filali, Maroua ; Owoyele, Babajide ; Santuber, Joaquín ; Edelman, Jonathan

In clinical settings, significant resources are spent on data collection and monitoring patients' health parameters to improve decision-making and provide better care. With increased digitization, the healthcare sector is shifting towards implementing digital technologies for data management and in administration. New technologies offer better treatment opportunities and streamline clinical workflow, but the complexity can cause ineffectiveness, frustration, and errors. To address this, we believe digital solutions alone are not sufficient. Therefore, we take a human-centred design approach for AI development, and apply systems engineering methods to identify system leverage points. We demonstrate how automation enables monitoring clinical parameters, using existing non-intrusive sensor technology, resulting in more resources toward patient care. Furthermore, we provide a framework on digitization of clinical data for integration with data management.

Mobile phone ownership and willingness to receive mHealth services among patients with diabetes mellitus in South-West, Nigeria (2020)

Olamoyegun, Michael Adeyemi ; Raimi, Taiwo Hassan ; Ala, Oluwabukola Ayodele ; Fadare, Joseph Olusesan

Introduction: mobile phone technology is increasingly used to overcome traditional barriers to limiting access to diabetes care. This study evaluated mobile phone ownership and willingness to receive and pay for mobile phone-based diabetic services among people with diabetes in South-West, Nigeria. Methods: two hundred and fifty nine patients with diabetes were consecutively recruited from three tertiary health institutions in South-West, Nigeria. Questionnaire was used to evaluate mobile phone ownership, willingness to receive and pay for mobile phone-based diabetic health care services via voice call and text messaging. Results: 97.3% owned a mobile phone, with 38.9% and 61.1% owning smartphone and basic phone respectively. Males were significantly more willing to receive mobile-phone-based health services than females (81.1% vs 68.1%, p=0.025), likewise married compared to unmarried [77.4% vs 57.1%, p=0.0361. Voice calls (41.3%) and text messages (32.4%), were the most preferred modes of receiving diabetes-related health education with social media (3.1%) and email (1.5%) least. Almost three-quarter of participants (72.6%) who owned mobile phone, were willing to receive mobile phone-based diabetes health services. The educational status of patients (adjusted OR [AORJ: 1.7(95% CI: 1.6 to 2.11), glucometers possession (ACM: 2.0 [95% CI: 1.9 to 2.1) and type of mobile phone owned (AOR: 2.9 [95% CI: 2.8 to 5.0]) were significantly associated with the willingness to receive mobile phone-based diabetic services. Conclusion: the majority of study participants owned mobile phones and would be willing to receive and pay for diabetes-related healthcare delivery services provided the cost is minimal and affordable.

Cyber threat intelligence: A product without a process? (2020)

Oosthoek, Kris ; Doerr, Christian

Purely attention based local feature integration for video classification (2020)

Long, Xiang ; de Melo, Gerard ; He, Dongliang ; Li, Fu ; Chi, Zhizhen ; Wen, Shilei ; Gan, Chuang

Recently, substantial research effort has focused on how to apply CNNs or RNNs to better capture temporal patterns in videos, so as to improve the accuracy of video classification. In this paper, we investigate the potential of a purely attention based local feature integration. Accounting for the characteristics of such features in video classification, we first propose Basic Attention Clusters (BAC), which concatenates the output of multiple attention units applied in parallel, and introduce a shifting operation to capture more diverse signals. Experiments show that BAC can achieve excellent results on multiple datasets. However, BAC treats all feature channels as an indivisible whole, which is suboptimal for achieving a finer-grained local feature integration over the channel dimension. Additionally, it treats the entire local feature sequence as an unordered set, thus ignoring the sequential relationships. To improve over BAC, we further propose the channel pyramid attention schema by splitting features into sub-features at multiple scales for coarse-to-fine sub-feature interaction modeling, and propose the temporal pyramid attention schema by dividing the feature sequences into ordered sub-sequences of multiple lengths to account for the sequential order. Our final model pyramidxpyramid attention clusters (PPAC) combines both channel pyramid attention and temporal pyramid attention to focus on the most important sub-features, while also preserving the temporal information of the video. We demonstrate the effectiveness of PPAC on seven real-world video classification datasets. Our model achieves competitive results across all of these, showing that our proposed framework can consistently outperform the existing local feature integration methods across a range of different scenarios.

The impact of lexicographic parsimony pressure for ORDER/MAJORITY on the run time (2020)

Doerr, Benjamin ; Kötzing, Timo ; Lagodzinski, Gregor J. A. ; Lengler, Johannes

While many optimization problems work with a fixed number of decision variables and thus a fixed-length representation of possible solutions, genetic programming (GP) works on variable-length representations. A naturally occurring problem is that of bloat, that is, the unnecessary growth of solution lengths, which may slow down the optimization process. So far, the mathematical runtime analysis could not deal well with bloat and required explicit assumptions limiting bloat. In this paper, we provide the first mathematical runtime analysis of a GP algorithm that does not require any assumptions on the bloat. Previous performance guarantees were only proven conditionally for runs in which no strong bloat occurs. Together with improved analyses for the case with bloat restrictions our results show that such assumptions on the bloat are not necessary and that the algorithm is efficient without explicit bloat control mechanism. More specifically, we analyzed the performance of the (1 + 1) GP on the two benchmark functions ORDER and MAJORITY. When using lexicographic parsimony pressure as bloat control, we show a tight runtime estimate of O(T-init + nlogn) iterations both for ORDER and MAJORITY. For the case without bloat control, the bounds O(T-init logT(i)(nit) + n(logn)(3)) and Omega(T-init + nlogn) (and Omega(T-init log T-init) for n = 1) hold for MAJORITY(1).

Significance-based estimation-of-distribution algorithms (2020)

Doerr, Benjamin ; Krejca, Martin S.

Estimation-of-distribution algorithms (EDAs) are randomized search heuristics that create a probabilistic model of the solution space, which is updated iteratively, based on the quality of the solutions sampled according to the model. As previous works show, this iteration-based perspective can lead to erratic updates of the model, in particular, to bit-frequencies approaching a random boundary value. In order to overcome this problem, we propose a new EDA based on the classic compact genetic algorithm (cGA) that takes into account a longer history of samples and updates its model only with respect to information which it classifies as statistically significant. We prove that this significance-based cGA (sig-cGA) optimizes the commonly regarded benchmark functions OneMax (OM), LeadingOnes, and BinVal all in quasilinear time, a result shown for no other EDA or evolutionary algorithm so far. For the recently proposed stable compact genetic algorithm-an EDA that tries to prevent erratic model updates by imposing a bias to the uniformly distributed model-we prove that it optimizes OM only in a time exponential in its hypothetical population size. Similarly, we show that the convex search algorithm cannot optimize OM in polynomial time.

Using AI for mental health analysis and prediction in school surveys (2020)

Adnan, Hassan Sami ; Srsic, Amanda ; Venticich, Pete Milos ; Townend, David M.R.

Background: Childhood and adolescence are critical stages of life for mental health and well-being. Schools are a key setting for mental health promotion and illness prevention. One in five children and adolescents have a mental disorder, about half of mental disorders beginning before the age of 14. Beneficial and explainable artificial intelligence can replace current paper- based and online approaches to school mental health surveys. This can enhance data acquisition, interoperability, data driven analysis, trust and compliance. This paper presents a model for using chatbots for non-obtrusive data collection and supervised machine learning models for data analysis; and discusses ethical considerations pertaining to the use of these models. Methods: For data acquisition, the proposed model uses chatbots which interact with students. The conversation log acts as the source of raw data for the machine learning. Pre-processing of the data is automated by filtering for keywords and phrases. Existing survey results, obtained through current paper-based data collection methods, are evaluated by domain experts (health professionals). These can be used to create a test dataset to validate the machine learning models. Supervised learning can then be deployed to classify specific behaviour and mental health patterns. Results: We present a model that can be used to improve upon current paper-based data collection and manual data analysis methods. An open-source GitHub repository contains necessary tools and components of this model. Privacy is respected through rigorous observance of confidentiality and data protection requirements. Critical reflection on these ethics and law aspects is included in the project. Conclusions: This model strengthens mental health surveillance in schools. The same tools and components could be applied to other public health data. Future extensions of this model could also incorporate unsupervised learning to find clusters and patterns of unknown effects.

Improving scalability and reward of utility-driven self-healing for large dynamic architectures (2020)

Ghahremani, Sona ; Giese, Holger ; Vogel, Thomas

Self-adaptation can be realized in various ways. Rule-based approaches prescribe the adaptation to be executed if the system or environment satisfies certain conditions. They result in scalable solutions but often with merely satisfying adaptation decisions. In contrast, utility-driven approaches determine optimal decisions by using an often costly optimization, which typically does not scale for large problems. We propose a rule-based and utility-driven adaptation scheme that achieves the benefits of both directions such that the adaptation decisions are optimal, whereas the computation scales by avoiding an expensive optimization. We use this adaptation scheme for architecture-based self-healing of large software systems. For this purpose, we define the utility for large dynamic architectures of such systems based on patterns that define issues the self-healing must address. Moreover, we use pattern-based adaptation rules to resolve these issues. Using a pattern-based scheme to define the utility and adaptation rules allows us to compute the impact of each rule application on the overall utility and to realize an incremental and efficient utility-driven self-healing. In addition to formally analyzing the computational effort and optimality of the proposed scheme, we thoroughly demonstrate its scalability and optimality in terms of reward in comparative experiments with a static rule-based approach as a baseline and a utility-driven approach using a constraint solver. These experiments are based on different failure profiles derived from real-world failure logs. We also investigate the impact of different failure profile characteristics on the scalability and reward to evaluate the robustness of the different approaches.

Evaluation of self-healing systems (2020)

Ghahremani, Sona ; Giese, Holger

Evaluating the performance of self-adaptive systems is challenging due to their interactions with often highly dynamic environments. In the specific case of self-healing systems, the performance evaluations of self-healing approaches and their parameter tuning rely on the considered characteristics of failure occurrences and the resulting interactions with the self-healing actions. In this paper, we first study the state-of-the-art for evaluating the performances of self-healing systems by means of a systematic literature review. We provide a classification of different input types for such systems and analyse the limitations of each input type. A main finding is that the employed inputs are often not sophisticated regarding the considered characteristics for failure occurrences. To further study the impact of the identified limitations, we present experiments demonstrating that wrong assumptions regarding the characteristics of the failure occurrences can result in large performance prediction errors, disadvantageous design-time decisions concerning the selection of alternative self-healing approaches, and disadvantageous deployment-time decisions concerning parameter tuning. Furthermore, the experiments indicate that employing multiple alternative input characteristics can help with reducing the risk of premature disadvantageous design-time decisions.

CloudStrike (2020)

Torkura, Kennedy A. ; Sukmana, Muhammad Ihsan Haikal ; Cheng, Feng ; Meinel, Christoph

Most cyber-attacks and data breaches in cloud infrastructure are due to human errors and misconfiguration vulnerabilities. Cloud customer-centric tools are imperative for mitigating these issues, however existing cloud security models are largely unable to tackle these security challenges. Therefore, novel security mechanisms are imperative, we propose Risk-driven Fault Injection (RDFI) techniques to address these challenges. RDFI applies the principles of chaos engineering to cloud security and leverages feedback loops to execute, monitor, analyze and plan security fault injection campaigns, based on a knowledge-base. The knowledge-base consists of fault models designed from secure baselines, cloud security best practices and observations derived during iterative fault injection campaigns. These observations are helpful for identifying vulnerabilities while verifying the correctness of security attributes (integrity, confidentiality and availability). Furthermore, RDFI proactively supports risk analysis and security hardening efforts by sharing security information with security mechanisms. We have designed and implemented the RDFI strategies including various chaos engineering algorithms as a software tool: CloudStrike. Several evaluations have been conducted with CloudStrike against infrastructure deployed on two major public cloud infrastructure: Amazon Web Services and Google Cloud Platform. The time performance linearly increases, proportional to increasing attack rates. Also, the analysis of vulnerabilities detected via security fault injection has been used to harden the security of cloud resources to demonstrate the effectiveness of the security information provided by CloudStrike. Therefore, we opine that our approaches are suitable for overcoming contemporary cloud security issues.

Generative multi-adversarial network for striking the right balance in abdominal image segmentation (2020)

Rezaei, Mina ; Näppi, Janne J. ; Lippert, Christoph ; Meinel, Christoph ; Yoshida, Hiroyuki

Purpose: The identification of abnormalities that are relatively rare within otherwise normal anatomy is a major challenge for deep learning in the semantic segmentation of medical images. The small number of samples of the minority classes in the training data makes the learning of optimal classification challenging, while the more frequently occurring samples of the majority class hamper the generalization of the classification boundary between infrequently occurring target objects and classes. In this paper, we developed a novel generative multi-adversarial network, called Ensemble-GAN, for mitigating this class imbalance problem in the semantic segmentation of abdominal images. Method: The Ensemble-GAN framework is composed of a single-generator and a multi-discriminator variant for handling the class imbalance problem to provide a better generalization than existing approaches. The ensemble model aggregates the estimates of multiple models by training from different initializations and losses from various subsets of the training data. The single generator network analyzes the input image as a condition to predict a corresponding semantic segmentation image by use of feedback from the ensemble of discriminator networks. To evaluate the framework, we trained our framework on two public datasets, with different imbalance ratios and imaging modalities: the Chaos 2019 and the LiTS 2017. Result: In terms of the F1 score, the accuracies of the semantic segmentation of healthy spleen, liver, and left and right kidneys were 0.93, 0.96, 0.90 and 0.94, respectively. The overall F1 scores for simultaneous segmentation of the lesions and liver were 0.83 and 0.94, respectively. Conclusion: The proposed Ensemble-GAN framework demonstrated outstanding performance in the semantic segmentation of medical images in comparison with other approaches on popular abdominal imaging benchmarks. The Ensemble-GAN has the potential to segment abdominal images more accurately than human experts.

What are we talking about when we talk about technology pivots? (2020)

Bohn, Nicolai ; Kundisch, Dennis

Technology pivots were designed to help digital startups make adjustments to the technology underpinning their products and services. While academia and the media make liberal use of the term "technology pivot," they rarely align themselves to Ries' foundational conceptualization. Recent research suggests that a more granulated conceptualization of technology pivots is required. To scientifically derive a comprehensive conceptualization, we conduct a Delphi study with a panel of 38 experts drawn from academia and practice to explore their understanding of "technology pivots." Our study thus makes an important contribution to advance the seminal work by Ries on technology pivots.

Affect-aware word clouds (2020)

Kulahcioglu, Tugba ; Melo, Gerard de

Word clouds are widely used for non-analytic purposes, such as introducing a topic to students, or creating a gift with personally meaningful text. Surveys show that users prefer tools that yield word clouds with a stronger emotional impact. Fonts and color palettes are powerful typographical signals that may determine this impact. Typically, these signals are assigned randomly, or expected to be chosen by the users. We present an affect-aware font and color palette selection methodology that aims to facilitate more informed choices. We infer associations of fonts with a set of eight affects, and evaluate the resulting data in a series of user studies both on individual words as well as in word clouds. Relying on a recent study to procure affective color palettes, we carry out a similar user study to understand the impact of color choices on word clouds. Our findings suggest that both fonts and color palettes are powerful tools contributing to the affects evoked by a word cloud. The experiments further confirm that the novel datasets we propose are successful in enabling this. We also find that, for the majority of the affects, both signals need to be congruent to create a stronger impact. Based on this data, we implement a prototype that allows users to specify a desired affect and recommends congruent fonts and color palettes for the word.

A distributed data exchange engine for polystores (2020)

Kaitoua, Abdulrahman ; Rabl, Tilmann ; Markl, Volker

There is an increasing interest in fusing data from heterogeneous sources. Combining data sources increases the utility of existing datasets, generating new information and creating services of higher quality. A central issue in working with heterogeneous sources is data migration: In order to share and process data in different engines, resource intensive and complex movements and transformations between computing engines, services, and stores are necessary. Muses is a distributed, high-performance data migration engine that is able to interconnect distributed data stores by forwarding, transforming, repartitioning, or broadcasting data among distributed engines' instances in a resource-, cost-, and performance-adaptive manner. As such, it performs seamless information sharing across all participating resources in a standard, modular manner. We show an overall improvement of 30 % for pipelining jobs across multiple engines, even when we count the overhead of Muses in the execution time. This performance gain implies that Muses can be used to optimise large pipelines that leverage multiple engines.

Quantifying TPC-H choke points and their optimizations (2020)

Dreseler, Markus ; Boissier, Martin ; Rabl, Tilmann ; Uflacker, Matthias

TPC-H continues to be the most widely used benchmark for relational OLAP systems. It poses a number of challenges, also known as "choke points", which database systems have to solve in order to achieve good benchmark results. Examples include joins across multiple tables, correlated subqueries, and correlations within the TPC-H data set. Knowing the impact of such optimizations helps in developing optimizers as well as in interpreting TPC-H results across database systems. This paper provides a systematic analysis of choke points and their optimizations. It complements previous work on TPC-H choke points by providing a quantitative discussion of their relevance. It focuses on eleven choke points where the optimizations are beneficial independently of the database system. Of these, the flattening of subqueries and the placement of predicates have the biggest impact. Three queries (Q2, Q17, and Q21) are strongly ifluenced by the choice of an efficient query plan; three others (Q1, Q13, and Q18) are less influenced by plan optimizations and more dependent on an efficient execution engine.

The stable marriage problem with ties and restricted edges (2020)

Cseh, Agnes ; Heeger, Klaus

In the stable marriage problem, a set of men and a set of women are given, each of whom has a strictly ordered preference list over the acceptable agents in the opposite class. A matching is called stable if it is not blocked by any pair of agents, who mutually prefer each other to their respective partner. Ties in the preferences allow for three different definitions for a stable matching: weak, strong and super-stability. Besides this, acceptable pairs in the instance can be restricted in their ability of blocking a matching or being part of it, which again generates three categories of restrictions on acceptable pairs. Forced pairs must be in a stable matching, forbidden pairs must not appear in it, and lastly, free pairs cannot block any matching. Our computational complexity study targets the existence of a stable solution for each of the three stability definitions, in the presence of each of the three types of restricted pairs. We solve all cases that were still open. As a byproduct, we also derive that the maximum size weakly stable matching problem is hard even in very dense graphs, which may be of independent interest.

The complexity of cake cutting with unequal shares (2020)

Cseh, Ágnes ; Fleiner, Tamas

An unceasing problem of our prevailing society is the fair division of goods. The problem of proportional cake cutting focuses on dividing a heterogeneous and divisible resource, the cake, among n players who value pieces according to their own measure function. The goal is to assign each player a not necessarily connected part of the cake that the player evaluates at least as much as her proportional share. In this article, we investigate the problem of proportional division with unequal shares, where each player is entitled to receive a predetermined portion of the cake. Our main contribution is threefold. First we present a protocol for integer demands, which delivers a proportional solution in fewer queries than all known protocols. By giving a matching lower bound, we then show that our protocol is asymptotically the fastest possible. Finally, we turn to irrational demands and solve the proportional cake cutting problem by reducing it to the same problem with integer demands only. All results remain valid in a highly general cake cutting model, which can be of independent interest.

Machine learning to predict mortality and critical events in a cohort of patients with COVID-19 in New York City: model development and validation (2020)

Vaid, Akhil ; Somani, Sulaiman ; Russak, Adam J. ; De Freitas, Jessica K. ; Chaudhry, Fayzan F. ; Paranjpe, Ishan ; Johnson, Kipp W. ; Lee, Samuel J. ; Miotto, Riccardo ; Richter, Felix ; Zhao, Shan ; Beckmann, Noam D. ; Naik, Nidhi ; Kia, Arash ; Timsina, Prem ; Lala, Anuradha ; Paranjpe, Manish ; Golden, Eddye ; Danieletto, Matteo ; Singh, Manbir ; Meyer, Dara ; O'Reilly, Paul F. ; Huckins, Laura ; Kovatch, Patricia ; Finkelstein, Joseph ; Freeman, Robert M. ; Argulian, Edgar ; Kasarskis, Andrew ; Percha, Bethany ; Aberg, Judith A. ; Bagiella, Emilia ; Horowitz, Carol R. ; Murphy, Barbara ; Nestler, Eric J. ; Schadt, Eric E. ; Cho, Judy H. ; Cordon-Cardo, Carlos ; Fuster, Valentin ; Charney, Dennis S. ; Reich, David L. ; Böttinger, Erwin ; Levin, Matthew A. ; Narula, Jagat ; Fayad, Zahi A. ; Just, Allan C. ; Charney, Alexander W. ; Nadkarni, Girish N. ; Glicksberg, Benjamin S.

Background: COVID-19 has infected millions of people worldwide and is responsible for several hundred thousand fatalities. The COVID-19 pandemic has necessitated thoughtful resource allocation and early identification of high-risk patients. However, effective methods to meet these needs are lacking. Objective: The aims of this study were to analyze the electronic health records (EHRs) of patients who tested positive for COVID-19 and were admitted to hospitals in the Mount Sinai Health System in New York City; to develop machine learning models for making predictions about the hospital course of the patients over clinically meaningful time horizons based on patient characteristics at admission; and to assess the performance of these models at multiple hospitals and time points. Methods: We used Extreme Gradient Boosting (XGBoost) and baseline comparator models to predict in-hospital mortality and critical events at time windows of 3, 5, 7, and 10 days from admission. Our study population included harmonized EHR data from five hospitals in New York City for 4098 COVID-19-positive patients admitted from March 15 to May 22, 2020. The models were first trained on patients from a single hospital (n=1514) before or on May 1, externally validated on patients from four other hospitals (n=2201) before or on May 1, and prospectively validated on all patients after May 1 (n=383). Finally, we established model interpretability to identify and rank variables that drive model predictions. Results: Upon cross-validation, the XGBoost classifier outperformed baseline models, with an area under the receiver operating characteristic curve (AUC-ROC) for mortality of 0.89 at 3 days, 0.85 at 5 and 7 days, and 0.84 at 10 days. XGBoost also performed well for critical event prediction, with an AUC-ROC of 0.80 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. In external validation, XGBoost achieved an AUC-ROC of 0.88 at 3 days, 0.86 at 5 days, 0.86 at 7 days, and 0.84 at 10 days for mortality prediction. Similarly, the unimputed XGBoost model achieved an AUC-ROC of 0.78 at 3 days, 0.79 at 5 days, 0.80 at 7 days, and 0.81 at 10 days. Trends in performance on prospective validation sets were similar. At 7 days, acute kidney injury on admission, elevated LDH, tachypnea, and hyperglycemia were the strongest drivers of critical event prediction, while higher age, anion gap, and C-reactive protein were the strongest drivers of mortality prediction. Conclusions: We externally and prospectively trained and validated machine learning models for mortality and critical events for patients with COVID-19 at different time horizons. These models identified at-risk patients and uncovered underlying relationships that predicted outcomes.

Coronavirus 2019 and people living with human immunodeficiency virus (2020)

Background: There are limited data regarding the clinical impact of coronavirus disease 2019 (COVID-19) on people living with human immunodeficiency virus (PLWH). In this study, we compared outcomes for PLWH with COVID-19 to a matched comparison group. Methods: We identified 88 PLWH hospitalized with laboratory-confirmed COVID-19 in our hospital system in New York City between 12 March and 23 April 2020. We collected data on baseline clinical characteristics, laboratory values, HIV status, treatment, and outcomes from this group and matched comparators (1 PLWH to up to 5 patients by age, sex, race/ethnicity, and calendar week of infection). We compared clinical characteristics and outcomes (death, mechanical ventilation, hospital discharge) for these groups, as well as cumulative incidence of death by HIV status. Results: Patients did not differ significantly by HIV status by age, sex, or race/ethnicity due to the matching algorithm. PLWH hospitalized with COVID-19 had high proportions of HIV virologic control on antiretroviral therapy. PLWH had greater proportions of smoking (P < .001) and comorbid illness than uninfected comparators. There was no difference in COVID-19 severity on admission by HIV status (P = .15). Poor outcomes for hospitalized PLWH were frequent but similar to proportions in comparators; 18% required mechanical ventilation and 21% died during follow-up (compared with 23% and 20%, respectively). There was similar cumulative incidence of death over time by HIV status (P = .94). Conclusions: We found no differences in adverse outcomes associated with HIV infection for hospitalized COVID-19 patients compared with a demographically similar patient group.

Distributed detection of sequential anomalies in univariate time series (2021)

Schneider, Johannes ; Wenig, Phillip ; Papenbrock, Thorsten

The automated detection of sequential anomalies in time series is an essential task for many applications, such as the monitoring of technical systems, fraud detection in high-frequency trading, or the early detection of disease symptoms. All these applications require the detection to find all sequential anomalies possibly fast on potentially very large time series. In other words, the detection needs to be effective, efficient and scalable w.r.t. the input size. Series2Graph is an effective solution based on graph embeddings that are robust against re-occurring anomalies and can discover sequential anomalies of arbitrary length and works without training data. Yet, Series2Graph is no t scalable due to its single-threaded approach; it cannot, in particular, process arbitrarily large sequences due to the memory constraints of a single machine. In this paper, we propose our distributed anomaly detection system, short DADS, which is an efficient and scalable adaptation of Series2Graph. Based on the actor programming model, DADS distributes the input time sequence, intermediate state and the computation to all processors of a cluster in a way that minimizes communication costs and synchronization barriers. Our evaluation shows that DADS is orders of magnitude faster than S2G, scales almost linearly with the number of processors in the cluster and can process much larger input sequences due to its scale-out property.

Transformation rules with nested application conditions (2021)

Lambers, Leen ; Orejas, Fernando

Recently, initial conflicts were introduced in the framework of M-adhesive categories as an important optimization of critical pairs. In particular, they represent a proper subset such that each conflict is represented in a minimal context by a unique initial one. The theory of critical pairs has been extended in the framework of M-adhesive categories to rules with nested application conditions (ACs), restricting the applicability of a rule and generalizing the well-known negative application conditions. A notion of initial conflicts for rules with ACs does not exist yet. In this paper, on the one hand, we extend the theory of initial conflicts in the framework of M-adhesive categories to transformation rules with ACs. They represent a proper subset again of critical pairs for rules with ACs, and represent each conflict in a minimal context uniquely. They are moreover symbolic because we can show that in general no finite and complete set of conflicts for rules with ACs exists. On the other hand, we show that critical pairs are minimally M-complete, whereas initial conflicts are minimally complete. Finally, we introduce important special cases of rules with ACs for which we can obtain finite, minimally (M-)complete sets of conflicts.

Role of individual motivations and privacy concerns in the adoption of German electronic patient record apps (2021)

Henkenjohann, Richard

Germany's electronic patient record ("ePA") launched in 2021 with several attempts and years of delay. The development of such a large-scale project is a complex task, and so is its adoption. Individual attitudes towards an electronic health record are crucial, as individuals can reject opting-in to it and making any national efforts unachievable. Although the integration of an electronic health record serves potential benefits, it also constitutes risks for an individual's privacy. With a mixed-methods study design, this work provides evidence that different types of motivations and contextual privacy antecedents affect usage intentions towards the ePA. Most significantly, individual motivations stemming from feelings of volition or external mandates positively affect ePA adoption, although internal incentives are more powerful.

Discovering relaxed functional dependencies based on multi-attribute dominance (2021)

Caruccio, Loredana ; Deufemia, Vincenzo ; Naumann, Felix ; Polese, Giuseppe

With the advent of big data and data lakes, data are often integrated from multiple sources. Such integrated data are often of poor quality, due to inconsistencies, errors, and so forth. One way to check the quality of data is to infer functional dependencies (fds). However, in many modern applications it might be necessary to extract properties and relationships that are not captured through fds, due to the necessity to admit exceptions, or to consider similarity rather than equality of data values. Relaxed fds (rfds) have been introduced to meet these needs, but their discovery from data adds further complexity to an already complex problem, also due to the necessity of specifying similarity and validity thresholds. We propose Domino, a new discovery algorithm for rfds that exploits the concept of dominance in order to derive similarity thresholds of attribute values while inferring rfds. An experimental evaluation on real datasets demonstrates the discovery performance and the effectiveness of the proposed algorithm.

Data dependencies for query optimization (2021)

Koßmann, Jan ; Papenbrock, Thorsten ; Naumann, Felix

Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.

Evolutionary algorithms and submodular functions (2021)

Quinzan, Francesco ; Göbel, Andreas ; Wagner, Markus ; Friedrich, Tobias

A core operator of evolutionary algorithms (EAs) is the mutation. Recently, much attention has been devoted to the study of mutation operators with dynamic and non-uniform mutation rates. Following up on this area of work, we propose a new mutation operator and analyze its performance on the (1 + 1) Evolutionary Algorithm (EA). Our analyses show that this mutation operator competes with pre-existing ones, when used by the (1 + 1) EA on classes of problems for which results on the other mutation operators are available. We show that the (1 + 1) EA using our mutation operator finds a (1/3)-approximation ratio on any non-negative submodular function in polynomial time. We also consider the problem of maximizing a symmetric submodular function under a single matroid constraint and show that the (1 + 1) EA using our operator finds a (1/3)-approximation within polynomial time. This performance matches that of combinatorial local search algorithms specifically designed to solve these problems and outperforms them with constant probability. Finally, we evaluate the performance of the (1 + 1) EA using our operator experimentally by considering two applications: (a) the maximum directed cut problem on real-world graphs of different origins, with up to 6.6 million vertices and 56 million edges and (b) the symmetric mutual information problem using a four month period air pollution data set. In comparison with uniform mutation and a recently proposed dynamic scheme, our operator comes out on top on these instances.

Cyber security threats to bitcoin exchanges (2021)

Oosthoek, Kris ; Dörr, Christian

Bitcoin is gaining traction as an alternative store of value. Its market capitalization transcends all other cryptocurrencies in the market. But its high monetary value also makes it an attractive target to cyber criminal actors. Hacking campaigns usually target an ecosystem's weakest points. In Bitcoin, the exchange platforms are one of them. Each exchange breach is a threat not only to direct victims, but to the credibility of Bitcoin's entire ecosystem. Based on an extensive analysis of 36 breaches of Bitcoin exchanges, we show the attack patterns used to exploit Bitcoin exchange platforms using an industry standard for reporting intelligence on cyber security breaches. Based on this we are able to provide an overview of the most common attack vectors, showing that all except three hacks were possible due to relatively lax security. We show that while the security regimen of Bitcoin exchanges is subpar compared to other financial service providers, the use of stolen credentials, which does not require any hacking, is decreasing. We also show that the amount of BTC taken during a breach is decreasing, as well as the exchanges that terminate after being breached. Furthermore we show that overall security posture has improved, but still has major flaws. To discover adversarial methods post-breach, we have analyzed two cases of BTC laundering. Through this analysis we provide insight into how exchange platforms with lax cyber security even further increase the intermediary risk introduced by them into the Bitcoin ecosystem.

A logic-based incremental approach to graph repair featuring delta preservation (2021)

Schneider, Sven ; Lambers, Leen ; Orejas, Fernando

We introduce a logic-based incremental approach to graph repair, generating a sound and complete (upon termination) overview of least-changing graph repairs from which a user may select a graph repair based on non-formalized further requirements. This incremental approach features delta preservation as it allows to restrict the generation of graph repairs to delta-preserving graph repairs, which do not revert the additions and deletions of the most recent consistency-violating graph update. We specify consistency of graphs using the logic of nested graph conditions, which is equivalent to first-order logic on graphs. Technically, the incremental approach encodes if and how the graph under repair satisfies a graph condition using the novel data structure of satisfaction trees, which are adapted incrementally according to the graph updates applied. In addition to the incremental approach, we also present two state-based graph repair algorithms, which restore consistency of a graph independent of the most recent graph update and which generate additional graph repairs using a global perspective on the graph under repair. We evaluate the developed algorithms using our prototypical implementation in the tool AutoGraph and illustrate our incremental approach using a case study from the graph database domain.

Counting homomorphisms to trees modulo a prime (2021)

Göbel, Andreas ; Lagodzinski, Gregor J. A. ; Seidel, Karen

Many important graph-theoretic notions can be encoded as counting graph homomorphism problems, such as partition functions in statistical physics, in particular independent sets and colourings. In this article, we study the complexity of #(p) HOMSTOH, the problem of counting graph homomorphisms from an input graph to a graph H modulo a prime number p. Dyer and Greenhill proved a dichotomy stating that the tractability of non-modular counting graph homomorphisms depends on the structure of the target graph. Many intractable cases in non-modular counting become tractable in modular counting due to the common phenomenon of cancellation. In subsequent studies on counting modulo 2, however, the influence of the structure of H on the tractability was shown to persist, which yields similar dichotomies. Our main result states that for every tree H and every prime p the problem #pHOMSTOH is either polynomial time computable or #P-p-complete. This relates to the conjecture of Faben and Jerrum stating that this dichotomy holds for every graph H when counting modulo 2. In contrast to previous results on modular counting, the tractable cases of #pHOMSTOH are essentially the same for all values of the modulo when H is a tree. To prove this result, we study the structural properties of a homomorphism. As an important interim result, our study yields a dichotomy for the problem of counting weighted independent sets in a bipartite graph modulo some prime p. These results are the first suggesting that such dichotomies hold not only for the modulo 2 case but also for the modular counting functions of all primes p.

A simplified run time analysis of the univariate marginal distribution algorithm on LeadingOnes (2021)

Doerr, Benjamin ; Krejca, Martin Stefan

With elementary means, we prove a stronger run time guarantee for the univariate marginal distribution algorithm (UMDA) optimizing the LEADINGONES benchmark function in the desirable regime with low genetic drift. If the population size is at least quasilinear, then, with high probability, the UMDA samples the optimum in a number of iterations that is linear in the problem size divided by the logarithm of the UMDA's selection rate. This improves over the previous guarantee, obtained by Dang and Lehre (2015) via the deep level-based population method, both in terms of the run time and by demonstrating further run time gains from small selection rates. Under similar assumptions, we prove a lower bound that matches our upper bound up to constant factors.

Formal framework for checking compliance of data-driven case management (2021)

Haarmann, Stephan ; Holfter, Adrian ; Pufahl, Luise ; Weske, Mathias

Business processes are often specified in descriptive or normative models. Both types of models should adhere to internal and external regulations, such as company guidelines or laws. Employing compliance checking techniques, it is possible to verify process models against rules. While traditionally compliance checking focuses on well-structured processes, we address case management scenarios. In case management, knowledge workers drive multi-variant and adaptive processes. Our contribution is based on the fragment-based case management approach, which splits a process into a set of fragments. The fragments are synchronized through shared data but can, otherwise, be dynamically instantiated and executed. We formalize case models using Petri nets. We demonstrate the formalization for design-time and run-time compliance checking and present a proof-of-concept implementation. The application of the implemented compliance checking approach to a use case exemplifies its effectiveness while designing a case model. The empirical evaluation on a set of case models for measuring the performance of the approach shows that rules can often be checked in less than a second.

Scientific sexism (2021)

Oliveira-Ciabati, Livia ; Loures dos Santos, Luciane ; Hsiou Schmaltz, Annie ; Sasso, Ariane Morassi ; Castro, Margaret de ; Souza, João Paulo

OBJECTIVE: To investigate gender inequity in the scientific production of the University of Sao Paulo. METHODS: Members of the University of Sao Paulo faculty are the study population. The Web of Science repository was the source of the publication metrics. We selected the measures: total publications and citations, average of citations per year and item, H-index, and history of citations between 1950 and 2019. We used the name of the faculty member as a proxy to the gender identity. We use descriptive statistics to characterize the metrics. We evaluated the scissors effect by selecting faculty members with a high H-index. The historical series of citations was projected until 2100. We carry out analyses for the general population and working time subgroups: less than 10 years, 10 to 20 years, and 20 years or more. RESULTS: Of the 8,325 faculty members, we included 3,067 (36.8%). Among those included, 1,893 (61.7%) were male and 1,174 (38.28%) female. The male gender presented higher values in the publication metrics (average of articles: M = 67.0 versus F = 49.7; average of citations/year: M = 53.9 versus F = 35.9), and H-index (M = 14.5 versus F = 12.4). Among the 100 individuals with the highest H-index (>= 37), 83% are male. The male curve grows faster in the historical series of citations, opening a difference between the groups whose separation is confirmed by the projection. DISCUSSION: Scientific production at the Universidade de Sao Paulo is subject to a gender bias. Two-thirds of the faculty are male, and hiring over the past few decades perpetuates this pattern. The large majority of high impact faculty members are male. CONCLUSION: Our analysis suggests that the Universidade de Sao Paulo will not overcome gender inequality in scientific production without substantive affirmative action. Development does not happen by chance but through choices that are affirmative, decisive, and long-term oriented.

Interactive photo editing on smartphones via intrinsic decomposition (2021)

Shekhar, Sumit ; Reimann, Max ; Mayer, Maximilian ; Semmo, Amir ; Pasewaldt, Sebastian ; Döllner, Jürgen ; Trapp, Matthias

Intrinsic decomposition refers to the problem of estimating scene characteristics, such as albedo and shading, when one view or multiple views of a scene are provided. The inverse problem setting, where multiple unknowns are solved given a single known pixel-value, is highly under-constrained. When provided with correlating image and depth data, intrinsic scene decomposition can be facilitated using depth-based priors, which nowadays is easy to acquire with high-end smartphones by utilizing their depth sensors. In this work, we present a system for intrinsic decomposition of RGB-D images on smartphones and the algorithmic as well as design choices therein. Unlike state-of-the-art methods that assume only diffuse reflectance, we consider both diffuse and specular pixels. For this purpose, we present a novel specularity extraction algorithm based on a multi-scale intensity decomposition and chroma inpainting. At this, the diffuse component is further decomposed into albedo and shading components. We use an inertial proximal algorithm for non-convex optimization (iPiano) to ensure albedo sparsity. Our GPU-based visual processing is implemented on iOS via the Metal API and enables interactive performance on an iPhone 11 Pro. Further, a qualitative evaluation shows that we are able to obtain high-quality outputs. Furthermore, our proposed approach for specularity removal outperforms state-of-the-art approaches for real-world images, while our albedo and shading layer decomposition is faster than the prior work at a comparable output quality. Manifold applications such as recoloring, retexturing, relighting, appearance editing, and stylization are shown, each using the intrinsic layers obtained with our method and/or the corresponding depth data.

A systematic review of diagnostic accuracy and clinical applications of wearable movement sensors for knee joint rehabilitation (2021)

Prill, Robert ; Walter, Marina ; Królikowska, Aleksandra ; Becker, Roland

In clinical practice, only a few reliable measurement instruments are available for monitoring knee joint rehabilitation. Advances to replace motion capturing with sensor data measurement have been made in the last years. Thus, a systematic review of the literature was performed, focusing on the implementation, diagnostic accuracy, and facilitators and barriers of integrating wearable sensor technology in clinical practices based on a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. For critical appraisal, the COSMIN Risk of Bias tool for reliability and measurement of error was used. PUBMED, Prospero, Cochrane database, and EMBASE were searched for eligible studies. Six studies reporting reliability aspects in using wearable sensor technology at any point after knee surgery in humans were included. All studies reported excellent results with high reliability coefficients, high limits of agreement, or a few detectable errors. They used different or partly inappropriate methods for estimating reliability or missed reporting essential information. Therefore, a moderate risk of bias must be considered. Further quality criterion studies in clinical settings are needed to synthesize the evidence for providing transparent recommendations for the clinical use of wearable movement sensors in knee joint rehabilitation.

Robust and budget-constrained encoding configurations for in-memory database systems (2021)

Boissier, Martin

Data encoding has been applied to database systems for decades as it mitigates bandwidth bottlenecks and reduces storage requirements. But even in the presence of these advantages, most in-memory database systems use data encoding only conservatively as the negative impact on runtime performance can be severe. Real-world systems with large parts being infrequently accessed and cost efficiency constraints in cloud environments require solutions that automatically and efficiently select encoding techniques, including heavy-weight compression. In this paper, we introduce workload-driven approaches to automaticaly determine memory budget-constrained encoding configurations using greedy heuristics and linear programming. We show for TPC-H, TPC-DS, and the Join Order Benchmark that optimized encoding configurations can reduce the main memory footprint significantly without a loss in runtime performance over state-of-the-art dictionary encoding. To yield robust selections, we extend the linear programming-based approach to incorporate query runtime constraints and mitigate unexpected performance regressions.

Detecting layout templates in complex multiregion files (2021)

Vitagliano, Gerardo ; Jiang, Lan ; Naumann, Felix

Spreadsheets are among the most commonly used file formats for data management, distribution, and analysis. Their widespread employment makes it easy to gather large collections of data, but their flexible canvas-based structure makes automated analysis difficult without heavy preparation. One of the common problems that practitioners face is the presence of multiple, independent regions in a single spreadsheet, possibly separated by repeated empty cells. We define such files as "multiregion" files. In collections of various spreadsheets, we can observe that some share the same layout. We present the Mondrian approach to automatically identify layout templates across multiple files and systematically extract the corresponding regions. Our approach is composed of three phases: first, each file is rendered as an image and inspected for elements that could form regions; then, using a clustering algorithm, the identified elements are grouped to form regions; finally, every file layout is represented as a graph and compared with others to find layout templates. We compare our method to state-of-the-art table recognition algorithms on two corpora of real-world enterprise spreadsheets. Our approach shows the best performances in detecting reliable region boundaries within each file and can correctly identify recurring layouts across files.

‘What will we learn from the current crisis?’ (2021)

Björk, Jennie ; Hölzle, Katharina ; Boer, Harry

The complexity of dependency detection and discovery in relational databases (2021)

Blaesius, Thomas ; Friedrich, Tobias ; Schirneck, Friedrich Martin

Multi-column dependencies in relational databases come associated with two different computational tasks. The detection problem is to decide whether a dependency of a certain type and size holds in a given database, the discovery problem asks to enumerate all valid dependencies of that type. We settle the complexity of both of these problems for unique column combinations (UCCs), functional dependencies (FDs), and inclusion dependencies (INDs). We show that the detection of UCCs and FDs is W[2]-complete when parameterized by the solution size. The discovery of inclusion-wise minimal UCCs is proven to be equivalent under parsimonious reductions to the transversal hypergraph problem of enumerating the minimal hitting sets of a hypergraph. The discovery of FDs is equivalent to the simultaneous enumeration of the hitting sets of multiple input hypergraphs. We further identify the detection of INDs as one of the first natural W[3]-complete problems. The discovery of maximal INDs is shown to be equivalent to enumerating the maximal satisfying assignments of antimonotone, 3-normalized Boolean formulas.

ATIB (2021)

Grüner, Andreas ; Mühle, Alexander ; Meinel, Christoph

Identity management is a principle component of securing online services. In the advancement of traditional identity management patterns, the identity provider remained a Trusted Third Party (TTP). The service provider and the user need to trust a particular identity provider for correct attributes amongst other demands. This paradigm changed with the invention of blockchain-based Self-Sovereign Identity (SSI) solutions that primarily focus on the users. SSI reduces the functional scope of the identity provider to an attribute provider while enabling attribute aggregation. Besides that, the development of new protocols, disregarding established protocols and a significantly fragmented landscape of SSI solutions pose considerable challenges for an adoption by service providers. We propose an Attribute Trust-enhancing Identity Broker (ATIB) to leverage the potential of SSI for trust-enhancing attribute aggregation. Furthermore, ATIB abstracts from a dedicated SSI solution and offers standard protocols. Therefore, it facilitates the adoption by service providers. Despite the brokered integration approach, we show that ATIB provides a high security posture. Additionally, ATIB does not compromise the ten foundational SSI principles for the users.

Integrative biomarker detection on high-dimensional gene expression data sets (2021)

Perscheid, Cindy

Gene expression data provide the expression levels of tens of thousands of genes from several hundred samples. These data are analyzed to detect biomarkers that can be of prognostic or diagnostic use. Traditionally, biomarker detection for gene expression data is the task of gene selection. The vast number of genes is reduced to a few relevant ones that achieve the best performance for the respective use case. Traditional approaches select genes based on their statistical significance in the data set. This results in issues of robustness, redundancy and true biological relevance of the selected genes. Integrative analyses typically address these shortcomings by integrating multiple data artifacts from the same objects, e.g. gene expression and methylation data. When only gene expression data are available, integrative analyses instead use curated information on biological processes from public knowledge bases. With knowledge bases providing an ever-increasing amount of curated biological knowledge, such prior knowledge approaches become more powerful. This paper provides a thorough overview on the status quo of biomarker detection on gene expression data with prior biological knowledge. We discuss current shortcomings of traditional approaches, review recent external knowledge bases, provide a classification and qualitative comparison of existing prior knowledge approaches and discuss open challenges for this kind of gene selection.

Knowledge transfer for entity resolution with siamese neural networks (2021)

Loster, Michael ; Koumarelas, Ioannis ; Naumann, Felix

The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity-duplicates-into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. We propose a deep Siamese neural network, capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. With the properties of deep learning methods, we are able to eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluated our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.

Knowledge bases and software support for variant interpretation in precision oncology (2021)

Borchert, Florian ; Mock, Andreas ; Tomczak, Aurelie ; Hügel, Jonas ; Alkarkoukly, Samer ; Knurr, Alexander ; Volckmar, Anna-Lena ; Stenzinger, Albrecht ; Schirmacher, Peter ; Debus, Jürgen ; Jäger, Dirk ; Longerich, Thomas ; Fröhling, Stefan ; Eils, Roland ; Bougatf, Nina ; Sax, Ulrich ; Schapranow, Matthieu-Patrick

Precision oncology is a rapidly evolving interdisciplinary medical specialty. Comprehensive cancer panels are becoming increasingly available at pathology departments worldwide, creating the urgent need for scalable cancer variant annotation and molecularly informed treatment recommendations. A wealth of mainly academia-driven knowledge bases calls for software tools supporting the multi-step diagnostic process. We derive a comprehensive list of knowledge bases relevant for variant interpretation by a review of existing literature followed by a survey among medical experts from university hospitals in Germany. In addition, we review cancer variant interpretation tools, which integrate multiple knowledge bases. We categorize the knowledge bases along the diagnostic process in precision oncology and analyze programmatic access options as well as the integration of knowledge bases into software tools. The most commonly used knowledge bases provide good programmatic access options and have been integrated into a range of software tools. For the wider set of knowledge bases, access options vary across different parts of the diagnostic process. Programmatic access is limited for information regarding clinical classifications of variants and for therapy recommendations. The main issue for databases used for biological classification of pathogenic variants and pathway context information is the lack of standardized interfaces. There is no single cancer variant interpretation tool that integrates all identified knowledge bases. Specialized tools are available and need to be further developed for different steps in the diagnostic process.

Eatomics (2021)

Kraus, Sara Milena ; Mathew-Stephen, Mariet ; Schapranow, Matthieu-Patrick

Quantitative proteomics data are becoming increasingly more available, and as a consequence are being analyzed and interpreted by a larger group of users. However, many of these users have less programming experience. Furthermore, experimental designs and setups are getting more complicated, especially when tissue biopsies are analyzed. Luckily, the proteomics community has already established some best practices on how to conduct quality control, differential abundance analysis and enrichment analysis. However, an easy-to-use application that wraps together all steps for the exploration and flexible analysis of quantitative proteomics data is not yet available. For Eatomics, we utilize the R Shiny framework to implement carefully chosen parts of established analysis workflows to (i) make them accessible in a user-friendly way, (ii) add a multitude of interactive exploration possibilities, and (iii) develop a unique experimental design setup module, which interactively translates a given research hypothesis into a differential abundance and enrichment analysis formula. In this, we aim to fulfill the needs of a growing group of inexperienced quantitative proteomics data analysts. Eatomics may be tested with demo data directly online via https://we.analyzegenomes.com/now/eatomics/or with the user's own data by installation from the Github repository at https://github.com/Millchmaedchen/Eatomics.

Correction to: Knowledge bases and software support for variant interpretation in precision oncology (2021)

Borchert, Florian ; Mock, Andreas ; Tomczak, Aurelie ; Hügel, Jonas ; Alkarkoukly, Samer ; Knurr, Alexander ; Volckmar, Anna-Lena ; Stenzinger, Albrecht ; Schirmacher, Peter ; Debus, Jürgen ; Jäger, Dirk ; Longerich, Thomas ; Fröhling, Stefan ; Eils, Roland ; Bougatf, Nina ; Sax, Ulrich ; Schapranow, Matthieu-Patrick

Predictive approaches for acute dialysis requirement and death in COVID-19 (2021)

Vaid, Akhil ; Chan, Lili ; Chaudhary, Kumardeep ; Jaladanki, Suraj K. ; Paranjpe, Ishan ; Russak, Adam J. ; Kia, Arash ; Timsina, Prem ; Levin, Matthew A. ; He, John Cijiang ; Böttinger, Erwin ; Charney, Alexander W. ; Fayad, Zahi A. ; Coca, Steven G. ; Glicksberg, Benjamin S. ; Nadkarni, Girish N.

Background and objectives AKI treated with dialysis initiation is a common complication of coronavirus disease 2019 (COVID-19) among hospitalized patients. However, dialysis supplies and personnel are often limited. Design, setting, participants, & measurements Using data from adult patients hospitalized with COVID-19 from five hospitals from theMount Sinai Health System who were admitted between March 10 and December 26, 2020, we developed and validated several models (logistic regression, Least Absolute Shrinkage and Selection Operator (LASSO), random forest, and eXtreme GradientBoosting [XGBoost; with and without imputation]) for predicting treatment with dialysis or death at various time horizons (1, 3, 5, and 7 days) after hospital admission. Patients admitted to theMount Sinai Hospital were used for internal validation, whereas the other hospitals formed part of the external validation cohort. Features included demographics, comorbidities, and laboratory and vital signs within 12 hours of hospital admission. Results A total of 6093 patients (2442 in training and 3651 in external validation) were included in the final cohort. Of the different modeling approaches used, XGBoost without imputation had the highest area under the receiver operating characteristic (AUROC) curve on internal validation (range of 0.93-0.98) and area under the precisionrecall curve (AUPRC; range of 0.78-0.82) for all time points. XGBoost without imputation also had the highest test parameters on external validation (AUROC range of 0.85-0.87, and AUPRC range of 0.27-0.54) across all time windows. XGBoost without imputation outperformed all models with higher precision and recall (mean difference in AUROC of 0.04; mean difference in AUPRC of 0.15). Features of creatinine, BUN, and red cell distribution width were major drivers of the model's prediction. Conclusions An XGBoost model without imputation for prediction of a composite outcome of either death or dialysis in patients positive for COVID-19 had the best performance, as compared with standard and other machine learning models.

Acute kidney injury in patients hospitalized with COVID-19 in New York City (2021)

Dellepiane, Sergio ; Vaid, Akhil ; Jaladanki, Suraj K. ; Coca, Steven ; Fayad, Zahi A. ; Charney, Alexander W. ; Böttinger, Erwin ; He, John Cijiang ; Glicksberg, Benjamin S. ; Chan, Lili ; Nadkarni, Girish

FIBER (2021)

Datta, Suparno ; Sachs, Jan Philipp ; Freitas da Cruz, Harry ; Martensen, Tom ; Bode, Philipp ; Morassi Sasso, Ariane ; Glicksberg, Benjamin S. ; Böttinger, Erwin

Objectives: The development of clinical predictive models hinges upon the availability of comprehensive clinical data. Tapping into such resources requires considerable effort from clinicians, data scientists, and engineers. Specifically, these efforts are focused on data extraction and preprocessing steps required prior to modeling, including complex database queries. A handful of software libraries exist that can reduce this complexity by building upon data standards. However, a gap remains concerning electronic health records (EHRs) stored in star schema clinical data warehouses, an approach often adopted in practice. In this article, we introduce the FlexIBle EHR Retrieval (FIBER) tool: a Python library built on top of a star schema (i2b2) clinical data warehouse that enables flexible generation of modeling-ready cohorts as data frames. Materials and Methods: FIBER was developed on top of a large-scale star schema EHR database which contains data from 8 million patients and over 120 million encounters. To illustrate FIBER's capabilities, we present its application by building a heart surgery patient cohort with subsequent prediction of acute kidney injury (AKI) with various machine learning models. Results: Using FIBER, we were able to build the heart surgery cohort (n = 12 061), identify the patients that developed AKI (n = 1005), and automatically extract relevant features (n = 774). Finally, we trained machine learning models that achieved area under the curve values of up to 0.77 for this exemplary use case. Conclusion: FIBER is an open-source Python library developed for extracting information from star schema clinical data warehouses and reduces time-to-modeling, helping to streamline the clinical modeling process.

Phe2vec (2021)

De Freitas, Jessica K. ; Johnson, Kipp W. ; Golden, Eddye ; Nadkarni, Girish N. ; Dudley, Joel T. ; Böttinger, Erwin ; Glicksberg, Benjamin S. ; Miotto, Riccardo

Robust phenotyping of patients from electronic health records (EHRs) at scale is a challenge in clinical informatics. Here, we introduce Phe2vec, an automated framework for disease phenotyping from EHRs based on unsupervised learning and assess its effectiveness against standard rule-based algorithms from Phenotype KnowledgeBase (PheKB). Phe2vec is based on pre-computing embeddings of medical concepts and patients' clinical history. Disease phenotypes are then derived from a seed concept and its neighbors in the embedding space. Patients are linked to a disease if their embedded representation is close to the disease phenotype. Comparing Phe2vec and PheKB cohorts head-to-head using chart review, Phe2vec performed on par or better in nine out of ten diseases. Differently from other approaches, it can scale to any condition and was validated against widely adopted expert-based standards. Phe2vec aims to optimize clinical informatics research by augmenting current frameworks to characterize patients by condition and derive reliable disease cohorts.

Certainty in QRS detection with artificial neural networks (2021)

Chromik, Jonas ; Pirl, Lukas ; Beilharz, Jossekin Jakob ; Arnrich, Bert ; Polze, Andreas

Detection of the QRS complex is a long-standing topic in the context of electrocardiography and many algorithms build upon the knowledge of the QRS positions. Although the first solutions to this problem were proposed in the 1970s and 1980s, there is still potential for improvements. Advancements in neural network technology made in recent years also lead to the emergence of enhanced QRS detectors based on artificial neural networks. In this work, we propose a method for assessing the certainty that is in each of the detected QRS complexes, i.e. how confident the QRS detector is that there is, in fact, a QRS complex in the position where it was detected. We further show how this metric can be utilised to distinguish correctly detected QRS complexes from false detections.

Causal inference in developmental medicine and neurology (2021)

Konigorski, Stefan

Interaction-based feature selection algorithm outperforms polygenic risk score in predicting Parkinson’s Disease status (2021)

Cope, Justin L. ; Baukmann, Hannes A. ; Klinger, Jörn E. ; Ravarani, Charles N. J. ; Böttinger, Erwin ; Konigorski, Stefan ; Schmidt, Marco F.

Polygenic risk scores (PRS) aggregating results from genome-wide association studies are the state of the art in the prediction of susceptibility to complex traits or diseases, yet their predictive performance is limited for various reasons, not least of which is their failure to incorporate the effects of gene-gene interactions. Novel machine learning algorithms that use large amounts of data promise to find gene-gene interactions in order to build models with better predictive performance than PRS. Here, we present a data preprocessing step by using data-mining of contextual information to reduce the number of features, enabling machine learning algorithms to identify gene-gene interactions. We applied our approach to the Parkinson's Progression Markers Initiative (PPMI) dataset, an observational clinical study of 471 genotyped subjects (368 cases and 152 controls). With an AUC of 0.85 (95% CI = [0.72; 0.96]), the interaction-based prediction model outperforms the PRS (AUC of 0.58 (95% CI = [0.42; 0.81])). Furthermore, feature importance analysis of the model provided insights into the mechanism of Parkinson's disease. For instance, the model revealed an interaction of previously described drug target candidate genes TMEM175 and GAPDHP25. These results demonstrate that interaction-based machine learning models can improve genetic prediction models and might provide an answer to the missing heritability problem.

Federated learning in a medical context (2021)

Pfitzner, Bjarne ; Steckhan, Nico ; Arnrich, Bert

Data privacy is a very important issue. Especially in fields like medicine, it is paramount to abide by the existing privacy regulations to preserve patients' anonymity. However, data is required for research and training machine learning models that could help gain insight into complex correlations or personalised treatments that may otherwise stay undiscovered. Those models generally scale with the amount of data available, but the current situation often prohibits building large databases across sites. So it would be beneficial to be able to combine similar or related data from different sites all over the world while still preserving data privacy. Federated learning has been proposed as a solution for this, because it relies on the sharing of machine learning models, instead of the raw data itself. That means private data never leaves the site or device it was collected on. Federated learning is an emerging research area, and many domains have been identified for the application of those methods. This systematic literature review provides an extensive look at the concept of and research into federated learning and its applicability for confidential healthcare datasets.

Outcomes of patients on maintenance dialysis hospitalized with COVID-19 (2021)

Chan, Lili ; Jaladanki, Suraj K. ; Somani, Sulaiman ; Paranjpe, Ishan ; Kumar, Arvind ; Zhao, Shan ; Kaufman, Lewis ; Leisman, Staci ; Sharma, Shuchita ; He, John Cijiang ; Murphy, Barbara ; Fayad, Zahi A. ; Levin, Matthew A. ; Böttinger, Erwin ; Charney, Alexander W. ; Glicksberg, Benjamin ; Coca, Steven G. ; Nadkarni, Girish N.

Refine

Has Fulltext

Author

Year of publication

Document Type

Language

Is part of the Bibliography

Keywords

Institute

226 search hits