Hasso-Plattner-Institut für Digital Engineering GmbH
Data preparation is a cornerstone of data science workflows, consuming a significant portion, approximately 80%, of a data scientist's time. This time consumption is primarily due to the challenge of devising tailored solutions for downstream tasks. The complexity is further magnified by the frequent lack of metadata, the often ad-hoc nature of preparation tasks, and the need for data scientists to master a diverse range of sophisticated tools, each with its own intricacies and demands for proficiency.
Previous research in data management has traditionally concentrated on preparing the content within the columns and rows of a relational table, addressing tasks such as string disambiguation, date standardization, or numeric value normalization, commonly referred to as data cleaning. This focus assumes a perfectly structured input table. Consequently, these data cleaning tasks can be applied effectively only after the table has been successfully loaded into the respective data cleaning environment, typically in the later stages of the data processing pipeline.
While current data cleaning tools are well-suited for relational tables, extensive data repositories frequently contain data stored in plain text files, such as CSV files, due to their adaptable standard. Consequently, these files often exhibit tables with a flexible layout of rows and columns, lacking a relational structure. This flexibility often results in data being distributed across cells in arbitrary positions, typically guided by user-specified formatting guidelines.
Effectively extracting and leveraging these tables in subsequent processing stages necessitates accurate parsing. This thesis emphasizes what we define as the “structure” of a data file—the fundamental characters within a file essential for parsing and comprehending its content. Concentrating on the initial stages of the data preprocessing pipeline, this thesis addresses two crucial aspects: comprehending the structural layout of a table within a raw data file and automatically identifying and rectifying any structural issues that might hinder its parsing. Although these issues may not directly impact the table's content, they pose significant challenges in parsing the table within the file.
Our initial contribution comprises an extensive survey of commercially available data preparation tools. This survey thoroughly examines their distinct features, the features they lack, and the preliminary data processing that remains necessary despite these tools. The primary goal is to elucidate the current state of the art in data preparation systems while identifying areas for enhancement. Furthermore, the survey explores the challenges encountered in data preprocessing, emphasizing opportunities for future research and improvement.
Next, we propose a novel data preparation pipeline designed for detecting and correcting structural errors. The aim of this pipeline is to assist users at the initial preprocessing stage by ensuring the correct loading of their data into their preferred systems. Our approach begins by introducing SURAGH, an unsupervised system that utilizes a pattern-based method to identify dominant patterns within a file, independent of external information, such as data types, row structures, or schemata. By identifying deviations from the dominant pattern, it detects ill-formed rows. Subsequently, our structure correction system, TASHEEH, gathers the identified ill-formed rows along with dominant patterns and employs a novel pattern transformation algebra to automatically rectify errors. Our pipeline serves as an end-to-end solution, transforming a structurally broken CSV file into a well-formatted one, usually suitable for seamless loading.
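The pattern-based detection idea can be illustrated with a deliberately simplified sketch (a toy illustration, not SURAGH's actual pattern abstraction): each row is abstracted into a coarse pattern over its delimited fields, the most frequent pattern is taken as dominant, and rows deviating from it are flagged as ill-formed.

```python
from collections import Counter

def row_pattern(line, delimiter=","):
    """Abstract a raw CSV line into a coarse pattern: one symbol per field,
    'D' for numeric fields, 'T' for other text, 'E' for empty fields."""
    symbols = []
    for field in line.rstrip("\n").split(delimiter):
        field = field.strip()
        if not field:
            symbols.append("E")
        elif field.replace(".", "", 1).replace("-", "", 1).isdigit():
            symbols.append("D")
        else:
            symbols.append("T")
    return "".join(symbols)

def find_ill_formed(lines, delimiter=","):
    """Return the dominant pattern and the indices of deviating rows."""
    patterns = [row_pattern(l, delimiter) for l in lines]
    dominant, _ = Counter(patterns).most_common(1)[0]
    return dominant, [i for i, p in enumerate(patterns) if p != dominant]

rows = [
    "id,name,score",   # header: its pattern differs from the data rows
    "1,alice,3.5",
    "2,bob,4.0",
    "# comment line",  # structural noise
    "3,carol,2.8",
]
dominant, bad = find_ill_formed(rows)  # dominant "DTD"; rows 0 and 3 flagged
```

A real system would additionally need to infer the delimiter, handle quoting, and use a much richer pattern language; this sketch only conveys the unsupervised, frequency-based core of the idea.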
Finally, we introduce MORPHER, a user-friendly GUI integrating the functionalities of both SURAGH and TASHEEH. This interface empowers users to access the pipeline's features through visual elements. Our extensive experiments demonstrate the effectiveness of our data preparation systems, requiring no user involvement. Both SURAGH and TASHEEH outperform existing state-of-the-art methods significantly in both precision and recall.
Background:
The current range of disease-modifying treatments (DMTs) for relapsing-remitting multiple sclerosis (RRMS) has placed more importance on the accurate monitoring of disease progression for timely and appropriate treatment decisions. With a rising number of measures of disease progression, it is currently unclear how well these measures, alone or in combination, can monitor more mildly affected RRMS patients.
Objectives:
To investigate several composite measures for monitoring disease activity and their potential relation to the biomarker neurofilament light chain (NfL) in a clearly defined early RRMS patient cohort with a milder disease course.
Methods:
From a total of 301 RRMS patients, a subset of 46 patients being treated with a continuous first-line therapy was analyzed for loss of no evidence of disease activity (lo-NEDA-3) status, relapse-associated worsening (RAW) and progression independent of relapse activity (PIRA), up to seven years after treatment initialization.
Kaplan-Meier estimates were used for time-to-event analysis. Additionally, a Cox regression model was used to analyze the effect of NfL levels on outcome measures in this cohort.
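The Kaplan-Meier product-limit idea used here can be sketched in a few lines (on made-up follow-up data, purely for exposition, not the study's actual analysis): at every observed event time, the survival estimate is multiplied by one minus the fraction of patients still at risk who experienced the event.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.
    times: follow-up time per patient; events: 1 = event observed, 0 = censored.
    Returns (time, survival probability) pairs at each event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        while i < len(data) and data[i][0] == t:  # aggregate ties at time t
            at_t += 1
            deaths += data[i][1]
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= at_t  # events and censored patients both leave the risk set
    return curve

# toy follow-up data: (months, event flag)
times  = [6, 12, 12, 24, 36, 48]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Note how the censored patient at 12 months lowers the risk set without lowering the curve; that asymmetry is the whole point of the estimator.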
Results:
In this mildly affected cohort, both lo-NEDA-3 and PIRA frequently occurred over a median observational period of 67.2 months and were observed in 39 (84.8%) and 23 (50.0%) patients, respectively.
Additionally, 12 out of 26 PIRA manifestations (46.2%) were observed without a corresponding lo-NEDA-3 status. Taken together, either PIRA or lo-NEDA-3 indicated disease activity in all patients followed up for at least the median duration (67.2 months). NfL values demonstrated an association with the occurrence of relapses and RAW.
Conclusion:
The complementary use of different disease progression measures helps mirror ongoing disease activity in mildly affected early RRMS patients being treated with continuous first-line therapy.
Rapid innovation and proliferation of software as a medical device have accelerated the clinical use of digital technologies across a wide array of medical conditions.
Current regulatory pathways were developed for traditional (hardware) medical devices and offer a useful structure, but the evolution of digital devices requires concomitant innovation in regulatory approaches to maximize the potential benefits of these emerging technologies.
A number of specific adaptations could strengthen current regulatory oversight while promoting ongoing innovation.
An epilepsy diagnosis has large consequences for an individual but is often difficult to make in clinical practice.
Novel biomarkers are thus greatly needed. Here, we give an overview of how thousands of common genetic factors that increase the risk for epilepsy can be summarized as epilepsy polygenic risk scores (PRS).
We discuss the current state of research on how epilepsy PRS can serve as a biomarker for the risk for epilepsy. The high heritability of common forms of epilepsy, particularly genetic generalized epilepsy, indicates a promising potential for epilepsy PRS in diagnosis and risk prediction.
Small sample sizes and low ancestral diversity of current epilepsy genome-wide association studies show, however, a need for larger and more diverse studies before epilepsy PRS could be properly implemented in the clinic.
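The PRS construction summarized above can be illustrated minimally (hypothetical variant IDs and effect sizes, purely for exposition): a PRS is typically a weighted sum of an individual's risk-allele counts, with weights taken from GWAS effect-size estimates.

```python
def polygenic_risk_score(genotype, weights):
    """PRS as the weighted sum of risk-allele counts across variants.
    genotype: variant -> risk-allele count (0, 1, or 2)
    weights:  variant -> effect size (e.g. GWAS log odds ratio)"""
    return sum(weights[v] * genotype.get(v, 0) for v in weights)

# hypothetical variants and effect sizes, for illustration only
weights  = {"rs_a": 0.12, "rs_b": 0.05, "rs_c": 0.30}
person_1 = {"rs_a": 2, "rs_b": 1, "rs_c": 0}   # PRS ~ 0.29
person_2 = {"rs_a": 0, "rs_b": 2, "rs_c": 2}   # PRS ~ 0.70
```

Real epilepsy PRS aggregate thousands to millions of variants and require clumping or shrinkage of GWAS weights; the sketch only shows the underlying weighted sum.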
SONAR
(2023)
Accurate and comprehensive nursing documentation is essential to ensure quality patient care. To streamline this process, we present SONAR, a publicly available dataset of nursing activities recorded using inertial sensors in a nursing home. The dataset includes 14 sensor streams, such as acceleration and angular velocity, and 23 activities recorded by 14 caregivers using five sensors for 61.7 hours. The caregivers wore the sensors as they performed their daily tasks, allowing for continuous monitoring of their activities. We additionally provide machine learning models that recognize the nursing activities given the sensor data. In particular, we present benchmarks for three deep learning model architectures and evaluate their performance using different metrics and sensor locations. Our dataset, which can be used for research on sensor-based human activity recognition in real-world settings, has the potential to improve nursing care by providing valuable insights that can identify areas for improvement, facilitate accurate documentation, and tailor care to specific patient conditions.
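Sensor streams like these are typically fed to activity-recognition models in fixed-size overlapping windows. A minimal sketch of that segmentation step (generic practice, not SONAR's published preprocessing):

```python
def sliding_windows(stream, size, step):
    """Segment a sensor stream (list of samples) into fixed-size,
    overlapping windows, as is standard for sensor-based HAR models."""
    return [stream[i:i + size] for i in range(0, len(stream) - size + 1, step)]

# toy single-axis "acceleration" stream of 10 samples
stream = list(range(10))
windows = sliding_windows(stream, size=4, step=2)
# yields [[0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9]]
```

Each window would then be labeled with the activity performed during it and passed to the model; window size and overlap are tuning choices that interact with the sampling rate.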
Background:
The medical care of patients with myositis is a great challenge in clinical practice. This is due to the rarity of these diseases, the complexity of diagnosis and management, and the lack of systematic analyses.
Objectives:
Therefore, the aim of this project was to obtain an overview of the current care of myositis patients in Germany and to evaluate epidemiological trends in recent years.
Methods:
In collaboration with BARMER Insurance, a retrospective analysis of outpatient and inpatient data from an average of approximately 8.7 million insured patients between January 2005 and December 2019 was performed, using ICD-10 codes for myositis to identify relevant data.
In addition, a comparative analysis was performed between myositis patients and an age-matched comparison group from other populations insured by BARMER.
Results:
45,800 BARMER-insured individuals received a diagnosis of myositis during the observation period, with a relatively stable prevalence throughout. With regard to comorbidities, a significantly higher rate of cardiovascular disease as well as neoplasm was observed compared to the control group within the BARMER-insured population. In addition, myositis patients suffer more frequently from psychiatric disorders, such as depression and somatoform disorders.
However, the ICD-10 catalogue only includes specific codes for "dermatomyositis" and "polymyositis" and thus does not allow for a sufficient analysis of all idiopathic inflammatory myopathy subtypes.
Conclusion:
The current data provide a comprehensive epidemiological analysis of myositis in Germany, highlighting the multimorbidity of myositis patients. This underlines the need for multidisciplinary management. However, the ICD-10 codes currently still in use do not allow for specific analysis of the subtypes of myositis.
The upcoming ICD-11 coding may improve future analyses in this regard.
Over the past years, NGS has become a crucial workhorse for open-view pathogen diagnostics.
Yet, long turnaround times result from using massively parallel high-throughput technologies as the analysis can only be performed after sequencing has finished. The interpretation of results can further be challenged by contaminations, clinically irrelevant sequences, and the sheer amount and complexity of the data.
We implemented PathoLive, a real-time diagnostics pipeline for the detection of pathogens from clinical samples hours before sequencing has finished.
Based on real-time alignment with HiLive2, mappings are scored with respect to common contaminations, low-entropy areas, and sequences of widespread, non-pathogenic organisms.
The results are visualized using an interactive taxonomic tree that provides an easily interpretable overview of the relevance of hits. For a human plasma sample that was spiked in vitro with six pathogenic viruses, all agents were clearly detected after only 40 of 200 sequencing cycles.
For a real-world sample from Sudan, the results correctly indicated the presence of Crimean-Congo hemorrhagic fever virus. In a second real-world dataset from the 2019 SARS-CoV-2 outbreak in Wuhan, we found the presence of a SARS coronavirus as the most relevant hit without the novel virus reference genome being included in the database.
For all samples, clinically irrelevant hits were correctly de-emphasized.
Our approach enables fast and accurate NGS-based pathogen identification and correctly prioritizes and visualizes hits based on their clinical significance. PathoLive is open source and available on GitLab and BioConda.
This vision article outlines the main building blocks of what we term AI Compliance, an effort to bridge two complementary research areas: computer science and the law.
Such research aims to model, measure, and affect the quality of AI artifacts, such as data, models, and applications, in order to facilitate adherence to legal standards.
DrDimont: explainable drug response prediction from differential analysis of multi-omics networks
(2022)
Motivation:
While it has been well established that drugs affect and help patients differently, personalized drug response predictions remain challenging.
Solutions based on single omics measurements have been proposed, and networks provide means to incorporate molecular interactions into reasoning.
However, how to integrate the wealth of information contained in multiple omics layers still poses a complex problem.
Results:
We present DrDimont, Drug response prediction from Differential analysis of multi-omics networks.
It allows for comparative conclusions between two conditions and translates them into differential drug response predictions.
DrDimont focuses on molecular interactions.
It establishes condition-specific networks from correlation within an omics layer that are then reduced and combined into heterogeneous, multi-omics molecular networks. A novel semi-local, path-based integration step ensures integrative conclusions. Differential predictions are derived from comparing the condition-specific integrated networks.
DrDimont's predictions are explainable, i.e. molecular differences that are the source of high differential drug scores can be retrieved. We predict differential drug response in breast cancer using transcriptomics, proteomics, phosphosite and metabolomics measurements and contrast estrogen receptor positive and receptor negative patients. DrDimont performs better than drug prediction based on differential protein expression or PageRank when evaluating it on ground truth data from cancer cell lines. We find proteomic and phosphosite layers to carry most information for distinguishing drug response.
Residential segregation is a widespread phenomenon that can be observed in almost every major city.
In these urban areas, residents with different racial or socioeconomic backgrounds tend to form homogeneous clusters.
Schelling's famous agent-based model for residential segregation explains how such clusters can form even if all agents are tolerant, i.e., if they agree to live in mixed neighborhoods.
For segregation to occur, all it needs is a slight bias towards agents preferring similar neighbors.
Very recently, Schelling's model has been investigated from a game-theoretic point of view with selfish agents that strategically select their residential location.
In these games, agents can improve on their current location by performing a location swap with another agent who is willing to swap.
We significantly deepen these investigations by studying the influence of the underlying topology modeling the residential area on the existence of equilibria, the Price of Anarchy and on the dynamic properties of the resulting strategic multi-agent system. Moreover, as a new conceptual contribution, we also consider the influence of locality, i.e., if the location swaps are restricted to swaps of neighboring agents.
We give improved almost tight bounds on the Price of Anarchy for arbitrary underlying graphs and we present (almost) tight bounds for regular graphs, paths and cycles. Moreover, we give almost tight bounds for grids, which are commonly used in empirical studies.
For grids we also show that locality has a severe impact on the game dynamics.
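The swap dynamics described above can be sketched for the simplest topology, a cycle (a toy instance for illustration, not the paper's general analysis): agents of two types occupy positions on a ring, an agent's utility is the fraction of its neighbors sharing its type, and a swap happens only if both participants strictly gain.

```python
def utility(layout, pos):
    """Fraction of same-type neighbors of the agent at `pos` on a cycle."""
    n = len(layout)
    neighbors = [layout[(pos - 1) % n], layout[(pos + 1) % n]]
    return neighbors.count(layout[pos]) / len(neighbors)

def improving_swap(layout):
    """Return the layout after one swap that strictly benefits both agents,
    or None if the layout is a swap equilibrium."""
    n = len(layout)
    for i in range(n):
        for j in range(i + 1, n):
            if layout[i] == layout[j]:
                continue
            swapped = layout[:]
            swapped[i], swapped[j] = swapped[j], swapped[i]
            # the agent from i now sits at j, and vice versa
            if (utility(swapped, j) > utility(layout, i)
                    and utility(swapped, i) > utility(layout, j)):
                return swapped
    return None

def swap_dynamics(layout):
    """Iterate improving swaps until an equilibrium is reached."""
    while (nxt := improving_swap(layout)) is not None:
        layout = nxt
    return layout

alternating = list("ABABABAB")  # no agent has a same-type neighbor
stable = swap_dynamics(alternating)
```

On the cycle, every accepted swap strictly increases the number of same-type edges, so this potential argument guarantees termination; agents end up clustered even though each would already accept a mixed neighborhood.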