Hasso-Plattner-Institut für Digital Engineering GmbH
Introduction:
Improving the surveillance of tuberculosis (TB) is especially important for multidrug-resistant (MDR) and extensively drug-resistant (XDR) TB. The large amount of publicly available whole genome sequencing (WGS) data for TB gives us the chance to re-use data and to perform additional analyses at a large scale.
Aim:
We assessed the usefulness of raw WGS data of global MDR/XDR Mycobacterium tuberculosis isolates available from public repositories to improve TB surveillance.
Methods:
We extracted raw WGS data and the related metadata of M. tuberculosis isolates available from the Sequence Read Archive. We compared this public dataset with WGS data and metadata of 131 MDR and XDR M. tuberculosis isolates from Germany in 2012 and 2013.
Results:
We aggregated a dataset that included 1,081 MDR and 250 XDR isolates, among which we identified 133 molecular clusters. In 16 clusters, the isolates were from at least two different countries. For example, Cluster 2 included 56 MDR/XDR isolates from Moldova, Georgia and Germany. When comparing the WGS data from Germany with the public dataset, we found that 11 clusters contained at least one isolate from Germany and at least one isolate from another country. We could, therefore, connect TB cases despite missing epidemiological information.
Conclusion:
We demonstrated the added value of using WGS raw data from public repositories to contribute to TB surveillance. Comparing the German with the public dataset, we identified potential international transmission events. Thus, using this approach might support the interpretation of national surveillance results in an international context.
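The abstract does not describe how the molecular clusters were defined; in TB genomic surveillance they are commonly derived by grouping isolates whose pairwise SNP distances fall below a fixed threshold. The sketch below illustrates that idea with single-linkage clustering over a small, invented distance matrix; the isolate names, distances, and the 5-SNP threshold are assumptions for illustration, not values from the study.

```python
# Minimal sketch: grouping M. tuberculosis isolates into molecular clusters
# by single-linkage clustering on pairwise SNP distances. The distance
# matrix, isolate names, and 5-SNP threshold are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

isolates = ["ISO-1", "ISO-2", "ISO-3", "ISO-4"]   # hypothetical isolates
snp_dist = np.array([[0, 3, 40, 42],              # pairwise SNP distances
                     [3, 0, 39, 41],
                     [40, 39, 0, 2],
                     [42, 41, 2, 0]])

# Condense the symmetric matrix and cluster with single linkage.
condensed = squareform(snp_dist, checks=False)
tree = linkage(condensed, method="single")

# Isolates joined at a distance of <= 5 SNPs form one molecular cluster.
labels = fcluster(tree, t=5, criterion="distance")
for name, cluster_id in zip(isolates, labels):
    print(f"{name}: cluster {cluster_id}")
```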
Importance:
Risk variants in the apolipoprotein L1 (APOL1 [OMIM 603743]) gene on chromosome 22 are common in individuals of West African ancestry and confer increased risk of kidney failure for people with African ancestry and hypertension.
Whether disclosing APOL1 genetic testing results to patients of African ancestry and their clinicians affects blood pressure, kidney disease screening, or patient behaviors is unknown.
Objective:
To determine the effects of testing and disclosing APOL1 genetic results to patients of African ancestry with hypertension and their clinicians.
Design, Setting, and Participants:
This pragmatic randomized clinical trial randomly assigned 2050 adults of African ancestry with hypertension and without existing chronic kidney disease in 2 US health care systems from November 1, 2014, through November 28, 2016; the final date of follow-up was January 16, 2018.
Patients were randomly assigned to undergo immediate (intervention) or delayed (waiting list control group) APOL1 testing in a 7:1 ratio. Statistical analysis was performed from May 1, 2018, to July 31, 2020.
Interventions:
Patients randomly assigned to the intervention group received APOL1 genetic testing results from trained staff; their clinicians received results through clinical decision support in electronic health records. Waiting list control patients received the results after their 12-month follow-up visit.
Main Outcomes and Measures:
Coprimary outcomes were the change in 3-month systolic blood pressure and 12-month urine kidney disease screening comparing intervention patients with high-risk APOL1 genotypes and those with low-risk APOL1 genotypes.
Secondary outcomes compared these outcomes between intervention group patients with high-risk APOL1 genotypes and controls. Exploratory analyses included psychobehavioral factors.
Results:
Among 2050 randomly assigned patients (1360 women [66%]; mean [SD] age, 53 [10] years), the baseline mean (SD) systolic blood pressure was significantly higher in patients with high-risk APOL1 genotypes vs those with low-risk APOL1 genotypes and controls (137 [21] vs 134 [19] vs 133 [19] mm Hg; P = .003 for high-risk vs low-risk APOL1 genotypes; P = .001 for high-risk APOL1 genotypes vs controls).
At 3 months, the mean (SD) change in systolic blood pressure was significantly greater in patients with high-risk APOL1 genotypes vs those with low-risk APOL1 genotypes (6 [18] vs 3 [18] mm Hg; P = .004) and controls (6 [18] vs 3 [19] mm Hg; P = .01).
At 12 months, there was a 12% increase in urine kidney disease testing among patients with high-risk APOL1 genotypes (from 39 of 234 [17%] to 68 of 234 [29%]) vs a 6% increase among those with low-risk APOL1 genotypes (from 278 of 1561 [18%] to 377 of 1561 [24%]; P = .10) and a 7% increase among controls (from 33 of 255 [13%] to 50 of 255 [20%]; P = .01). In response to testing, patients with high-risk APOL1 genotypes reported more changes in lifestyle (a subjective measure that included better dietary and exercise habits; 129 of 218 [59%] vs 547 of 1468 [37%]; P < .001) and increased blood pressure medication use (21 of 218 [10%] vs 68 of 1468 [5%]; P = .005) vs those with low-risk APOL1 genotypes; 1631 of 1686 (97%) declared they would get tested again.
Conclusions and relevance:
In this randomized clinical trial, disclosing APOL1 genetic testing results to patients of African ancestry with hypertension and their clinicians was associated with a greater reduction in systolic blood pressure, increased kidney disease screening, and positive self-reported behavior changes in those with high-risk genotypes.
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages, not only enabling analysis of NGS datasets without requiring any deep learning skills but also allowing advanced users to easily train and explain new models for genomics.
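The networks themselves are not specified in the abstract; as a rough illustration of the approach, the sketch below builds a small one-dimensional convolutional classifier over one-hot-encoded reads that outputs a probability that the source virus infects humans. The read length, layer sizes, and filter counts are arbitrary assumptions, not the architectures used in the study.

```python
# Minimal sketch: binary "infects human?" classifier over one-hot-encoded
# sequencing reads. Layer sizes and filter counts are illustrative
# assumptions, not the architectures from the study.
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(read: str) -> torch.Tensor:
    """Encode a DNA read as a (4, read_length) one-hot tensor."""
    x = torch.zeros(4, len(read))
    for i, base in enumerate(read):
        if base in BASES:              # ambiguous bases stay all-zero
            x[BASES[base], i] = 1.0
    return x

class ReadClassifier(nn.Module):
    def __init__(self, n_filters: int = 32, kernel_size: int = 9):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size)
        self.pool = nn.AdaptiveMaxPool1d(1)   # global max pooling over positions
        self.head = nn.Linear(n_filters, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)
        return torch.sigmoid(self.head(h))    # probability of human host

model = ReadClassifier()
read = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"   # toy read
prob = model(one_hot(read).unsqueeze(0))                # add batch dimension
print(f"Predicted probability of infecting humans: {prob.item():.3f}")
```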
Detecting anomalous subsequences in time series data is an important task in areas ranging from manufacturing and finance to health care monitoring. An anomaly can indicate important events, such as production faults, delivery bottlenecks, system defects, or cardiac fibrillation, and is therefore of central interest. Because time series are often large and exhibit complex patterns, data scientists have developed various specialized algorithms for the automatic detection of such anomalous patterns. The number and variety of anomaly detection algorithms has grown significantly in recent years and, because many of these solutions have been developed independently and by different research communities, there is no comprehensive study that systematically evaluates and compares the different approaches. For this reason, choosing the best detection technique for a given anomaly detection task is a difficult challenge.
This comprehensive, scientific study carefully evaluates most state-of-the-art anomaly detection algorithms. We collected and re-implemented 71 anomaly detection algorithms from different domains and evaluated them on 976 time series datasets. The algorithms have been selected from different algorithm families and detection approaches to represent the entire spectrum of anomaly detection techniques. In the paper, we provide a concise overview of the techniques and their commonalities; we evaluate their individual strengths and weaknesses and thereby consider factors such as effectiveness, efficiency, and robustness. Our experimental results should ease the algorithm selection problem and open up new research directions.
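As a deliberately simple example of one detector family covered by such a benchmark, the sketch below scores each point of a univariate series by its deviation from a sliding-window mean and flags points above a fixed threshold; the window size, threshold, and injected anomaly are arbitrary illustrative choices, not part of the study.

```python
# Minimal sketch of a simple anomaly detector: a sliding-window z-score over
# a univariate time series. Window size and threshold are arbitrary choices.
import numpy as np

def sliding_zscore(series: np.ndarray, window: int = 20) -> np.ndarray:
    """Anomaly score per point: |x - local mean| / local std."""
    scores = np.zeros_like(series, dtype=float)
    for i in range(len(series)):
        ctx = series[max(0, i - window):i + 1]
        std = ctx.std() or 1.0            # avoid division by zero
        scores[i] = abs(series[i] - ctx.mean()) / std
    return scores

rng = np.random.default_rng(0)
ts = np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.1, 500)
ts[300:305] += 3.0                        # inject a short anomalous subsequence

scores = sliding_zscore(ts)
anomalies = np.where(scores > 3.0)[0]     # simple fixed threshold
print("Flagged indices:", anomalies)
```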
Looking back and forward (2021)
Many studies have shown that abdominal adiposity is more strongly related to health risks than peripheral adiposity. However, the underlying pathways are still poorly understood. In this cross-sectional study using data from RNA-sequencing experiments and whole-body MRI scans of 200 participants in the EPIC-Potsdam cohort, our aim was to identify novel genes whose gene expression in subcutaneous adipose tissue has an effect on body fat mass (BFM) and body fat distribution (BFD). The analysis identified 625 genes associated with adiposity, of which 531 encode a known protein and 487 are novel candidate genes for obesity. Enrichment analyses indicated that BFM-associated genes were characterized by their higher than expected involvement in cellular, regulatory and immune system processes, and BFD-associated genes by their involvement in cellular, metabolic, and regulatory processes. Mendelian Randomization analyses suggested that the gene expression of 69 genes was causally related to BFM and BFD. Six genes were replicated in UK Biobank. In this study, we identified novel genes for BFM and BFD that are BFM- and BFD-specific, involved in different molecular processes, and whose up-/downregulated gene expression may causally contribute to obesity.
A large-scale GWAS provides insight on diabetes-dependent genetic effects on the glomerular filtration rate, a common metric to monitor kidney health in disease.
Reduced glomerular filtration rate (GFR) can progress to kidney failure. Risk factors include genetics and diabetes mellitus (DM), but little is known about their interaction. We conducted genome-wide association meta-analyses for estimated GFR based on serum creatinine (eGFR), separately for individuals with or without DM (n(DM) = 178,691, n(noDM) = 1,296,113). Our genome-wide searches identified (i) seven eGFR loci with a significant DM/noDM difference, (ii) four additional novel loci with a suggestive difference and (iii) 28 further novel loci (including CUBN) by allowing for potential difference. GWAS on eGFR among DM individuals identified 2 known and 27 potentially responsible loci for diabetic kidney disease. Gene prioritization highlighted 18 genes that may inform reno-protective drug development. We highlight the existence of DM-only and noDM-only effects, which can inform the choice of target group if the respective genes are advanced as drug targets. Largely shared effects suggest that most drug interventions to alter eGFR should be effective in DM and noDM.
The QT interval is a heritable electrocardiographic measure associated with arrhythmia risk when prolonged. Here, the authors used a series of genetic analyses to identify genetic loci, pathways, therapeutic targets, and relationships with cardiovascular disease.
The QT interval is an electrocardiographic measure representing the sum of ventricular depolarization and repolarization, estimated by QRS duration and JT interval, respectively. QT interval abnormalities are associated with potentially fatal ventricular arrhythmia. Using genome-wide multi-ancestry analyses (>250,000 individuals), we identify 177, 156 and 121 independent loci for QT, JT and QRS, respectively, including a male-specific X-chromosome locus. Using gene-based rare-variant methods, we identify associations with Mendelian disease genes. Enrichments are observed in established pathways for QT and JT, and previously unreported genes implicated in insulin-receptor signalling and cardiac energy metabolism. In contrast, for QRS, connective tissue components and processes for cell growth and extracellular matrix interactions are significantly enriched. We demonstrate polygenic risk score associations with atrial fibrillation, conduction disease and sudden cardiac death. Prioritization of druggable genes highlights potential therapeutic targets for arrhythmia. Together, these results substantially advance our understanding of the genetic architecture of ventricular depolarization and repolarization.
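The polygenic risk scores mentioned above are, at their core, weighted sums of risk-allele dosages. A minimal sketch, with made-up variant IDs, effect sizes, and dosages standing in for GWAS summary statistics and one individual's genotypes:

```python
# Minimal sketch: a polygenic risk score as a weighted sum of risk-allele
# dosages. Variant IDs, effect sizes, and dosages are made-up illustrative
# values, not figures from the study.
effect_sizes = {            # beta per variant from GWAS summary statistics
    "rs0001": 0.12,
    "rs0002": -0.05,
    "rs0003": 0.08,
}
dosages = {                 # risk-allele dosage (0-2) for one individual
    "rs0001": 2,
    "rs0002": 1,
    "rs0003": 0,
}

prs = sum(effect_sizes[v] * dosages[v] for v in effect_sizes)
print(f"Polygenic risk score: {prs:.3f}")   # 0.12*2 - 0.05*1 + 0.08*0 = 0.19
```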
Process mining techniques can be used to analyse business processes using the data logged during their execution.
These techniques are leveraged in a wide range of domains, including healthcare, where the focus is mainly on the analysis of diagnostic, treatment, and organisational processes.
Despite the huge amount of data generated in hospitals by staff and machinery involved in healthcare processes, there is no evidence of a systematic uptake of process mining beyond targeted case studies in a research context.
When developing and using process mining in healthcare, distinguishing characteristics of healthcare processes such as their variability and patient-centred focus require targeted attention.
Against this background, the Process-Oriented Data Science in Healthcare Alliance has been established to propagate the research and application of techniques targeting the data-driven improvement of healthcare processes.
This paper, an initiative of the alliance, presents the distinguishing characteristics of the healthcare domain that need to be considered to successfully use process mining, as well as open challenges that need to be addressed by the community in the future.
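To make the starting point concrete: process mining operates on an event log of (case, activity, timestamp) records and derives process structure from it. The sketch below builds a simple directly-follows graph from an invented log of patient treatment events; the cases and activities are illustrative only, not data from any study.

```python
# Minimal sketch: deriving a directly-follows graph from an event log.
# The log below (case id, activity, timestamp) is an invented illustration
# of a patient treatment process.
from collections import Counter

event_log = [
    ("patient-1", "Admission", "2021-03-01T08:00"),
    ("patient-1", "Triage",    "2021-03-01T08:20"),
    ("patient-1", "Treatment", "2021-03-01T10:00"),
    ("patient-1", "Discharge", "2021-03-02T12:00"),
    ("patient-2", "Admission", "2021-03-01T09:00"),
    ("patient-2", "Treatment", "2021-03-01T11:30"),
    ("patient-2", "Discharge", "2021-03-01T16:00"),
]

# Group events per case in chronological order.
cases = {}
for case_id, activity, timestamp in sorted(event_log, key=lambda e: (e[0], e[2])):
    cases.setdefault(case_id, []).append(activity)

# Count which activity directly follows which across all cases.
directly_follows = Counter()
for trace in cases.values():
    for a, b in zip(trace, trace[1:]):
        directly_follows[(a, b)] += 1

for (a, b), count in directly_follows.items():
    print(f"{a} -> {b}: {count}")
```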
Combining robust proteomics instrumentation with high-throughput liquid chromatography (e.g., the timsTOF Pro and the Evosep One system, respectively) enabled mapping the proteomes of thousands of samples. FragPipe is one of the few computational protein identification and quantification frameworks that allows for the time-efficient analysis of such large data sets. However, it requires large amounts of computational power and data storage space that leave even state-of-the-art workstations underpowered when it comes to the analysis of proteomics data sets with thousands of LC-mass spectrometry runs. To address this issue, we developed and optimized a FragPipe-based analysis strategy for a high-performance computing environment and analyzed 3348 plasma samples (6.4 TB) that were longitudinally collected from hospitalized COVID-19 patients under the auspices of the Immunophenotyping Assessment in a COVID-19 Cohort (IMPACC) study. Our parallelization strategy reduced the total runtime by approximately 90%, from 116 (theoretical) days to just 9 days in the high-performance computing environment. All code is open-source and can be deployed in any Simple Linux Utility for Resource Management (SLURM) high-performance computing environment, enabling the analysis of large-scale, high-throughput proteomics studies.
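The abstract does not spell out the parallelization strategy itself; the sketch below illustrates the general pattern on SLURM: partition the runs into batches and submit each batch as one task of a job array. The paths, batch size, manifest scheme, and the analyze_batch.sh script it refers to are hypothetical placeholders, not the actual pipeline configuration used in the study.

```python
# Minimal sketch of splitting a large set of LC-MS runs into batches and
# submitting each batch as a SLURM array task. Paths, batch size, and the
# per-batch analysis command are hypothetical placeholders.
import subprocess
from pathlib import Path

RUNS_DIR = Path("/data/impacc/raw")          # hypothetical input directory
BATCH_SIZE = 100                             # runs per array task (assumption)

runs = sorted(RUNS_DIR.glob("*.d"))          # timsTOF runs are .d directories
batches = [runs[i:i + BATCH_SIZE] for i in range(0, len(runs), BATCH_SIZE)]

# Write one manifest file per batch; each array task reads its own manifest.
for idx, batch in enumerate(batches):
    manifest = Path(f"batch_{idx:04d}.txt")
    manifest.write_text("\n".join(str(r) for r in batch))

# Submit an array job; analyze_batch.sh (not shown) would run the search
# pipeline on the manifest matching its $SLURM_ARRAY_TASK_ID.
subprocess.run(
    ["sbatch", f"--array=0-{len(batches) - 1}", "analyze_batch.sh"],
    check=True,
)
```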