TY  - THES
A1  - Perscheid, Cindy
T1  - Integrative biomarker detection using prior knowledge on gene expression data sets
T1  - Integrative Biomarker-Erkennung auf Genexpressions-Daten mithilfe von biologischem Vorwissen
N2  - Gene expression data is analyzed to identify biomarkers, e.g. relevant genes, which serve for diagnostic, predictive, or prognostic use. Traditional approaches for biomarker detection select distinctive features from the data based exclusively on the signals therein, facing multiple shortcomings in regards to overfitting, biomarker robustness, and actual biological relevance. Prior knowledge approaches are expected to address these issues by incorporating prior biological knowledge, e.g. on gene-disease associations, into the actual analysis. However, prior knowledge approaches are currently not widely applied in practice because they are often use-case specific and seldom applicable in a different scope. This leads to a lack of comparability of prior knowledge approaches, which in turn makes it currently impossible to assess their effectiveness in a broader context.

Our work addresses the aforementioned issues with three contributions. Our first contribution provides formal definitions for both prior knowledge and the flexible integration thereof into the feature selection process. Central to these concepts is the automatic retrieval of prior knowledge from online knowledge bases, which allows for streamlining the retrieval process and agreeing on a uniform definition for prior knowledge. We subsequently describe novel and generalized prior knowledge approaches that are flexible regarding the used prior knowledge and applicable to varying use case domains. Our second contribution is the benchmarking platform Comprior. Comprior applies the aforementioned concepts in practice and allows for flexibly setting up comprehensive benchmarking studies for examining the performance of existing and novel prior knowledge approaches. It streamlines the retrieval of prior knowledge and allows for combining it with prior knowledge approaches. Comprior demonstrates the practical applicability of our concepts and further fosters the overall development and comparability of prior knowledge approaches. Our third contribution is a comprehensive case study on the effectiveness of prior knowledge approaches. For that, we used Comprior and tested a broad range of both traditional and prior knowledge approaches in combination with multiple knowledge bases on data sets from multiple disease domains. Ultimately, our case study constitutes a thorough assessment of a) the suitability of selected knowledge bases for integration, b) the impact of prior knowledge being applied at different integration levels, and c) the improvements in terms of classification performance, biological relevance, and overall robustness.

In summary, our contributions demonstrate that generalized concepts for prior knowledge and a streamlined retrieval process improve the applicability of prior knowledge approaches. Results from our case study show that the integration of prior knowledge positively affects biomarker results, particularly regarding their robustness. Our findings provide the first in-depth insights on the effectiveness of prior knowledge approaches and build a valuable foundation for future research.
N2  - Biomarker sind charakteristische biologische Merkmale mit diagnostischer oder prognostischer Aussagekraft. Auf der molekularen Ebene sind dies Gene mit einem krankheitsspezifischen Expressionsmuster, welche mittels der Analyse von Genexpressionsdaten identifiziert werden. Traditionelle Ansätze für diese Art von Biomarker Detection wählen Gene als Biomarker ausschließlich anhand der vorhandenen Signale im Datensatz aus. Diese Vorgehensweise zeigt jedoch Schwächen insbesondere in Bezug auf die Robustheit und tatsächliche biologische Relevanz der identifizierten Biomarker. Verschiedene Forschungsarbeiten legen nahe, dass die Berücksichtigung des biologischen Kontexts während des Selektionsprozesses diese Schwächen ausgleichen kann. Sogenannte wissensbasierte Ansätze für Biomarker Detection beziehen vorhandenes biologisches Wissen, beispielsweise über Zusammenhänge zwischen bestimmten Genen und Krankheiten, direkt in die Analyse mit ein. Die Anwendung solcher Verfahren ist in der Praxis jedoch derzeit nicht weit verbreitet, da existierende Methoden oft spezifisch für einen bestimmten Anwendungsfall entwickelt wurden und sich nur mit großem Aufwand auf andere Anwendungsgebiete übertragen lassen. Dadurch sind Vergleiche untereinander kaum möglich, was es wiederum nicht erlaubt die Effektivität von wissensbasierten Methoden in einem breiteren Kontext zu untersuchen.

Die vorliegende Arbeit befasst sich mit den vorgenannten Herausforderungen für wissensbasierte Ansätze. In einem ersten Schritt legen wir formale und einheitliche Definitionen für vorhandenes biologisches Wissen sowie ihre flexible Integration in den Biomarker-Auswahlprozess fest. Der Kerngedanke unseres Ansatzes ist die automatisierte Beschaffung von biologischem Wissen aus im Internet frei verfügbaren Wissens-Datenbanken. Dies erlaubt eine Vereinfachung der Kuratierung sowie die Festlegung einer einheitlichen Definition für biologisches Wissen. Darauf aufbauend beschreiben wir generalisierte wissensbasierte Verfahren, welche flexibel auf verschiedene Anwendungsfälle anwendbar sind. In einem zweiten Schritt haben wir die Benchmarking-Plattform Comprior entwickelt, welche unsere theoretischen Konzepte in einer praktischen Anwendung realisiert. Comprior ermöglicht die schnelle Umsetzung von umfangreichen Experimenten für den Vergleich von wissensbasierten Ansätzen. Comprior übernimmt die Beschaffung von biologischem Wissen und ermöglicht dessen beliebige Kombination mit wissensbasierten Ansätzen. Comprior demonstriert damit die praktische Umsetzbarkeit unserer theoretischen Konzepte und unterstützt zudem die technische Realisierung und Vergleichbarkeit wissensbasierter Ansätze. In einem dritten Schritt untersuchen wir die Effektivität wissensbasierter Ansätze im Rahmen einer umfangreichen Fallstudie. Mithilfe von Comprior vergleichen wir die Ergebnisse traditioneller und wissensbasierter Ansätze im Kontext verschiedener Krankheiten, wobei wir für wissensbasierte Ansätze auch verschiedene Wissens-Datenbanken verwenden. Unsere Fallstudie untersucht damit a) die Eignung von ausgewählten Wissens-Datenbanken für deren Einsatz bei wissensbasierten Ansätzen, b) den Einfluss verschiedener Integrationskonzepte für biologisches Wissen auf den Biomarker-Auswahlprozess, und c) den Grad der Verbesserung in Bezug auf die Klassifikationsleistung, biologische Relevanz und allgemeine Robustheit der selektierten Biomarker.

Zusammenfassend demonstriert unsere Arbeit, dass generalisierte Konzepte für biologisches Wissen und dessen vereinfachte Kuration die praktische Anwendbarkeit von wissensbasierten Ansätzen erleichtern. Die Ergebnisse unserer Fallstudie zeigen, dass die Integration von vorhandenem biologischen Wissen einen positiven Einfluss auf die selektierten Biomarker hat, insbesondere in Bezug auf ihre biologische Relevanz. Diese erstmals umfassenderen Erkenntnisse zur Effektivität von wissensbasierten Ansätzen bilden eine wertvolle Grundlage für zukünftige Forschungsarbeiten.
KW  - gene expression
KW  - biomarker detection
KW  - prior knowledge
KW  - feature selection
KW  - Biomarker-Erkennung
KW  - Merkmalsauswahl
KW  - Gen-Expression
KW  - biologisches Vorwissen
Y1  - 2023
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-582418
ER  - 
TY  - GEN
A1  - Perscheid, Cindy
T1  - Comprior: facilitating the implementation and automated benchmarking of prior knowledge-based feature selection approaches on gene expression data sets
T2  - Zweitveröffentlichungen der Universität Potsdam : Reihe der Digital Engineering Fakultät
N2  - Background
Reproducible benchmarking is important for assessing the effectiveness of novel feature selection approaches applied on gene expression data, especially for prior knowledge approaches that incorporate biological information from online knowledge bases. However, no full-fledged benchmarking system exists that is extensible, provides built-in feature selection approaches, and a comprehensive result assessment encompassing classification performance, robustness, and biological relevance. Moreover, the particular needs of prior knowledge feature selection approaches, i.e. uniform access to knowledge bases, are not addressed. As a consequence, prior knowledge approaches are not evaluated amongst each other, leaving open questions regarding their effectiveness.

Results
We present the Comprior benchmark tool, which facilitates the rapid development and effortless benchmarking of feature selection approaches, with a special focus on prior knowledge approaches. Comprior is extensible by custom approaches, offers built-in standard feature selection approaches, enables uniform access to multiple knowledge bases, and provides a customizable evaluation infrastructure to compare multiple feature selection approaches regarding their classification performance, robustness, runtime, and biological relevance.

Conclusion
Comprior allows reproducible benchmarking especially of prior knowledge approaches, which facilitates their applicability and for the first time enables a comprehensive assessment of their effectiveness.
T3  - Zweitveröffentlichungen der Universität Potsdam : Reihe der Digital Engineering Fakultät - 010
KW  - Feature selection
KW  - Prior knowledge
KW  - Gene expression
KW  - Reproducible benchmarking
Y1  - 2022
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:517-opus4-548943
SP  - 1
EP  - 15
PB  - Universitätsverlag Potsdam
CY  - Potsdam
ER  - 
TY  - JOUR
A1  - Perscheid, Cindy
T1  - Comprior
BT  - Facilitating the implementation and automated benchmarking of prior knowledge-based feature selection approaches on gene expression data sets
JF  - BMC Bioinformatics
N2  - Background
Reproducible benchmarking is important for assessing the effectiveness of novel feature selection approaches applied on gene expression data, especially for prior knowledge approaches that incorporate biological information from online knowledge bases. However, no full-fledged benchmarking system exists that is extensible, provides built-in feature selection approaches, and a comprehensive result assessment encompassing classification performance, robustness, and biological relevance. Moreover, the particular needs of prior knowledge feature selection approaches, i.e. uniform access to knowledge bases, are not addressed. As a consequence, prior knowledge approaches are not evaluated amongst each other, leaving open questions regarding their effectiveness.

Results
We present the Comprior benchmark tool, which facilitates the rapid development and effortless benchmarking of feature selection approaches, with a special focus on prior knowledge approaches. Comprior is extensible by custom approaches, offers built-in standard feature selection approaches, enables uniform access to multiple knowledge bases, and provides a customizable evaluation infrastructure to compare multiple feature selection approaches regarding their classification performance, robustness, runtime, and biological relevance.

Conclusion
Comprior allows reproducible benchmarking especially of prior knowledge approaches, which facilitates their applicability and for the first time enables a comprehensive assessment of their effectiveness.
KW  - Feature selection
KW  - Prior knowledge
KW  - Gene expression
KW  - Reproducible benchmarking
Y1  - 2021
U6  - https://doi.org/10.1186/s12859-021-04308-z
SN  - 1471-2105
VL  - 22
SP  - 1
EP  - 15
PB  - Springer Nature
CY  - London
ER  - 
TY  - GEN
A1  - Perscheid, Cindy
A1  - Faber, Lukas
A1  - Kraus, Milena
A1  - Arndt, Paul
A1  - Janke, Michael
A1  - Rehfeldt, Sebastian
A1  - Schubotz, Antje
A1  - Slosarek, Tamara
A1  - Uflacker, Matthias
T1  - A tissue-aware gene selection approach for analyzing multi-tissue gene expression data
T2  - 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
N2  - High-throughput RNA sequencing (RNAseq) produces large data sets containing expression levels of thousands of genes. The analysis of RNAseq data leads to a better understanding of gene functions and interactions, which eventually helps to study diseases like cancer and develop effective treatments. Large-scale RNAseq expression studies on cancer comprise samples from multiple cancer types and aim to identify their distinct molecular characteristics. Analyzing samples from different cancer types implies analyzing samples from different tissue origin. Such multi-tissue RNAseq data sets require a meaningful analysis that accounts for the inherent tissue-related bias: The identified characteristics must not originate from the differences in tissue types, but from the actual differences in cancer types. However, current analysis procedures do not incorporate that aspect. As a result, we propose to integrate a tissue-awareness into the analysis of multi-tissue RNAseq data. We introduce an extension for gene selection that provides a tissue-wise context for every gene and can be flexibly combined with any existing gene selection approach. We suggest to expand conventional evaluation by additional metrics that are sensitive to the tissue-related bias. Evaluations show that especially low complexity gene selection approaches profit from introducing tissue-awareness.
KW  - RNAseq
KW  - gene selection
KW  - tissue-awareness
KW  - TCGA
KW  - GTEx
Y1  - 2018
SN  - 978-1-5386-5488-0
U6  - https://doi.org/10.1109/BIBM.2018.8621189
SN  - 2156-1125
SN  - 2156-1133
SP  - 2159
EP  - 2166
PB  - IEEE
CY  - New York
ER  - 
TY  - JOUR
A1  - Perscheid, Cindy
A1  - Grasnick, Bastien
A1  - Uflacker, Matthias
T1  - Integrative Gene Selection on Gene Expression Data
BT  - Providing Biological Context to Traditional Approaches
JF  - Journal of Integrative Bioinformatics
N2  - The advance of high-throughput RNA-Sequencing techniques enables researchers to analyze the complete gene activity in particular cells. From the insights of such analyses, researchers can identify disease-specific expression profiles, thus understand complex diseases like cancer, and eventually develop effective measures for diagnosis and treatment. The high dimensionality of gene expression data poses challenges to its computational analysis, which is addressed with measures of gene selection. Traditional gene selection approaches base their findings on statistical analyses of the actual expression levels, which implies several drawbacks when it comes to accurately identifying the underlying biological processes. In turn, integrative approaches include curated information on biological processes from external knowledge bases during gene selection, which promises to lead to better interpretability and improved predictive performance. Our work compares the performance of traditional and integrative gene selection approaches. Moreover, we propose a straightforward approach to integrate external knowledge with traditional gene selection approaches. We introduce a framework enabling the automatic external knowledge integration, gene selection, and evaluation. Evaluation results prove our framework to be a useful tool for evaluation and show that integration of external knowledge improves overall analysis results.
KW  - Gene Expression Data Analysis
KW  - Integrative Gene Selection
KW  - Pattern Recognition
KW  - Prior Knowledge
KW  - Knowledge Bases
Y1  - 2019
U6  - https://doi.org/10.1515/jib-2018-0064
SN  - 1613-4516
VL  - 16
IS  - 1
PB  - De Gruyter
CY  - Berlin
ER  -