TY  - JOUR
A1  - Datta, Suparno
A1  - Sachs, Jan Philipp
A1  - Freitas da Cruz, Harry
A1  - Martensen, Tom
A1  - Bode, Philipp
A1  - Morassi Sasso, Ariane
A1  - Glicksberg, Benjamin S.
A1  - Böttinger, Erwin
T1  - FIBER: enabling flexible retrieval of electronic health records data for clinical predictive modeling
JF  - JAMIA open
N2  - Objectives: 
The development of clinical predictive models hinges upon the availability of comprehensive clinical data. Tapping into such resources requires considerable effort from clinicians, data scientists, and engineers. Specifically, these efforts are focused on data extraction and preprocessing steps required prior to modeling, including complex database queries. A handful of software libraries exist that can reduce this complexity by building upon data standards. However, a gap remains concerning electronic health records (EHRs) stored in star schema clinical data warehouses, an approach often adopted in practice. In this article, we introduce the FlexIBle EHR Retrieval (FIBER) tool: a Python library built on top of a star schema (i2b2) clinical data warehouse that enables flexible generation of modeling-ready cohorts as data frames. 

Materials and Methods: 
FIBER was developed on top of a large-scale star schema EHR database which contains data from 8 million patients and over 120 million encounters. To illustrate FIBER's capabilities, we present its application by building a heart surgery patient cohort with subsequent prediction of acute kidney injury (AKI) with various machine learning models. 

Results:
Using FIBER, we were able to build the heart surgery cohort (n = 12 061), identify the patients that developed AKI (n = 1005), and automatically extract relevant features (n = 774). Finally, we trained machine learning models that achieved area under the curve values of up to 0.77 for this exemplary use case.

Conclusion: 
FIBER is an open-source Python library developed for extracting information from star schema clinical data warehouses and reduces time-to-modeling, helping to streamline the clinical modeling process.
KW  - databases, factual
KW  - electronic health records
KW  - information storage and retrieval
KW  - workflow
KW  - software/instrumentation
Y1  - 2021
U6  - https://doi.org/10.1093/jamiaopen/ooab048
SN  - 2574-2531
VL  - 4
IS  - 3
PB  - Oxford Univ. Press
CY  - Oxford
ER  - 
TY  - JOUR
A1  - Gronau, Norbert
A1  - Schaefer, Martin
T1  - Why metadata matters for the future of copyright
JF  - European Intellectual Property Review
N2  - In the copyright industries of the 21st century, metadata is the grease required to make the engine of copyright run smoothly and powerfully for the benefit of creators, copyright industries and users alike. However, metadata is difficult to acquire and even more difficult to keep up to date as the rights in content are mostly multi-layered, fragmented, international and volatile. This article explores the idea of a neutral metadata search and enhancement tool that could constitute a buffer to safeguard the interests of the various proprietary database owners and avoid the shortcomings of centralised databases.
KW  - copyright
KW  - databases
KW  - metadata
KW  - music industry
Y1  - 2021
SN  - 0142-0461
VL  - 43
IS  - 8
SP  - 488
EP  - 494
PB  - Sweet & Maxwell
CY  - London
ER  - 
TY  - JOUR
A1  - Caruccio, Loredana
A1  - Deufemia, Vincenzo
A1  - Naumann, Felix
A1  - Polese, Giuseppe
T1  - Discovering relaxed functional dependencies based on multi-attribute dominance
JF  - IEEE transactions on knowledge and data engineering
N2  - With the advent of big data and data lakes, data are often integrated from multiple sources. Such integrated data are often of poor quality, due to inconsistencies, errors, and so forth. One way to check the quality of data is to infer functional dependencies (fds). However, in many modern applications it might be necessary to extract properties and relationships that are not captured through fds, due to the necessity to admit exceptions, or to consider similarity rather than equality of data values. Relaxed fds (rfds) have been introduced to meet these needs, but their discovery from data adds further complexity to an already complex problem, also due to the necessity of specifying similarity and validity thresholds. We propose Domino, a new discovery algorithm for rfds that exploits the concept of dominance in order to derive similarity thresholds of attribute values while inferring rfds. An experimental evaluation on real datasets demonstrates the discovery performance and the effectiveness of the proposed algorithm.
KW  - Complexity theory
KW  - Approximation algorithms
KW  - Big Data
KW  - Distributed databases
KW  - Semantics
KW  - Lakes
KW  - Functional dependencies
KW  - data profiling
KW  - data cleansing
Y1  - 2020
U6  - https://doi.org/10.1109/TKDE.2020.2967722
SN  - 1041-4347
SN  - 1558-2191
VL  - 33
IS  - 9
SP  - 3212
EP  - 3228
PB  - Institute of Electrical and Electronics Engineers
CY  - New York, NY
ER  - 
TY  - JOUR
A1  - Klie, Sebastian
A1  - Nikoloski, Zoran
A1  - Selbig, Joachim
T1  - Biological cluster evaluation for gene function prediction
JF  - Journal of computational biology
N2  - Recent advances in high-throughput omics techniques render it possible to decode the function of genes by using the "guilt-by-association" principle on biologically meaningful clusters of gene expression data. However, the existing frameworks for biological evaluation of gene clusters are hindered by two bottleneck issues: (1) the choice for the number of clusters, and (2) the external measures which do not take into consideration the structure of the analyzed data and the ontology of the existing biological knowledge. Here, we address the identified bottlenecks by developing a novel framework that allows not only for biological evaluation of gene expression clusters based on existing structured knowledge, but also for prediction of putative gene functions. The proposed framework facilitates propagation of statistical significance at each of the following steps: (1) estimating the number of clusters, (2) evaluating the clusters in terms of novel external structural measures, (3) selecting an optimal clustering algorithm, and (4) predicting gene functions. The framework also includes a method for evaluation of gene clusters based on the structure of the employed ontology. Moreover, our method for obtaining a probabilistic range for the number of clusters is shown to be valid on synthetic data and available gene expression profiles from Saccharomyces cerevisiae. Finally, we propose a network-based approach for gene function prediction which relies on the clustering of optimal score and the employed ontology. Our approach effectively predicts gene function on the Saccharomyces cerevisiae data set and is also employed to obtain putative gene functions for an Arabidopsis thaliana data set.
KW  - algorithms
KW  - biochemical networks
KW  - combinatorics
KW  - computational molecular biology
KW  - databases
KW  - functional genomics
KW  - gene expression
KW  - NP-completeness
Y1  - 2014
U6  - https://doi.org/10.1089/cmb.2009.0129
SN  - 1066-5277
SN  - 1557-8666
VL  - 21
IS  - 6
SP  - 428
EP  - 445
PB  - Liebert
CY  - New Rochelle
ER  -