TY  - GEN
A1  - Podlesny, Nikolai Jannik
A1  - Kayem, Anne V. D. M.
A1  - von Schorlemer, Stephan
A1  - Uflacker, Matthias
T1  - Minimising Information Loss on Anonymised High Dimensional Data with Greedy In-Memory Processing
T2  - Database and Expert Systems Applications, DEXA 2018, PT I
N2  - Minimising information loss on anonymised high dimensional data is important for data utility. Syntactic data anonymisation algorithms address this issue by generating datasets that are neither use-case specific nor dependent on runtime specifications. This results in anonymised datasets that can be re-used in different scenarios which is performance efficient. However, syntactic data anonymisation algorithms incur high information loss on high dimensional data, making the data unusable for analytics. In this paper, we propose an optimised exact quasi-identifier identification scheme, based on the notion of k-anonymity, to generate anonymised high dimensional datasets efficiently, and with low information loss. The optimised exact quasi-identifier identification scheme works by identifying and eliminating maximal partial unique column combination (mpUCC) attributes that endanger anonymity. By using in-memory processing to handle the attribute selection procedure, we significantly reduce the processing time required. We evaluated the effectiveness of our proposed approach with an enriched dataset drawn from multiple real-world data sources, and augmented with synthetic values generated in close alignment with the real-world data distributions. Our results indicate that in-memory processing drops attribute selection time for the mpUCC candidates from 400s to 100s, while significantly reducing information loss. In addition, we achieve a time complexity speed-up of O(3(n/3)) approximate to O(1.4422(n)).
Y1  - 2018
SN  - 978-3-319-98809-2
SN  - 978-3-319-98808-5
U6  - https://doi.org/10.1007/978-3-319-98809-2_6
SN  - 0302-9743
SN  - 1611-3349
VL  - 11029
SP  - 85
EP  - 100
PB  - Springer
CY  - Cham
ER  - 
TY  - GEN
A1  - Perscheid, Cindy
A1  - Faber, Lukas
A1  - Kraus, Milena
A1  - Arndt, Paul
A1  - Janke, Michael
A1  - Rehfeldt, Sebastian
A1  - Schubotz, Antje
A1  - Slosarek, Tamara
A1  - Uflacker, Matthias
T1  - A tissue-aware gene selection approach for analyzing multi-tissue gene expression data
T2  - 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
N2  - High-throughput RNA sequencing (RNAseq) produces large data sets containing expression levels of thousands of genes. The analysis of RNAseq data leads to a better understanding of gene functions and interactions, which eventually helps to study diseases like cancer and develop effective treatments. Large-scale RNAseq expression studies on cancer comprise samples from multiple cancer types and aim to identify their distinct molecular characteristics. Analyzing samples from different cancer types implies analyzing samples from different tissue origin. Such multi-tissue RNAseq data sets require a meaningful analysis that accounts for the inherent tissue-related bias: The identified characteristics must not originate from the differences in tissue types, but from the actual differences in cancer types. However, current analysis procedures do not incorporate that aspect. As a result, we propose to integrate a tissue-awareness into the analysis of multi-tissue RNAseq data. We introduce an extension for gene selection that provides a tissue-wise context for every gene and can be flexibly combined with any existing gene selection approach. We suggest to expand conventional evaluation by additional metrics that are sensitive to the tissue-related bias. Evaluations show that especially low complexity gene selection approaches profit from introducing tissue-awareness.
KW  - RNAseq
KW  - gene selection
KW  - tissue-awareness
KW  - TCGA
KW  - GTEx
Y1  - 2018
SN  - 978-1-5386-5488-0
U6  - https://doi.org/10.1109/BIBM.2018.8621189
SN  - 2156-1125
SN  - 2156-1133
SP  - 2159
EP  - 2166
PB  - IEEE
CY  - New York
ER  -