Minimising Information Loss on Anonymised High Dimensional Data with Greedy In-Memory Processing

Podlesny, Nikolai Jannik; Kayem, Anne V. D. M.; von Schorlemer, Stephan; Uflacker, Matthias

doi:10.1007/978-3-319-98809-2_6

Minimising information loss on anonymised high dimensional data is important for data utility. Syntactic data anonymisation algorithms address this issue by generating datasets that are neither use-case specific nor dependent on runtime specifications. This results in anonymised datasets that can be re-used in different scenarios, which is performance efficient. However, syntactic data anonymisation algorithms incur high information loss on high dimensional data, making the data unusable for analytics. In this paper, we propose an optimised exact quasi-identifier identification scheme, based on the notion of k-anonymity, to generate anonymised high dimensional datasets efficiently and with low information loss. The optimised exact quasi-identifier identification scheme works by identifying and eliminating maximal partial unique column combination (mpUCC) attributes that endanger anonymity. By using in-memory processing to handle the attribute selection procedure, we significantly reduce the processing time required. We evaluated the effectiveness of our proposed approach with an enriched dataset drawn from multiple real-world data sources, and augmented with synthetic values generated in close alignment with the real-world data distributions. Our results indicate that in-memory processing reduces attribute selection time for the mpUCC candidates from 400 s to 100 s, while significantly reducing information loss. In addition, we achieve a time complexity speed-up of $O(3^{n/3}) \approx O(1.4422^{n})$.
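To illustrate what identifying column combinations that endanger k-anonymity can look like, here is a minimal brute-force sketch. It is not the optimised scheme proposed in the paper. The function name `violating_combinations`, the sample rows, and the column names are assumptions made up for this illustration. The sketch reports the smallest column combinations for which at least one value tuple occurs in fewer than k rows.

```python
from collections import Counter
from itertools import combinations


def violating_combinations(rows, columns, k=2, max_size=None):
    """Return minimal column combinations that break k-anonymity.

    A combination is reported when at least one of its value tuples
    appears in fewer than k rows. Hypothetical helper, not the paper's code.
    """
    max_size = max_size or len(columns)
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip supersets of combinations already reported, so that
            # only minimal combinations are returned.
            if any(set(prev) <= set(combo) for prev in found):
                continue
            counts = Counter(tuple(row[c] for c in combo) for row in rows)
            if min(counts.values()) < k:
                found.append(combo)
    return found


# Made-up sample data: no single column isolates a row,
# but every pair of columns does.
sample_rows = [
    {"zip": "14482", "age": 34, "sex": "f"},
    {"zip": "14482", "age": 34, "sex": "m"},
    {"zip": "14482", "age": 51, "sex": "f"},
    {"zip": "10115", "age": 51, "sex": "m"},
    {"zip": "10115", "age": 34, "sex": "f"},
]

result = violating_combinations(sample_rows, ["zip", "age", "sex"], k=2)
print(result)
```

The number of candidate combinations in such an exhaustive search grows exponentially with the number of columns, which is why high dimensional data calls for a more efficient search strategy.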

Author details:	Nikolai Jannik Podlesny, Anne V. D. M. Kayem, Stephan von Schorlemer, Matthias Uflacker
DOI:	https://doi.org/10.1007/978-3-319-98809-2_6
ISBN:	978-3-319-98809-2
ISBN:	978-3-319-98808-5
ISSN:	0302-9743
ISSN:	1611-3349
Title of parent work (English):	Database and Expert Systems Applications, DEXA 2018, PT I
Publisher:	Springer
Place of publishing:	Cham
Publication type:	Other
Language:	English
Date of first publication:	2018/08/09
Publication year:	2018
Release date:	2022/02/24
Volume:	11029
Number of pages:	16
First page:	85
Last Page:	100
Organizational units:	Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification:	0 Computer science, information and general works / 00 Computer science, knowledge and systems / 000 Computer science, information and general works
Peer review:	Refereed
