Improving the efficiency of inclusion dependency detection

Shaabani, Nuhad; Meinel, Christoph

doi:10.1145/3269206.3271724

The detection of all inclusion dependencies (INDs) in an unknown dataset is at the core of any data profiling effort. Apart from the discovery of foreign key relationships, INDs can help perform data integration, integrity checking, schema (re-)design, and query optimization. With the advent of Big Data, demand is increasing for efficient IND discovery algorithms that can scale with the input data size. To this end, we propose S-INDD++ as a scalable system for detecting unary INDs in large datasets. S-INDD++ applies a new stepwise partitioning technique that helps discard a large number of attributes in the early phases of the detection by processing the first, smaller partitions. S-INDD++ also extends the concept of attribute clustering to decide which attributes to discard based on the clustering result of each partition. Moreover, in contrast to the state of the art, S-INDD++ does not require a partition to fit into main memory, which is a valuable property in the face of ever-growing datasets. We conducted an extensive evaluation of S-INDD++ by applying it to large datasets with thousands of attributes and more than 266 million tuples. The results show that S-INDD++ clearly outperforms the state of the art: it reduced the runtime by up to 50% compared with BINDER, and by up to 98% compared with S-INDD.
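To make the terms in the abstract concrete, the following is a minimal Python sketch of unary IND detection by value partitioning and attribute clustering. It is not the S-INDD++ implementation and does not reproduce its stepwise partitioning or out-of-memory handling; the function name `unary_inds`, the parameter `num_partitions`, the hash-based partitioning, and the assumption that all values are strings are illustrative choices only.

```python
from collections import defaultdict
from zlib import crc32


def unary_inds(columns, num_partitions=4):
    """Return all pairs (A, B) with A != B such that values(A) is a subset of values(B).

    `columns` maps a (hypothetical) attribute name to an iterable of string
    values; None is treated as NULL and ignored.
    """
    names = list(columns)
    # Start by assuming every attribute may be included in every other one.
    candidates = {a: set(names) - {a} for a in names}

    # Partition values by a deterministic hash so that equal values from
    # different attributes always land in the same partition.
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for attr, values in columns.items():
        for value in values:
            if value is None:
                continue
            p = crc32(value.encode("utf-8")) % num_partitions
            # The "attribute cluster" of a value is the set of attributes
            # in which that value occurs.
            partitions[p][value].add(attr)

    # Process one partition at a time. An attribute can only be included in
    # attributes that share every one of its values, so each cluster shrinks
    # the candidate sets of its members.
    for clusters in partitions:
        for cluster in clusters.values():
            for attr in cluster:
                if candidates[attr]:  # attributes with no candidates left are skipped
                    candidates[attr] &= cluster

    return {(a, b) for a, refs in candidates.items() for b in refs}


example = {
    "orders.customer_id": ["1", "2", "2"],
    "customers.id": ["1", "2", "3"],
    "customers.name": ["ann", "bob", "cy"],
}
result = unary_inds(example)
```

In this made-up example, `result` contains only the pair `("orders.customer_id", "customers.id")`, which is the kind of dependency that suggests a foreign key. Because each partition is processed independently, an attribute whose candidate set becomes empty in an early partition needs no further work in later ones, which is the general intuition behind discarding attributes early.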

Authors:	Nuhad Shaabani, Christoph Meinel
DOI:	https://doi.org/10.1145/3269206.3271724
ISBN:	978-1-4503-6014-2
Title of parent work (English):	Proceedings of the 27th ACM International Conference on Information and Knowledge Management
Publisher:	Association for Computing Machinery
Place of publication:	New York
Publication type:	Other
Language:	English
Date of first publication:	17 October 2018
Publication year:	2018
Release date:	2 March 2022
Free keywords / tags:	Algorithms; Data mining; Data partitioning; Data profiling
Number of pages:	10
First page:	207
Last page:	216
Organizational units:	Digital Engineering Fakultät / Hasso-Plattner-Institut für Digital Engineering GmbH
DDC classification:	0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 000 Computer science, information science, general works
Peer review:	Refereed
