Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem with many applications in data management and knowledge discovery. Existing discovery algorithms either use brute force or have a high memory load, and can thus be applied only to small datasets or samples. In this paper, the well-known GORDIAN algorithm and "Apriori-based" algorithms are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCAGORDIAN, combines the advantages of GORDIAN and our new algorithm HCA, and it significantly outperforms all previous work in many situations.
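
To make the notion of a unique column combination concrete, the following is a minimal sketch of a generic level-wise (Apriori-style) search for minimal uniques. It is not the HCA or HCAGORDIAN algorithm from the paper, and it omits the statistics-based pruning mentioned above; the function names `is_unique` and `minimal_uniques` and the row-tuple table representation are assumptions made for illustration only.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows agree on all columns in `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uniques(rows, num_cols):
    """Level-wise search for minimal unique column combinations.

    Illustrative sketch only: columns are referred to by index, and
    every candidate is verified by a full scan over `rows`.
    """
    uniques = []
    # Level 1: every single column is a candidate.
    candidates = [(c,) for c in range(num_cols)]
    while candidates:
        non_uniques = []
        for cand in candidates:
            if is_unique(rows, cand):
                uniques.append(cand)
            else:
                non_uniques.append(cand)
        non_unique_set = set(non_uniques)
        next_level = []
        # Join two non-unique combinations that share all but the last column.
        for a, b in combinations(non_uniques, 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Keep the candidate only if every subset one column smaller
                # is non-unique; otherwise it cannot be a minimal unique.
                if all(sub in non_unique_set
                       for sub in combinations(cand, len(a))):
                    next_level.append(cand)
        candidates = next_level
    return uniques
```

For example, in the table `[("a", 1, "x"), ("a", 2, "y"), ("b", 1, "y")]` no single column is unique, but every pair of columns is, so the sketch returns `[(0, 1), (0, 2), (1, 2)]`. Because each candidate requires a scan of the table, this naive variant illustrates why the pruning and candidate-generation improvements described in the abstract matter for larger datasets.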