Sismanis, Y. and Brown, P. and Haas, P.J. and Reinwald, B. (2006) GORDIAN: efficient and scalable discovery of composite keys. In: Dayal, U. and Whang, K.Y. and Lomet, D.B. and Alonso, G. and Lohman, G.M. and Kersten, M.L. and Kim, Y.-K. (eds.) Proceedings of the 32nd International Conference on Very Large Data Bases. Association for Computing Machinery, pp. 691-702.
Abstract
Identification of (composite) key attributes is of fundamental importance for many different data management tasks such as data modeling, data integration, anomaly detection, query formulation, query optimization, and indexing. However, information about keys is often missing or incomplete in many real-world database scenarios. Surprisingly, the fundamental problem of automatic key discovery has received little attention in the existing literature. Existing solutions ignore composite keys, due to the complexity associated with their discovery. Even for simple keys, current algorithms take a brute-force approach; the resulting exponential CPU and memory requirements limit the applicability of these methods to small datasets. In this paper, we describe GORDIAN, a scalable algorithm for automatic discovery of keys in large datasets, including composite keys. GORDIAN can provide exact results very efficiently for both real-world and synthetic datasets. GORDIAN can be used to find (composite) key attributes in any collection of entities, e.g., key column-groups in relational data, or key leaf-node sets in a collection of XML documents with a common schema. We show empirically that GORDIAN can be combined with sampling to efficiently obtain high-quality sets of approximate keys even in very large datasets.
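To make the problem statement concrete, the following is a minimal sketch of the naive brute-force approach that the abstract contrasts GORDIAN with. It is not the GORDIAN algorithm. The function name `find_minimal_keys`, the column names, and the toy rows are all hypothetical and chosen only for illustration. The sketch enumerates every column combination, which is why its cost grows exponentially with the number of attributes.

```python
from itertools import combinations


def find_minimal_keys(rows, columns):
    """Brute-force search for minimal (composite) keys.

    Illustrative only: this is NOT GORDIAN. It tests every column
    combination, so the work grows exponentially with len(columns).
    """
    found = []  # minimal keys discovered so far, as tuples of column indices
    for size in range(1, len(columns) + 1):
        for combo in combinations(range(len(columns)), size):
            # A superset of a known key is also unique, but not minimal.
            if any(set(key) <= set(combo) for key in found):
                continue
            # The combination is a key if its projection has no duplicates.
            projected = {tuple(row[i] for i in combo) for row in rows}
            if len(projected) == len(rows):
                found.append(combo)
    return [tuple(columns[i] for i in key) for key in found]


# Hypothetical toy dataset (assumption, not taken from the paper).
columns = ["first", "last", "phone", "emp_no"]
rows = [
    ("Ann", "Lee", 100, 1),
    ("Bob", "Kim", 100, 2),
    ("Ann", "Park", 200, 3),
    ("Ann", "Lee", 300, 4),
]

keys = find_minimal_keys(rows, columns)
print(keys)
```

On this toy data, `emp_no` is a simple key, while `(first, phone)` and `(last, phone)` are composite keys that no single-column check would find.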
Metadata
| Field | Value |
|---|---|
| Item Type | Book Section |
| School | Birkbeck Faculties and Schools > Faculty of Science > School of Computing and Mathematical Sciences |
| Depositing User | Sarah Hall |
| Date Deposited | 23 Feb 2021 19:45 |
| Last Modified | 09 Aug 2023 12:50 |
| URI | https://eprints.bbk.ac.uk/id/eprint/43158 |