Information about the counts of attribute combinations is central to profiling a data set: it may reveal bias, and it can help determine fitness for use. While counts of individual attribute values may be stored in some data set profiles, there are too many combinations of attributes for it to be practical to store a count for each combination. To this end, we present the notion of storing a “label” of limited size that can be used to obtain good estimates for these counts. A label contains information regarding the counts of selected patterns (combinations of attribute values) in the data. We define an estimation function that uses this label to estimate the count of every pattern. Intuitively, there is a trade-off between the label size and its estimation error. We propose a demonstration of Countata, a system that allows the user to examine this trade-off as well as the label’s count information. We will demonstrate the usefulness of Countata using real-life data and illustrate the effectiveness of our estimation paradigm.
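To make the idea of a label and an estimation function concrete, the following is a minimal sketch and not Countata's actual method. The abstract does not specify the estimation function, so this sketch assumes a simple one: the label stores exact counts for value combinations over a chosen attribute subset plus per-value counts, and attributes outside that subset are treated as independent. The names `build_label`, `estimate_count`, and the toy data are hypothetical.

```python
from collections import Counter

def build_label(rows, label_attrs):
    """Build a small label: the row total, per-value counts, and exact counts
    for every value combination over label_attrs (an assumed design)."""
    value_counts = Counter()
    for row in rows:
        for attr, val in row.items():
            value_counts[(attr, val)] += 1
    pattern_counts = Counter(tuple(row[a] for a in label_attrs) for row in rows)
    return {
        "total": len(rows),
        "attrs": tuple(label_attrs),
        "value_counts": value_counts,
        "pattern_counts": pattern_counts,
    }

def estimate_count(label, pattern):
    """Estimate the count of a pattern (dict of attribute -> value).

    Attributes covered by the label are answered from the stored counts;
    the remaining attributes are assumed independent (an assumption of
    this sketch) and contribute their marginal frequency."""
    total = label["total"]
    if total == 0:
        return 0.0
    attrs = label["attrs"]
    covered = [i for i, a in enumerate(attrs) if a in pattern]
    # Sum stored counts that agree with the pattern on the covered attributes.
    # With no covered attributes, every stored pattern matches, giving the total.
    base = sum(
        count
        for values, count in label["pattern_counts"].items()
        if all(values[i] == pattern[attrs[i]] for i in covered)
    )
    estimate = float(base)
    for attr, val in pattern.items():
        if attr not in attrs:
            estimate *= label["value_counts"][(attr, val)] / total
    return estimate

# Toy data, invented purely for illustration.
rows = [
    {"gender": "F", "race": "A", "age": "young"},
    {"gender": "F", "race": "A", "age": "old"},
    {"gender": "F", "race": "B", "age": "young"},
    {"gender": "M", "race": "A", "age": "young"},
]
label = build_label(rows, ("gender", "race"))

exact = estimate_count(label, {"gender": "F", "race": "A"})
approx = estimate_count(label, {"gender": "F", "race": "A", "age": "young"})
print(exact, approx)
```

In this toy data the pattern over `gender` and `race` is answered exactly from the label, while adding `age` yields an estimate of 1.5 against a true count of 1. Storing more attributes in the label would shrink that error at the cost of a larger label, which is the trade-off the abstract describes.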