TY - GEN
T1 - Patterns count-based labels for datasets
AU - Moskovitch, Yuval
AU - Jagadish, H. V.
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021/4/1
Y1 - 2021/4/1
N2 - Counts of attribute-value combinations are central to the profiling of a data set, particularly in determining fitness for use and in eliminating bias and unfairness. While counts of individual attribute values may be stored in some data set profiles, there are too many combinations of attributes for it to be practical to store counts for each combination. In this paper, we develop the notion of storing a "label" of limited size that can be used to obtain good estimates for these counts. A label, in this paper, contains information regarding the count of selected attribute-value combinations (which we call "patterns") in the data. We define an estimation function that uses this label to estimate the count of every pattern. We present the problem of finding the optimal label given a bound on its size and propose a heuristic algorithm for generating optimal labels. We experimentally show the accuracy of count estimates derived from the resulting labels and the efficiency of our algorithm.
AB - Counts of attribute-value combinations are central to the profiling of a data set, particularly in determining fitness for use and in eliminating bias and unfairness. While counts of individual attribute values may be stored in some data set profiles, there are too many combinations of attributes for it to be practical to store counts for each combination. In this paper, we develop the notion of storing a "label" of limited size that can be used to obtain good estimates for these counts. A label, in this paper, contains information regarding the count of selected attribute-value combinations (which we call "patterns") in the data. We define an estimation function that uses this label to estimate the count of every pattern. We present the problem of finding the optimal label given a bound on its size and propose a heuristic algorithm for generating optimal labels. We experimentally show the accuracy of count estimates derived from the resulting labels and the efficiency of our algorithm.
UR - http://www.scopus.com/inward/record.url?scp=85112865775&partnerID=8YFLogxK
U2 - 10.1109/ICDE51399.2021.00184
DO - 10.1109/ICDE51399.2021.00184
M3 - Conference contribution
AN - SCOPUS:85112865775
T3 - Proceedings - International Conference on Data Engineering
SP - 1961
EP - 1966
BT - Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
PB - Institute of Electrical and Electronics Engineers
T2 - 37th IEEE International Conference on Data Engineering, ICDE 2021
Y2 - 19 April 2021 through 22 April 2021
ER -