Patterns count-based labels for datasets

Yuval Moskovitch, H. V. Jagadish

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Scopus citations


Counts of attribute-value combinations are central to the profiling of a data set, particularly in determining fitness for use and in eliminating bias and unfairness. While counts of individual attribute values may be stored in some data set profiles, there are too many combinations of attributes for it to be practical to store counts for each combination. In this paper, we develop the notion of storing a "label" of limited size that can be used to obtain good estimates for these counts. A label, in this paper, contains information regarding the count of selected attribute-value combinations (which we call "patterns") in the data. We define an estimation function that uses this label to estimate the count of every pattern. We present the problem of finding the optimal label given a bound on its size and propose a heuristic algorithm for generating optimal labels. We experimentally show the accuracy of count estimates derived from the resulting labels and the efficiency of our algorithm.
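
To make the idea of a label and an estimation function concrete, here is a minimal sketch in Python. It is not the paper's definition: it assumes a label that stores exact joint counts for one chosen subset of attributes plus per-attribute value counts, and it assumes the remaining attributes are independent when estimating. The names `build_label`, `estimate_count`, and the toy attributes are hypothetical.

```python
from collections import Counter

def build_label(rows, label_attrs):
    """Build a small label from rows (a list of dicts).

    Assumption: the label keeps exact joint counts only for `label_attrs`,
    plus the count of every individual attribute value.
    """
    value_counts = {}
    for row in rows:
        for attr, val in row.items():
            value_counts.setdefault(attr, Counter())[val] += 1
    pattern_counts = Counter(tuple(row[a] for a in label_attrs) for row in rows)
    return {
        "n": len(rows),
        "attrs": tuple(label_attrs),
        "value_counts": value_counts,
        "pattern_counts": pattern_counts,
    }

def estimate_count(label, pattern):
    """Estimate how many rows match `pattern` (a dict attr -> value).

    Stored joint counts are used where the label covers the pattern;
    attributes outside the label are scaled in under an independence
    assumption (an illustrative choice, not taken from the source).
    """
    attrs = label["attrs"]
    # Sum stored counts that agree with the pattern on the label attributes.
    base = sum(
        count
        for key, count in label["pattern_counts"].items()
        if all(pattern.get(a, v) == v for a, v in zip(attrs, key))
    )
    estimate = float(base)
    for attr, val in pattern.items():
        if attr not in attrs:
            estimate *= label["value_counts"][attr][val] / label["n"]
    return estimate

# Toy data set (hypothetical values).
rows = [
    {"gender": "F", "race": "A", "age": "young"},
    {"gender": "F", "race": "A", "age": "old"},
    {"gender": "F", "race": "B", "age": "young"},
    {"gender": "M", "race": "A", "age": "young"},
]
label = build_label(rows, ("gender", "race"))
exact = estimate_count(label, {"gender": "F", "race": "A"})
approx = estimate_count(label, {"gender": "F", "race": "A", "age": "young"})
```

In this toy example the pattern covered by the label is recovered exactly (2 rows), while the three-attribute pattern is estimated as 1.5 against a true count of 1. That gap is the kind of estimation error the choice of label is meant to keep small.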

Original language: English
Title of host publication: Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
Publisher: Institute of Electrical and Electronics Engineers
Number of pages: 6
ISBN (Electronic): 9781728191843
State: Published - 1 Apr 2021
Externally published: Yes
Event: 37th IEEE International Conference on Data Engineering, ICDE 2021 - Virtual, Chania, Greece
Duration: 19 Apr 2021 to 22 Apr 2021

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627


Conference: 37th IEEE International Conference on Data Engineering, ICDE 2021
City: Virtual, Chania

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

