Patterns count-based labels for datasets

Yuval Moskovitch, H. V. Jagadish

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Counts of attribute-value combinations are central to the profiling of a data set, particularly in determining fitness for use and in eliminating bias and unfairness. While counts of individual attribute values may be stored in some data set profiles, there are too many combinations of attributes for it to be practical to store counts for each combination. In this paper, we develop the notion of storing a "label" of limited size that can be used to obtain good estimates for these counts. A label, in this paper, contains information regarding the count of selected attribute-value combinations (which we call "patterns") in the data. We define an estimation function that uses this label to estimate the count of every pattern. We present the problem of finding the optimal label given a bound on its size and propose a heuristic algorithm for generating optimal labels. We experimentally show the accuracy of count estimates derived from the resulting labels and the efficiency of our algorithm.
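To make the idea of a label and an estimation function concrete, the following is a minimal Python sketch. It is not the paper's definition: the abstract does not specify the label's contents or the estimation function, so the sketch assumes a label that stores per-value counts for every attribute plus joint counts over a chosen attribute subset, and it assumes attributes outside that subset are independent of the rest. The names `pattern_count`, `build_label`, and `estimate_count` are hypothetical.

```python
from collections import Counter


def pattern_count(rows, pattern):
    """Exact count of rows matching every attribute=value pair in `pattern`."""
    return sum(all(row.get(a) == v for a, v in pattern.items()) for row in rows)


def build_label(rows, label_attrs):
    """Hypothetical label (an assumption, not the paper's definition):
    per-value counts for every attribute, plus joint counts over `label_attrs`."""
    value_counts = Counter((a, v) for row in rows for a, v in row.items())
    joint = Counter(tuple(row[a] for a in label_attrs) for row in rows)
    return {
        "total": len(rows),
        "value_counts": value_counts,
        "attrs": tuple(label_attrs),
        "joint": joint,
    }


def estimate_count(label, pattern):
    """Estimate a pattern's count from the label alone.

    Attributes covered by the label use the stored joint counts; every other
    attribute scales the result by its marginal fraction (independence
    assumption made for this sketch)."""
    if label["total"] == 0:
        return 0.0
    attrs = label["attrs"]
    # Sum the stored joint counts that agree with the pattern on covered attributes.
    base = sum(
        count
        for combo, count in label["joint"].items()
        if all(pattern.get(a, combo[i]) == combo[i] for i, a in enumerate(attrs))
    )
    estimate = float(base)
    for a, v in pattern.items():
        if a not in attrs:
            estimate *= label["value_counts"].get((a, v), 0) / label["total"]
    return estimate


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    rows = [
        {"gender": "F", "race": "A", "age": "young"},
        {"gender": "F", "race": "A", "age": "old"},
        {"gender": "F", "race": "B", "age": "young"},
        {"gender": "M", "race": "A", "age": "young"},
        {"gender": "M", "race": "B", "age": "old"},
        {"gender": "M", "race": "B", "age": "old"},
        {"gender": "M", "race": "B", "age": "young"},
        {"gender": "F", "race": "A", "age": "young"},
    ]
    label = build_label(rows, ("gender", "race"))
    query = {"gender": "F", "age": "young"}
    print("estimate:", estimate_count(label, query))
    print("exact:   ", pattern_count(rows, query))
```

In this sketch, a pattern that lies entirely within the label's attribute subset is answered exactly, while other patterns are only approximated. Which subset to store under a size bound is the optimization problem the abstract refers to; the sketch does not attempt to solve it.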

Original language: English
Title of host publication: Proceedings - 2021 IEEE 37th International Conference on Data Engineering, ICDE 2021
Publisher: Institute of Electrical and Electronics Engineers
Pages: 1961-1966
Number of pages: 6
ISBN (Electronic): 9781728191843
State: Published - 1 Apr 2021
Externally published: Yes
Event: 37th IEEE International Conference on Data Engineering, ICDE 2021 - Virtual, Chania, Greece
Duration: 19 Apr 2021 → 22 Apr 2021

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2021-April
ISSN (Print): 1084-4627

Conference

Conference: 37th IEEE International Conference on Data Engineering, ICDE 2021
Country/Territory: Greece
City: Virtual, Chania
Period: 19/04/21 → 22/04/21

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems
