SAMPLING-BASED DEDUPLICATION ESTIMATION

Danny Harnik (Inventor), David Chambliss (Inventor), Dmitry Sotnikov (Inventor), Oded Margalit (Inventor)

Research output: Patent

Abstract

A method, including partitioning a dataset into a first number of data units, and selecting, based on a sampling ratio, a second number of the data units. A hash value is calculated for each of the selected data units, and a first histogram is computed indicating a first duplication count for each of the calculated hash values. Based on respective frequencies of the calculated hash values, a second histogram is computed indicating an observed frequency for each of the first duplication counts in the first histogram, and based on the sampling ratio and the second histogram, a target function is derived. A third histogram that minimizes the target function is derived, the third histogram including, for the first number of the storage units, second duplication counts and a respective predicted frequency for each of the second duplication counts. Finally, a deduplication ratio is determined based on the third histogram.
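
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify the form of the target function, so the binomial sampling model, the absolute-difference target, and the definition of the deduplication ratio as distinct units divided by total units are all assumptions made for illustration. Every function name here is hypothetical, and the minimization step that yields the third histogram is left to a solver of the reader's choice.

```python
import hashlib
import random
from collections import Counter
from math import comb


def sample_histograms(data: bytes, unit_size: int, ratio: float, seed: int = 0):
    """Partition, sample, hash, and build the first and second histograms."""
    # Partition the dataset into fixed-size data units (the "first number").
    units = [data[i:i + unit_size] for i in range(0, len(data), unit_size)]
    # Select the "second number" of units according to the sampling ratio.
    k = max(1, round(ratio * len(units)))
    chosen = random.Random(seed).sample(units, k)
    # First histogram: hash value -> duplication count within the sample.
    first = Counter(hashlib.sha1(u).hexdigest() for u in chosen)
    # Second histogram: duplication count -> number of hashes with that count.
    second = Counter(first.values())
    return len(units), first, second


def expected_sample_histogram(candidate: dict, ratio: float) -> dict:
    """Expected second histogram for a candidate full-dataset histogram.

    Assumption: each unit lands in the sample independently with probability
    `ratio`, so a hash occurring j times overall is seen i times with
    binomial probability C(j, i) * ratio^i * (1 - ratio)^(j - i).
    """
    expected = Counter()
    for j, freq in candidate.items():
        for i in range(1, j + 1):
            expected[i] += freq * comb(j, i) * ratio**i * (1 - ratio) ** (j - i)
    return dict(expected)


def target(candidate: dict, observed: dict, ratio: float) -> float:
    """Assumed target function: L1 distance between expected and observed."""
    expected = expected_sample_histogram(candidate, ratio)
    keys = set(expected) | set(observed)
    return sum(abs(expected.get(i, 0.0) - observed.get(i, 0)) for i in keys)


def dedup_ratio(candidate: dict) -> float:
    """Assumed definition: distinct units divided by total units."""
    distinct = sum(candidate.values())
    total = sum(j * freq for j, freq in candidate.items())
    return distinct / total
```

In this sketch, a candidate third histogram maps each duplication count to a predicted frequency over the whole dataset. A solver would search for the candidate whose counts sum to the total number of units and that minimizes `target`, and `dedup_ratio` would then be applied to the result. The fixed-size sample drawn in `sample_histograms` only approximates the independent-sampling assumption in `expected_sample_histogram`.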

Original language: English
Patent number: US2017199895
IPC: G06F 17/30 A I
Priority date: 13/01/16
State: Published - 13 Jul 2017
