Estimation of deduplication ratios in large data sets

Danny Harnik, Oded Margalit, Dalit Naor, Dmitry Sotnikov, Gil Vernik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

32 Scopus citations

Abstract

We study the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. This turns out to be a challenging task: it has been shown both empirically and analytically that essentially all of the data at hand needs to be inspected in order to come up with an accurate estimation when deduplication is involved. Moreover, even when permitted to inspect all the data, there are challenges in devising an efficient, yet accurate, method. Efficiency in this case refers to the demanding CPU, memory and disk usage associated with deduplication and compression. Our study focuses on what can be done when scanning the entire data set. We present a novel two-phased framework for such estimations. Our techniques are provably accurate, yet run with very low memory requirements and avoid overheads associated with maintaining large deduplication tables. We give formal proofs of the correctness of our algorithm, compare it to existing techniques from the database and streaming literature, and evaluate our technique on a number of real-world workloads. For example, we estimate the data reduction ratio of a 7 TB data set with accuracy guarantees of at most a 1% relative error while using as little as 1 MB of RAM (and no additional disk access). In the interesting case of full-file deduplication, our framework readily accepts optimizations that allow estimation on a large data set without reading most of the actual data. For one of the workloads we used in this work, we achieved an accuracy guarantee of 2% relative error while reading only 27% of the data from disk. Our technique is practical, simple to implement, and useful for multiple scenarios, including estimating the number of disks to buy, choosing a deduplication technique, deciding whether or not to dedupe, and conducting large-scale academic studies related to deduplication ratios.
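
The abstract does not spell out what the two phases are, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm as published. It assumes a sample phase that picks chunk positions uniformly at random, followed by a scan phase that makes one pass over all the data and counts occurrences of the sampled chunks only. Averaging 1/count over the sample then gives an unbiased estimate of (distinct chunks) / (total chunks), while memory grows with the sample size rather than with the data set. The function name `estimate_dedup_ratio`, the use of SHA-1, and the in-memory list of chunks are all hypothetical choices made for the example.

```python
import hashlib
import random


def estimate_dedup_ratio(chunks, sample_size, seed=0):
    """Estimate (distinct chunks) / (total chunks) from a small sample.

    Illustrative sketch only: the sample/scan split is an assumption,
    not a description taken from the abstract.
    """
    rng = random.Random(seed)
    n = len(chunks)

    # Assumed phase 1 (sample): choose positions uniformly, with replacement.
    positions = [rng.randrange(n) for _ in range(sample_size)]
    sampled = [hashlib.sha1(chunks[i]).digest() for i in positions]

    # Assumed phase 2 (scan): one pass over all chunks, counting only the
    # hashes that were sampled, so the table never exceeds sample_size keys.
    counts = dict.fromkeys(sampled, 0)
    for chunk in chunks:
        h = hashlib.sha1(chunk).digest()
        if h in counts:
            counts[h] += 1

    # A chunk with c copies is sampled with probability c/n and contributes
    # 1/c, so the expected value of the average is (distinct chunks) / n.
    return sum(1.0 / counts[h] for h in sampled) / sample_size


if __name__ == "__main__":
    # Hypothetical toy data: 50 copies of one chunk plus 50 unique chunks.
    data = [b"x"] * 50 + [bytes([i]) for i in range(50)]
    print(estimate_dedup_ratio(data, sample_size=20))
```

In this toy data set the true ratio is 51/100; the estimate varies with the random sample and tightens as `sample_size` grows. A real tool would stream chunks from disk rather than hold them in a list.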

Original language: English
Title of host publication: 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies, MSST 2012
State: Published - 18 Sep 2012
Externally published: Yes
Event: 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies, MSST 2012 - Pacific Grove, CA, United States
Duration: 16 Apr 2012 to 20 Apr 2012

Publication series

Name: IEEE Symposium on Mass Storage Systems and Technologies
ISSN (Print): 2160-1968

Conference

Conference: 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies, MSST 2012
Country/Territory: United States
City: Pacific Grove, CA
Period: 16/04/12 to 20/04/12
