TY - GEN
T1 - Beating CountSketch for heavy hitters in insertion streams
AU - Braverman, Vladimir
AU - Ivkin, Nikita
AU - Chestnut, Stephen R.
AU - Woodruff, David P.
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/6/19
Y1 - 2016/6/19
N2 - Given a stream p1,..., pm of items from a universe U, which, without loss of generality we identify with the set of integers {1,2,...,n}, we consider the problem of returning all l2-heavy hitters, i.e., those items j for which fj ≥ ϵ√F2, where fj is the number of occurrences of item j in the stream, and F2 = Σi∈[n] f2i. Such a guarantee is considerably stronger than the l1-guarantee, which finds those j for which fj ≥ ϵm. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such j using Θ(log2 n) bits of space (for constant ϵ > 0). The only known lower bound is Ω(log n) bits of space, which comes from the need to specify the identities of the items found. In this paper we show one can achieve O(log n log log n) bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including: (1) The first algorithm for estimating F2 simultaneously at all points in a stream using only O(log n log log n) bits of space, improving a natural union bound. (2) A way to estimate the l∞ norm of a stream up to additive error ϵ√F2 with O (log n log log n) bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.
AB - Given a stream p1,..., pm of items from a universe U, which, without loss of generality we identify with the set of integers {1,2,...,n}, we consider the problem of returning all l2-heavy hitters, i.e., those items j for which fj ≥ ϵ√F2, where fj is the number of occurrences of item j in the stream, and F2 = Σi∈[n] f2i. Such a guarantee is considerably stronger than the l1-guarantee, which finds those j for which fj ≥ ϵm. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such j using Θ(log2 n) bits of space (for constant ϵ > 0). The only known lower bound is Ω(log n) bits of space, which comes from the need to specify the identities of the items found. In this paper we show one can achieve O(log n log log n) bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including: (1) The first algorithm for estimating F2 simultaneously at all points in a stream using only O(log n log log n) bits of space, improving a natural union bound. (2) A way to estimate the l∞ norm of a stream up to additive error ϵ√F2 with O (log n log log n) bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.
KW - Chaining
KW - Data streams
KW - Heavy hitters
UR - https://www.scopus.com/pages/publications/84979306895
U2 - 10.1145/2897518.2897558
DO - 10.1145/2897518.2897558
M3 - Conference contribution
AN - SCOPUS:84979306895
T3 - Proceedings of the Annual ACM Symposium on Theory of Computing
SP - 740
EP - 753
BT - STOC 2016 - Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing
A2 - Mansour, Yishay
A2 - Wichs, Daniel
PB - Association for Computing Machinery
T2 - 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016
Y2 - 19 June 2016 through 21 June 2016
ER -