Thought this was cool: Streaming Data Mining Tutorial slides (and more)
We present a refined analysis of the classic Count-Sketch streaming heavy hitters algorithm [CCF02]. Count-Sketch uses O(k log n) linear measurements of a vector x in R^n to give an estimate x’ of x. The standard analysis shows that this estimate x’ satisfies ||x’-x||_infty^2 < ||x_tail||_2^2 / k, where x_tail is the vector containing all but the largest k coordinates of x. Our main result is that most of the coordinates of x’ have substantially less error than this upper bound; namely, for any c < O(log n), we show that each coordinate i satisfies(x’_i – x_i)^2 < (c/log n) ||x_tail||_2^2/k with probability 1-2^{-c}, as long as the hash functions are fully independent. This subsumes the previous bound. Our improved point estimates also give better results for l_2 recovery of exactly k-sparse estimates x^* when x is drawn from a distribution with suitable decay, such as a power law.Our proof shows that any random variable with positive real Fourier transform and finite variance concentrates around 0 at least as well as a Gaussian. This result, which may be of independent interest, gives good concentration even when the noise does not converge to a Gaussian.
For additional reference:
Already earlier this month, we saw this subject area in different meetings and blog entries:
Let us first recall that in Coding, Complexity, and Sparsity Workshop Slides, Piotr Indyk presented a Tutorial on compressive sensing/streaming algorithms.
More recent blog entries by Andrew McGregor, Suresh Venkatasubramanian and Anna Gilbert summarized the recent Dortmund meeting:
- The MMDS 2012 Slides are out! Workshop on Algorithms for Modern Massive Data Sets
- Sublinear Algorithms 2011 Presentation Slides.
In particular in Part 3: Videos of the UC Berkeley Conference: “From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications”
Petros Drineas talked about Randomized Algorithms in Data Mining: a Linear Algebraic Approach
We recently also covered similar work on Nuit Blanche on the subject of Randomization and Linear Algebra:
- Fast approximation of matrix coherence and statistical leverage, Michael Mahoney, Petros Drineas, Malik Magdon-Ismail, David Woodruff
- Parallel implementation of a fast randomized algorithm for the decomposition of low rank matrices by Andrew Lucas, Mark Stalzer, John Feo .