Volume 9 (2013) Article 30 pp. 897-945
Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream
Revised: December 17, 2013
Published: December 31, 2013
[PDF (442K)]    [PS (2228K)]    [PS.GZ (506K)]
[Source ZIP]
Keywords: algorithms, hashing, extractors, derandomization, average case, pairwise independence, Bloom filters, linear probing, balanced allocations
ACM Classification: F.2
AMS Classification: 68W20, 68Q25, 68W40

Abstract: [Plain Text Version]

Hashing is fundamental to many algorithms and data structures widely used in practice. For the theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits (in the length of a data item) to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or $O(1)$-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing.

In practice, however, it is commonly observed that simple hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source,” whereby each new data item has some “entropy” given the previous ones. As long as the R\'enyi entropy per data item is sufficiently large, it turns out that the performance when choosing a hash function from a 2-universal family is essentially the same as for a truly random hash function. We describe results for several sample applications, including linear probing, chained hashing, balanced allocations, and Bloom filters.

Towards developing our results, we prove tight bounds for hashing block sources, determining the entropy required per block for the distribution of hashed values to be close to uniformly distributed.

The paper is a merge of the following two conference papers: Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream by M.M. and S.V., Proc. 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 746-755, 2008, and Tight Bounds for Hashing Block Sources by K-M.C., Proc. 11th International Workshop, APPROX 2008, and 12th International Workshop, RANDOM 2008, on Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques, pages 357 - 370, 2008.