Notes

Chapter 10: Processes of Perception and Analysis

Section 12: Human Thinking

Hashing

Given data in the form of sequences of numbers between 0 and k - 1, a very simple hashing scheme is just to compute FromDigits[Take[list, n], k]. But for data corresponding, say, to English words this scheme yields a very nonuniform distribution of hash codes, since, for example, there are many words beginning with "ba", but none beginning with "bb". The slightly modified but still very simple scheme Mod[FromDigits[list, k], m], where m is usually chosen to be a prime, is what is most often used in practice. For a fair fraction of values of m, the hash codes obtained from this scheme change whenever any element of list is changed. If m = k ^s-1 then it turns out that interchanging a pair of adjacent length s blocks in list never affects the result. Out of the many hundreds of times that I have used hashing in practice, I recall only a couple of cases where schemes like the one just described were not adequate, and in these cases the data always turned out to have quite dramatic regularities.

In typical applications hash codes give locations in computer memory, from which actual data is found either by following a chain of pointers, or by probing successive locations until an empty one is reached. In the internals of Mathematica the most common way that hashing is used is for recognizing data and finding unique stored versions of it. There are several subtleties associated with setting up hash codes that appropriately handle approximate real numbers and Mathematica patterns.

Hashing is a sufficiently simple idea that it has been invented independently many times since at least the 1950s. The main alternative to hashing is to store data with successive elements corresponding to successive levels in a tree. In the past decade, hashing has become widely used not only for searching but also for authentication. The basic idea in this case is to take a document and to compute from it a small hash code that changes when almost any change is made in the document, and for which it is a difficult problem of cryptanalysis to work out what changes in the document will lead to no change in the hash code. Schemes for such hash codes can fairly easily be constructed using rule 30 and other cellular automata.