Notes

Chapter 10: Processes of Perception and Analysis

Section 10: Cryptography and Cryptanalysis

[Redundancy in] text

As the picture below illustrates, English text typically remains intelligible until about half its characters have been deleted, indicating that it has a redundancy of around 0.5. Most other languages have slightly higher redundancies, making documents in those languages slightly longer than their counterparts in English.
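The deletion experiment described above can be tried with a small sketch. It is only an illustration, not the procedure used for the picture: the function name delete_characters, the choice to blank out characters with an underscore rather than remove them, the decision to leave spaces intact, and the fixed random seed are all assumptions made here.

```python
import random

def delete_characters(text, fraction, seed=0, placeholder="_"):
    """Replace roughly the given fraction of non-space characters with a placeholder.

    Assumption: spaces are left untouched so that word boundaries stay visible.
    """
    rng = random.Random(seed)
    # Positions of all characters that are eligible for deletion.
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    n_delete = round(fraction * len(positions))
    chosen = set(rng.sample(positions, n_delete))
    return "".join(placeholder if i in chosen else ch
                   for i, ch in enumerate(text))

sample = ("English text typically remains intelligible until about "
          "half its characters have been deleted")

# Show the same sentence with progressively more characters blanked out.
for fraction in (0.0, 0.25, 0.5, 0.75):
    print(f"{fraction:.2f}  {delete_characters(sample, fraction)}")
```

Reading the printed lines from top to bottom gives an informal sense of where the sentence stops being recoverable. A single short sentence is of course far too little to measure redundancy.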

Redundancy can in principle be estimated by breaking text into blocks of length b, then looking for the limit of the entropy per character as b → ∞ (see page 1084). In practice, however, statistically uniform samples of text tend not to be large enough to allow more than about b = 6 to be reached, and the presence of correlations (even though exponentially damped) between far-separated letters means that computed entropies usually decrease continually with b, making it difficult to estimate their limit. Note that, particularly in computer languages, higher redundancy is found if one takes account of grammatical structure.
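The block-based estimate can be sketched as follows. This is a minimal illustration under stated assumptions, not the method behind any figures quoted here: the names block_entropy and redundancy_estimate are hypothetical, blocks are taken as overlapping windows, and redundancy is taken to be one minus the ratio of the entropy per character to its maximum possible value log2(k) for an alphabet of k symbols.

```python
import math
from collections import Counter

def block_entropy(text, b):
    """Shannon entropy, in bits, of the empirical distribution of length-b blocks.

    Assumption: blocks are overlapping windows, one starting at each position.
    """
    blocks = [text[i:i + b] for i in range(len(text) - b + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def redundancy_estimate(text, b, alphabet_size):
    """Estimate redundancy as 1 - (entropy per character) / log2(alphabet_size)."""
    per_char = block_entropy(text, b) / b
    return 1 - per_char / math.log2(alphabet_size)

# A deliberately tiny sample; any realistic estimate needs far more text.
sample = ("redundancy can in principle be estimated by breaking text into "
          "blocks and looking for the limit of the entropy as the block "
          "length grows")
k = len(set(sample))  # alphabet size taken from the sample itself (an assumption)

for b in range(1, 5):
    h = block_entropy(sample, b) / b
    r = redundancy_estimate(sample, b, k)
    print(f"b={b}  entropy per character={h:.3f} bits  redundancy={r:.3f}")
```

With a sample this small, most blocks of length 3 or more occur only once, so the entropy per character falls with b largely because of undersampling rather than genuine structure in the text. This is the same practical obstacle noted above that limits block lengths to about b = 6.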