20.5 The Entropy of English – Introduction to Cryptography with Coding Theory, 3rd Edition

20.5 The Entropy of English

In an English text, how much information is obtained per letter? If we had a random sequence of letters, each appearing with probability 1/26, then the entropy would be log2(26)=4.70; so each letter would contain 4.7 bits of information. If we include spaces, we get log2(27)=4.75. But the letters are not equally likely: a has frequency .082, b has frequency .015, etc. (see Section 2.3). Therefore, we consider

(.082log2.082+.015log2.015+)=4.18.

However, this doesn’t tell the whole story. Suppose we have the sequence of letters we are studyin. There is very little uncertainty as to what the last letter is; it is easy to guess that it is g. Similarly, if we see the letter q, it is extremely likely that the next letter is u. Therefore, the existing letters often give information about the next letter, which means that there is not as much additional information carried by that letter. This says that the entropy calculated previously is still too high. If we use tables of the frequencies of the 262=676 digrams (a digram is a two-letter combination), we can calculate the conditional entropy of one letter, given the preceding letter, to be 3.56. Using trigram frequencies, we find that the conditional entropy of a letter, given the preceding two letters, is approximately 3.3. This means that, on the average, if we know two consecutive letters in a text, the following letter carries 3.3 bits of additional information. Therefore, if we have a long text, we should expect to be able to compress it at least by a factor of around 3.3/4.7=.7.

Let L represent the letters of English. Let LN denote the N-gram combinations. Define the entropy of English to be

HEnglish=limNH(LN)N, 

where H(LN) denotes the entropy of N-grams. This gives the average amount of information per letter in a long text, and it also represents the average amount of uncertainty in guessing the next letter, if we already know a lot of the text. If the letters were all independent of each other, so the probability of the digram qu equaled the probability of q times the probability of u, then we would have H(LN)=NH(L), and the limit would be H(L), which is the entropy for one-letter frequencies. But the interactions of letters, as noticed in the frequencies for digrams and trigrams, lower the value of H(LN).

How do we compute H(LN)? Calculating 100-gram frequencies is impossible. Even tabulating the most common of them and getting an approximation would be difficult. Shannon proposed the following idea.

Suppose we have a machine that is an optimal predictor, in the sense that, given a long string of text, it can calculate the probabilities for the letter that will occur next. It then guesses the letter with highest probability. If correct, it notes the letter and writes down a 1. If incorrect, it guesses the second most likely letter. If correct, it writes down a 2, etc. In this way, we obtain a sequence of numbers. For example, consider the text itissunnytoday. Suppose the predictor says that t is the most likely for the 1st letter, and it is wrong; its second guess is i, which is correct, so we write the i and put 2 below it. The predictor then predicts that t is the next letter, which is correct. We put 1 beneath the t. Continuing, suppose it finds i on its 1st guess, etc. We obtain a situation like the following:

itissunnytoday21114321411111

Using the prediction machine, we can reconstruct the text. The prediction machine says that its second guess for the first letter will be i, so we know the 1st letter is i. The predictor says that its first guess for the next letter is t, so we know that’s next. The first guess for the next is i, etc.

What this means is that if we have a machine for predicting, we can change a text into a string of numbers without losing any information, because we can reconstruct the text. Of course, we could attempt to write a computer program to do the predicting, but Shannon suggested that the best predictor is a person who speaks English. Of course, a person is unlikely to be as deterministic as a machine, and repeating the experiment (assuming the person forgets the text from the first time) might not yield an identical result. So reconstructing the text might present a slight difficulty. But it is still a reasonable assumption that a person approximates an optimal predictor.

Given a sequence of integers corresponding to a text, we can count the frequency of each number. Let

qi= frequency of the number i.

Since the text and the sequence of numbers can be reconstructed from each other, their entropies must be the same. The largest the entropy can be for the sequence of numbers is when these numbers are independent. In this case, the entropy is i=126qilog2(qi). However, the numbers are probably not independent. For example, if there are a couple consecutive 1s, then perhaps the predictor has guessed the rest of the word, which means that there will be a few more 1s. However, we get an upper bound for the entropy, which is usually better than the one we obtain using frequencies of letters. Moreover, Shannon also found a lower bound for the entropy. His results are

1=126i(qiqi+1)log2(i)HEnglishi=126qilog2(qi).

Actually, these are only approximate upper and lower bounds, since there is experimental error, and we are really considering a limit as N.

These results allow an experimental estimation of the entropy of English. Alice chooses a text and Bob guesses the first letter, continuing until the correct guess is made. Alice records the number of guesses. Bob then tries to guess the second letter, and the number of guesses is again recorded. Continuing in this way, Bob tries to guess each letter. When he is correct, Alice tells him and records the number of guesses. Shannon gave Table 20.1 as a typical result of an experiment. Note that he included spaces, but ignored punctuation, so he had 27 possibilities: There are 102 symbols. There are seventy-nine 1s, eight 2s, three 3s, etc. This gives

q1=79/102, q2=8/102, q3=3/102, q4=q5=2/102, q6=3/102, q7=q8=q11=q15=q17=1/102.

Table 20.1 Shannon’s Experiment on the Entropy of English

The upper bound for the entropy is therefore

(79102log279102++1102log21102)1.42.

Note that since we are using 0log20=0, the terms with qi=0 can be omitted. The lower bound is

1(791028102)log2(1)+2(81023102)log2(2)+.72.

A reasonable estimate is therefore that the entropy of English is near 1, maybe slightly more than 1.

If we want to send a long English text, we could write each letter (and the space) as a string of five bits. This would mean that a text of length 102, such as the preceding, would require 510 bits. It would be necessary to use something like this method if the letters were independent and equally likely. However, suppose we do a Huffman encoding of the message 1, 1, 1, 5, 1, 1, 2,  from Table 20.1. Let

1121103101040100511100600107011008110001101000151000017100000.

All other numbers up to 27 can be represented by various combinations of six or more bits. To send the message requires

791+83+34+24++16=171 bits, 

which is 1.68 bits per letter.

Note that five bits per letter is only slightly more than the “random” entropy 4.75, and 1.68 bits per letter is slightly more than our estimate of the entropy of English. These agree with the result that entropy differs from the average length of a Huffman encoding by at most 1.

One way to look at the preceding entropy calculations is to say that English is around 75% redundant. Namely, if we send a long message in standard written English, compared to the optimally compressed text, the ratio is approximately 4 to 1 (that is, the random entropy 4.75 divided by the entropy of English, which is around 1). In our example, we were close, obtaining a ratio near 3 to 1 (namely 4.75/1.68).

Define the redundancy of English to be

R=1HEnglishlog2(26).

Then R is approximately 0.75, which is the 75% redundancy mentioned previously.

20.5.1 Unicity Distance

Suppose we have a ciphertext. How many keys will decrypt it to something meaningful? If the text is long enough, we suspect that there is a unique key and a unique corresponding plaintext. The unicity distance n0 for a cryptosystem is the length of ciphertext at which one expects that there is a unique meaningful plaintext. A rough estimate for the unicity distance is

n0=log2|K|Rlog2|L|, 

where |K| is the number of possible keys, |L| is the number of letters or symbols, and R is the redundancy (see [Stinson]). We’ll take R=.75 (whether we include spaces in our language or not; the difference is small).

For example, consider the substitution cipher, which has 26! keys. We have

n0=log226!.75log22625.1.

This means that if a ciphertext has length 25 or more, we expect that usually there is only one possible meaningful plaintext. Of course, if we have a ciphertext of length 25, there are probably several letters that have not appeared. Therefore, there could be several possible keys, all of which decrypt the ciphertext to the same plaintext.

As another example, consider the affine cipher. There are 312 keys, so

n0=log2312.75log2262.35.

This should be regarded as only a very rough approximation. Clearly it should take a few more letters to get a unique decryption. But the estimate of 2.35 indicates that very few letters suffice to yield a unique decryption in most cases for the affine cipher.

Finally, consider the one-time pad for a message of length N. The encryption is a separate shift mod 26 for each letter, so there are 26N keys. We obtain the estimate

n0log226N.75log226=1.33N.

In this case, it says we need more letters than the entire ciphertext to get a unique decryption. This reflects the fact that all plaintexts are possible for any ciphertext.