It is well known that certain letters, such as 'E' and 'T', appear more frequently in English than other letters (see here for more details on individual letter frequencies), and, unsurprisingly, it turns out that certain bigrams (pairs of letters, such as 'AC') appear more frequently than other bigrams.
The spreadsheet excerpt below is based on the complete works of Shakespeare and also some of the works by Charles Dickens. In total, it counted over 5,000,000 bigrams to generate the statistics that you can see. The 45 most frequent bigrams are displayed and if you click on the download icon, you can download the whole spreadsheet which contains the frequencies for all 676 possible bigrams.
This message has been created using a simple transposition cipher and the key of 3-1-4-2:
Although the key is 3-1-4-2, this is for converting the plaintext into the ciphertext; the key to go in the other direction is 2-4-1-3, and this is the key that we shall be trying to find below.
First, we shall assume that we have tested without success all keys that are 2 or 3 digits long, and we have now got to keys that are 4 digits long. One option is to brute force all of them, but we are going to use bigram frequencies instead, as this method will also enable us to find longer keys much more quickly.
To start this process, we will arrange the ciphertext into rows, with each row being the size of the key length being considered (i.e. 4 in this case):
1 | 2 | 3 | 4 | |
1 | M | I | Y | N |
2 | I | O | N | P |
3 | N | I | T | O |
4 | R | H | E | E |
5 | N | I | O | S |
6 | I | T | N | H |
7 | U | G | R | S |
8 | S | P | I | I |
9 | A | N | B | G |
10 | T | O | S | U |
11 | E | O | B | M |
12 | R | I | A | G |
13 | A | M | P | S |
14 | A | P | R | E |
15 | G | I | M | N |
16 | E | O | F | R |
17 | Q | R | U | E |
18 | T | E | L | N |
19 | H | Y | A | T |
20 | T | N | H | O |
21 | S | E | I | R |
22 | N | N | G | E |
23 | S | L | H | I |
24 | V | G | E | I |
25 | H | N | A | T |
26 | N | T | D | I |
27 | I | I | D | V |
28 | L | U | L | A |
29 | T | E | E | T |
30 | A | R | P | S |
31 | A | P | R | E |
32 | T | W | H | I |
33 | F | D | F | I |
34 | E | E | N | R |
35 | R | T | E | F |
36 | E | Q | N | U |
37 | E | C | S | I |
We can reconstruct the ciphertext (not plaintext) from this table by just reading from left-to-right and then down the rows.
We can reconstuct the plaintext by first putting the columns in the correct order and then reading from left-to-right and then down the rows.
But what is the correct order of the columns? Let's start with a simpler question: what column should come after column 1? Is it column 2, 3 or 4?
We can try out each pairing of 1-2, 1-3 and 1-4 and then look down the pair of columns to see the bigrams that have been generated. We can then consult our data about bigram frequencies to see which of the pairings generates the bigrams that occur most frequently in English. In the table below, I have tried out the three pairings of 1-1, then 1-2 and finally 1-3, and then I have added the bigram score from my Shakespeare / Dickens data mentioned above and I have placed the total of all these scores for that column in the yellow cell near the top:
1 | 2 | Bigram | Score | Total | 1 | 3 | Bigram | Score | Total | 1 | 4 | Bigram | Score | Total | ||
M | I | MI | 12144 | 674,363 | M | Y | MY | 15059 | 1,567,009 | M | N | MN | 1429 | 766,050 | ||
I | O | IO | 13191 | I | N | IN | 78122 | I | P | IP | 3449 | |||||
N | I | NI | 14716 | N | T | NT | 46450 | N | O | NO | 36923 | |||||
R | H | RH | 6595 | R | E | RE | 64543 | R | E | RE | 64543 | |||||
N | I | NI | 14716 | N | O | NO | 36923 | N | S | NS | 17110 | |||||
I | T | IT | 42272 | I | N | IN | 78122 | I | H | IH | 2959 | |||||
U | G | UG | 6391 | U | R | UR | 28685 | U | S | US | 25486 | |||||
S | P | SP | 11908 | S | I | SI | 25692 | S | I | SI | 25692 | |||||
A | N | AN | 85411 | A | B | AB | 6792 | A | G | AG | 8950 | |||||
T | O | TO | 50246 | T | S | TS | 17129 | T | U | TU | 7863 | |||||
E | O | EO | 15362 | E | B | EB | 11705 | E | M | EM | 23706 | |||||
R | I | RI | 28716 | R | A | RA | 25920 | R | G | RG | 5366 | |||||
A | M | AM | 15034 | A | P | AP | 6877 | A | S | AS | 34501 | |||||
A | P | AP | 6877 | A | R | AR | 46721 | A | E | AE | 1252 | |||||
G | I | GI | 6028 | G | M | GM | 1444 | G | N | GN | 2251 | |||||
E | O | EO | 15362 | E | F | EF | 14952 | E | R | ER | 86115 | |||||
Q | R | QR | 0 | Q | U | QU | 4970 | Q | E | QE | 0 | |||||
T | E | TE | 36612 | T | L | TL | 8707 | T | N | TN | 4096 | |||||
H | Y | HY | 8078 | H | A | HA | 63296 | H | T | HT | 15119 | |||||
T | N | TN | 4096 | T | H | TH | 165636 | T | O | TO | 50246 | |||||
S | E | SE | 38283 | S | I | SI | 25692 | S | R | SR | 2273 | |||||
N | N | NN | 3990 | N | G | NG | 38790 | N | E | NE | 31233 | |||||
S | L | SL | 6204 | S | H | SH | 29271 | S | I | SI | 25692 | |||||
V | G | VG | 4 | V | E | VE | 34516 | V | I | VI | 7310 | |||||
H | N | HN | 1861 | H | A | HA | 63296 | H | T | HT | 15119 | |||||
N | T | NT | 46450 | N | D | ND | 66738 | N | I | NI | 14716 | |||||
I | I | II | 884 | I | D | ID | 12330 | I | V | IV | 7312 | |||||
L | U | LU | 4498 | L | L | LL | 38393 | L | A | LA | 20757 | |||||
T | E | TE | 36612 | T | E | TE | 36612 | T | T | TT | 26164 | |||||
A | R | AR | 46721 | A | P | AP | 6877 | A | S | AS | 34501 | |||||
A | P | AP | 6877 | A | R | AR | 46721 | A | E | AE | 1252 | |||||
T | W | TW | 12173 | T | H | TH | 165636 | T | I | TI | 33680 | |||||
F | D | FD | 841 | F | F | FF | 5999 | F | I | FI | 10187 | |||||
E | E | EE | 27655 | E | N | EN | 60950 | E | R | ER | 86115 | |||||
R | T | RT | 26758 | R | E | RE | 64543 | R | F | RF | 5274 | |||||
E | Q | EQ | 1434 | E | N | EN | 60950 | E | U | EU | 4431 | |||||
E | C | EC | 19363 | E | S | ES | 61950 | E | I | EI | 22978 |
The three bigram scores are as follows:
We can see that we get a much higher bigram score if column 1 is followed by column 3. We do not know the final key, but we can be pretty confident that, in the final key, 1 is followed by 3 as this has generated the bigrams most resembling normal English text.
If we repeat this process for column 2, which can be followed by column 1, 3 or 4, we get the following scores:
We can see that we get a much higher bigram score if column 2 is followed by column 4. Let's repeat this process for the last two columns 3 and 4:
There is no clear answer from column 3, but there is for column 4 - it should be followed by column 1. So we have established the following:
If we put these together, we can see that we get the key: 2,4,1,3, which is the key mentioned near the top of this page.
Why was there no clear answer for column 3? This is because column 3 is at the end of the key, so it is not followed by a letter from any of the other columns on that row - so none of the pairings generated a typical set of English bigrams. Instead, it is followed by a letter from the next row.
I have used a key length of just 4 on this page. This has made it easier to illustrate how it is used but it has hidden its power if we compare the bigram frequency method with a brute force method. For a key length of 4:
We are not looking at exactly the same thing with each method, but this will not matter if there is a vast difference between the number of possibilities that we need to look at in each case. Let's look at the equivalent figures for a key length of 12:
When the key length is 12, the brute force method requires us to look at nearly 4,000,000 more possibilities than the bigram frequency method!