CipherTools

Cracking a simple Transposition Cipher

Bigram frequencies

It is well known that certain letters, such as 'E' and 'T', appear more frequently in English than other letters (see here for more details on individual letter frequencies), and, unsurprisingly, it turns out that certain bigrams (pairs of letters, such as 'AC') appear more frequently than other bigrams.

The spreadsheet excerpt below is based on the complete works of Shakespeare and also some of the works by Charles Dickens. In total, it counted over 5,000,000 bigrams to generate the statistics that you can see. The 45 most frequent bigrams are displayed and if you click on the download icon, you can download the whole spreadsheet which contains the frequencies for all 676 possible bigrams.

This message has been created using a simple transposition cipher and the key of 3-1-4-2:

MIYNI ONPNI TORHE ENIOS ITNHU GRSSP IIANB GTOSU EOBMR IAGAM PSAPR EGIMN EOFRQ RUETE LNHYA TTNHO SEIRN NGESL HIVGE IHNAT NTDII IDVLU LATEE TARPS APRET WHIFD FIEEN RRTEF EQNUE CSI

Although the key is 3-1-4-2, this is for converting the plaintext into the ciphertext; the key to go in the other direction is 2-4-1-3, and this is the key that we shall be trying to find below.

First, we shall assume that we have tested without success all keys that are 2 or 3 digits long, and we have now got to keys that are 4 digits long. One option is to brute force all of them, but we are going to use bigram frequencies instead, as this method will also enable us to find longer keys much more quickly.

To start this process, we will arrange the ciphertext into rows, with each row being the size of the key length being considered (i.e. 4 in this case):

  1 2 3 4
1 M I Y N
2 I O N P
3 N I T O
4 R H E E
5 N I O S
6 I T N H
7 U G R S
8 S P I I
9 A N B G
10 T O S U
11 E O B M
12 R I A G
13 A M P S
14 A P R E
15 G I M N
16 E O F R
17 Q R U E
18 T E L N
19 H Y A T
20 T N H O
21 S E I R
22 N N G E
23 S L H I
24 V G E I
25 H N A T
26 N T D I
27 I I D V
28 L U L A
29 T E E T
30 A R P S
31 A P R E
32 T W H I
33 F D F I
34 E E N R
35 R T E F
36 E Q N U
37 E C S I

We can reconstruct the ciphertext (not plaintext) from this table by just reading from left-to-right and then down the rows.

We can reconstuct the plaintext by first putting the columns in the correct order and then reading from left-to-right and then down the rows.

But what is the correct order of the columns? Let's start with a simpler question: what column should come after column 1? Is it column 2, 3 or 4?

We can try out each pairing of 1-2, 1-3 and 1-4 and then look down the pair of columns to see the bigrams that have been generated. We can then consult our data about bigram frequencies to see which of the pairings generates the bigrams that occur most frequently in English. In the table below, I have tried out the three pairings of 1-1, then 1-2 and finally 1-3, and then I have added the bigram score from my Shakespeare / Dickens data mentioned above and I have placed the total of all these scores for that column in the yellow cell near the top:

1 2 Bigram Score Total 1 3 Bigram Score Total 1 4 Bigram Score Total
M I MI 12144 674,363 M Y MY 15059 1,567,009 M N MN 1429 766,050
I O IO 13191   I N IN 78122   I P IP 3449  
N I NI 14716   N T NT 46450   N O NO 36923  
R H RH 6595   R E RE 64543   R E RE 64543  
N I NI 14716   N O NO 36923   N S NS 17110  
I T IT 42272   I N IN 78122   I H IH 2959  
U G UG 6391   U R UR 28685   U S US 25486  
S P SP 11908   S I SI 25692   S I SI 25692  
A N AN 85411   A B AB 6792   A G AG 8950  
T O TO 50246   T S TS 17129   T U TU 7863  
E O EO 15362   E B EB 11705   E M EM 23706  
R I RI 28716   R A RA 25920   R G RG 5366  
A M AM 15034   A P AP 6877   A S AS 34501  
A P AP 6877   A R AR 46721   A E AE 1252  
G I GI 6028   G M GM 1444   G N GN 2251  
E O EO 15362   E F EF 14952   E R ER 86115  
Q R QR 0   Q U QU 4970   Q E QE 0  
T E TE 36612   T L TL 8707   T N TN 4096  
H Y HY 8078   H A HA 63296   H T HT 15119  
T N TN 4096   T H TH 165636   T O TO 50246  
S E SE 38283   S I SI 25692   S R SR 2273  
N N NN 3990   N G NG 38790   N E NE 31233  
S L SL 6204   S H SH 29271   S I SI 25692  
V G VG 4   V E VE 34516   V I VI 7310  
H N HN 1861   H A HA 63296   H T HT 15119  
N T NT 46450   N D ND 66738   N I NI 14716  
I I II 884   I D ID 12330   I V IV 7312  
L U LU 4498   L L LL 38393   L A LA 20757  
T E TE 36612   T E TE 36612   T T TT 26164  
A R AR 46721   A P AP 6877   A S AS 34501  
A P AP 6877   A R AR 46721   A E AE 1252  
T W TW 12173   T H TH 165636   T I TI 33680  
F D FD 841   F F FF 5999   F I FI 10187  
E E EE 27655   E N EN 60950   E R ER 86115  
R T RT 26758   R E RE 64543   R F RF 5274  
E Q EQ 1434   E N EN 60950   E U EU 4431  
E C EC 19363   E S ES 61950   E I EI 22978  

The three bigram scores are as follows:

  • Column 1 followed by column 2: 674,363
  • Column 1 followed by column 3: 1,567,009 *
  • Column 1 followed by column 4: 766,050

We can see that we get a much higher bigram score if column 1 is followed by column 3. We do not know the final key, but we can be pretty confident that, in the final key, 1 is followed by 3 as this has generated the bigrams most resembling normal English text.

If we repeat this process for column 2, which can be followed by column 1, 3 or 4, we get the following scores:

  • Column 2 followed by column 1: 700,773
  • Column 2 followed by column 3: 702,737
  • Column 2 followed by column 4: 1,394,534 *

We can see that we get a much higher bigram score if column 2 is followed by column 4. Let's repeat this process for the last two columns 3 and 4:

  • Column 3 followed by column 1: 874,800
  • Column 3 followed by column 2: 618,131
  • Column 3 followed by column 4: 736,740
  • Column 4 followed by column 1: 1,455,940 *
  • Column 4 followed by column 2: 737,675
  • Column 4 followed by column 3: 643,910

There is no clear answer from column 3, but there is for column 4 - it should be followed by column 1. So we have established the following:

  • Column 1 is followed by column 3
  • Column 2 is followed by column 4
  • Column 4 is followed by column 1

If we put these together, we can see that we get the key: 2,4,1,3, which is the key mentioned near the top of this page.

Why was there no clear answer for column 3? This is because column 3 is at the end of the key, so it is not followed by a letter from any of the other columns on that row - so none of the pairings generated a typical set of English bigrams. Instead, it is followed by a letter from the next row.

The power of the bigram frequency method

I have used a key length of just 4 on this page. This has made it easier to illustrate how it is used but it has hidden its power if we compare the bigram frequency method with a brute force method. For a key length of 4:

  • BRUTE FORCE: there are 4! (i.e. 4 x 3 x 2 x 1 = 24) possible keys to test - 1234, 1243, 1324, 1342, 1423, 1432, 2134, 2143, 2314, 2341, 2413, 2431, 3124, 3142, 3214, 3241, 3412, 3421, 4123, 4132, 4213, 4231, 4312, 4321. We could generate each key, apply it to the ciphertext and see if we get an English plaintext.
  • BIGRAM FREQUENCY: for each of the 4 columns, we tried out 3 pairings to see which gave the best score. So, for column 1, we looked at the pairings of 1-2, 1-3 and 1-4. We then used the best scores to reconstruct the key. We looked, therefore, at 4 x 3 = 12 possibilities.

We are not looking at exactly the same thing with each method, but this will not matter if there is a vast difference between the number of possibilities that we need to look at in each case. Let's look at the equivalent figures for a key length of 12:

  • BRUTE FORCE: there are 12! possible keys to test - 12 x 11 x 10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 479,001,600 keys.
  • BIGRAM FREQUENCY: for each of the 12 columns, we need to try out 11 pairings to see which gives the best score. We need to look, therefore, at 12 x 11 = 132 possibilities.

When the key length is 12, the brute force method requires us to look at nearly 4,000,000 more possibilities than the bigram frequency method!

Cipher Challenge competition    Leave feedback