# Substitution Cipher – how to crack it automatically

Getting a computer to crack a Caesar cipher automatically, as discussed here, is straightforward:

1. Get the computer to generate the 26 different possible versions of the message, using the 26 different shift sizes ...
2. ... and then count the number of times the ETAOIN and VKJXQZ letters occur to generate a score for each message.
3. Pick the message with the highest score from the 26 possible messages.

There are problems with both of the first two steps if trying this strategy with the substitution cipher:

1. As already mentioned, there are many, many more than 26 possibilites with a subsitution cipher. It is simply not practical for a computer to generate them all.
2. The ETAOIN and VKJXQZ method will not work with a substitution cipher. With a Caesar cipher, the message out of the twenty-six possible messages that had the most ETAOIN letters and the fewest VKJXQZ letters was the one most likely to be the correct English version. But, with a substitution cipher, this will not be enough to distinguish it from some of the other messages, and this is because some of the other messages will have the same ETAOIN/VKJXQZ score and yet will still be wrong. Let's say that the computer randomly generates one key from the vast numbers of keys that happens to be very close to the actual key - it has got twenty-four letters correct, but the 'T' and 'A' are the wrong way round. The ETAOIN total for this nearly correct version will be exactly the same as the ETAOIN total for the correct message / key (the VKJXQZ totals will be identical as well). In other words, the computer will be unable to choose between these two possible messages using the ETAOIN scoring system. In fact, the situation could be worse: twenty letters could be correct, but all the ETAOIN letters could have swapped with each other. This message will also get the same ETAOIN score as the actual message!

And this is what the paragraph from the last bullet-point above looks like if the ETAOIN letters have been swapped around (note how the fourth word of 'VKJXQZ' is unchanged):

AHT TAOINE OED VKJXQZ MTAHID WNLL EIA WIRK WNAH O SUBSANAUANIE CNPHTR. WNAH O COTSOR CNPHTR, AHT MTSSOGT IUA IF AHT AWTEAY-SNX PISSNBLT MTSSOGTS AHOA HOD AHT MISA TAOINE LTAATRS OED AHT FTWTSA VKJXQZ LTAATRS WOS MISA LNKTLY AI BT AHT CIRRTCA TEGLNSH VTRSNIE. BUA, WNAH O SUBSANAUANIE CNPHTR, AHT CIMPUATR MOY ROEDIMLY GTETROAT IET KTY FRIM AHT VOSA EUMBTRS IF KTYS AHOA HOPPTES AI BT VTRY CLIST AI AHT OCAUOL KTY - NA HOS GIA AWTEAY-FIUR LTAATRS CIRRTCA, BUA AHT 'A' OED 'O' ORT AHT WRIEG WOY RIUED. AHT TAOINE AIAOL FIR AHNS ETORLY CIRRTCA VTRSNIE WNLL BT TXOCALY AHT SOMT OS AHT TAOINE AIAOL FIR AHT CIRRTCA MTSSOGT / KTY (AHT VKJXQZ AIAOLS WNLL BT NDTEANCOL OS WTLL). NE IAHTR WIRDS, AHT CIMPUATR WNLL BT UEOBLT AI CHIIST BTAWTTE AHTST AWI PISSNBLT MTSSOGTS USNEG AHT TAOINE MTSSOGT. NE FOCA, AHT SNAUOANIE CIULD BT WIRST: AWTEAY LTAATRS CIULD BT CIRRTCA, BUA OLL AHT TAOINE LTAATRS CIULD HOVT SWOPPTD WNAH TOCH IAHTR. AHNS MTSSOGT WNLL OLSI GTA AHT SOMT TAOINE SCIRT OS AHT OCAUOL MTSSOGT!

As you can see, with only six letters out of place (albeit the six most important letters ...), the message appears to be near-gibberish - see how much of it you can still make sense of! We will need, therefore, to develop some alternative strategies for cracking a substitution cipher on a computer.

## 1. Coping with the problem of too many possibilities

Rather than randomly working through all the possible ways in which the 26 letters of the alphabet can be arranged, we will proceed as follows:

1. Give the computer an initial 'educated guess' as to what might be the answer. Call this the 'bestAnswer'.
2. Get the computer to make a small random adjustment to this 'bestAnswer', e.g. by randomly swapping two of the letters in the key / alphabet.
3. Compare the score for this new message with the original guess. Whichever is higher is the 'bestAnswer'. [We will deal with the problem of how to work out the score in section 2 below.]
4. Repeat steps 2 and 3 until the computer completes a run of (say) 1000 comparisons in which the 'bestAnswer' solution remains unchanged. At this point, it may have got very close to the actual answer, perhaps even the actual answer itself. If the message is sufficiently long (e.g. a few sentences) then there is a good chance that it will have got the actual solution, although occasionally it may get some of the least frequently occurring letters (e.g. 'J' and 'Z') the wrong way round, but these can be quickly corrected manually.

This method tries to steer the computer intelligently through the many possibilities - every change that leads to a decrease in the score is rejected, whereas every positive change is accepted. Future positive changes will now lead to an even greater positive score and so on. It is not perfect, and the analogy of climbing mountains is often used to explain why. If somebody wants to climb as high as possible, then to tell them to move only in a direction which leads to an increase in their height will not work very well. They may get to the top of the nearest mountain, but to get even higher they will have to go in a negative direction (i.e. downhill) first. The method above may fail for this reason: the initial answer may be so badly wrong that it will need to be changed drastically, even in ways which will initially reduce its score, before real progress can be made. However, the way this website cracks substitution ciphers is built on roughly this algorithm and, if you experiment with it, you will see that it is largely successful, especially for messages of a few sentences or more.

Before we look at how to calculate the score in step 3, we need to deal with the problem of generating an initial educated guess.

#### Generating an initial 'educated guess'

A logical initial 'guess' is to draw up a solution based on the letter frequencies in the encoded message. Let's say that the six most frequently occurring letters in the encoded message are JNWFDL, with J being the most frequent letter. Our initial guess assumes that the original message in English perfectly resembles the letter frequencies in typical English. In the section on letter frequencies, we saw that the six most frequently occurring letters in typical English are ETAOIN. So, for this example, we assume that E has been substituted for a J, T for an N, A for a W, O for an F, I for a D, and N for an L. In fact, there is no need to stop at six letters: we can match the 26 letter frequencies in the encoded message with all 26 frequencies of typical English (e t a o i n s r h l d c u m f p g w y b v k x j q z).

This initial guess will probably not be exactly right (it is likely to be 'more right' for longer messages), but it should be fairly close, and the process described in steps 2, 3 and 4 above will get the computer even closer still - and often to exactly the right answer.

This website makes it easy for you to see the letter frequencies for any message that you enter. Navgiate to either the Create Codes or Crack Codes pages, enter a message in the Input Box and then either press TAB or click somewhere else on the screen. The letter frequencies will appear automatically in the sidebar on the left of the screen.

For further information about letter frequenices, see this article on Wikipedia.

## 2. Working out a score for each possible message

The method described above will only work if the computer can reliably score each possible message. As the computer makes a small adjustment to its current 'bestAnswer' by randomly swapping two letters, it will lead to a slightly different message. Our scoring system needs to let the computer know whether this is a slight change for the better or not.

For the reasons given above, the ETAOIN scoring system will not work. A scoring system based on loading the words from a dictionary will not work much better either. This would award points if a whole word is identified, but this might happen accidentally with incorrect solutions and, if our initial guess is some way from the actual solution, then it may not be able to tell us whether the computer is heading in the right direction or not, and this is because any adjustments might not increase or decrease the number of words in the solution. The computer, therefore, will be making small changes blindly, unaware of whether it is homing in on the right solution or not. If the cipher text has also hidden the original word lengths then this will also add to the complexity of implementing this method.

Fortunately, there is another way we can use and it is, again, based on letter frequencies. Rather than thinking about individual letter frequencies (i.e. e t a o i n s r h l d c u m f p g w y b v k x j q z), let us concentrate on the frequencies of bigrams (two letter pair, e.g. 'DN') in English.

With a 26 letter alphabet, there are 26 x 26 = 676 bigrams in total: AA, AB, AC ... AZ, BA, BB, BC ... BZ ... ZX, ZY, ZZ. Imagine that a computer has been fed hundreds of thousands of pages of English, with all the spaces and punctuation between the words removed, and it was told to count up how often each of these 676 bigrams appeared. We could generate a list that started with the most frequent ones: TH, HE, IN, ER, AN ..., and finished with the rarest ones (e.g. ZJ, QX).

The computer can use this list to work out a score for each possible message by working through each bigram in the message and giving it points based on its tally in the bigram list.

Because the bigram list has been generated based on a large of sample of text without any spaces, we can use it on an encoded message which has had its word spacings removed. Also, because it is focused on small chunks of text, rather than whole words, it should be sensitive enough to work out whether even a single swap of two letters is a step in the right direction or not: if the swap leads to more 'English' bigrams in the solution, then this is likely to be a step in the right direction.

Instead of using bigrams, you could also use trigrams (3 letter sequences). There are obviously many more of these: 26 x 26 x 26 = 17,576. You could even try quadgrams (4 letter sequences): there are 26 x 26 x 26 x 26 = 456,976 possible quadgrams.

Whether bigrams, trigrams or quadgrams are used, some thought will need to be given as to how to optimise your code so that it can calculate these scores efficiently. Whenever a computer is used to crack a cipher, it will work through a huge number of possible solutions and, for each of these, it will work out the score, so an efficient scoring algorithm is vital.

For example, let's say the cipher text is 401 characters long. This message contains 400 bigrams: the first one starts at the first character, the second one (overlapping with the first one) starts at the second character, the third at the third character, and so on, all the way up to the penultimate character. The only place where a bigram does not start is at the very end, with the last character. For each of these 400 bigrams, the list of 676 bigram scores will need to be consulted to find the points for that bigram, and then the points for all 400 bigrams need to be added together to generate the overall score. Let's say that your computer will need to check c.10,000 possible solutions to crack a certain cipher. This will mean that it will need to find out the points for 400 x 10,000 = 4,000,000 bigrams! Hence the need for a speedy scoring algorithm ...

The scoring process that I have used in the automatic code-cracking section of this website is based on the method described on this page. It is not perfect, and this is because of the 'hill-climbing' problem described above (although in the background I have implemented some other code that sometimes gets around this problem), but, if you experiment with it, you will see that it is often very successful, especially once the messages are a paragraph or longer in length.

Cipher Challenge competition    Leave feedback