Getting a computer to crack a Caesar cipher automatically, as discussed here, is straightforward:
There are problems with both of the first two steps if trying this strategy with the substitution cipher:
And this is what the paragraph from the last bullet-point above looks like if the ETAOIN letters have been swapped around (note how the fourth word of 'VKJXQZ' is unchanged):
As you can see, with only six letters out of place (albeit the six most important letters ...), the message appears to be near-gibberish - see how much of it you can still make sense of! We will need, therefore, to develop some alternative strategies for cracking a substitution cipher on a computer.
Rather than randomly working through all the possible ways in which the 26 letters of the alphabet can be arranged, we will proceed as follows:
This method tries to steer the computer intelligently through the many possibilities - every change that leads to a decrease in the score is rejected, whereas every positive change is accepted. Future positive changes will now lead to an even greater positive score and so on. It is not perfect, and the analogy of climbing mountains is often used to explain why. If somebody wants to climb as high as possible, then to tell them to move only in a direction which leads to an increase in their height will not work very well. They may get to the top of the nearest mountain, but to get even higher they will have to go in a negative direction (i.e. downhill) first. The method above may fail for this reason: the initial answer may be so badly wrong that it will need to be changed drastically, even in ways which will initially reduce its score, before real progress can be made. However, the way this website cracks substitution ciphers is built on roughly this algorithm and, if you experiment with it, you will see that it is largely successful, especially for messages of a few sentences or more.
Before we look at how to calculate the score in step 3, we need to deal with the problem of generating an initial educated guess.
A logical initial 'guess' is to draw up a solution based on the letter frequencies in the encoded message. Let's say that the six most frequently occurring letters in the encoded message are JNWFDL, with J being the most frequent letter. Our initial guess assumes that the original message in English perfectly resembles the letter frequencies in typical English. In the section on letter frequencies, we saw that the six most frequently occurring letters in typical English are ETAOIN. So, for this example, we assume that E has been substituted for a J, T for an N, A for a W, O for an F, I for a D, and N for an L. In fact, there is no need to stop at six letters: we can match the 26 letter frequencies in the encoded message with all 26 frequencies of typical English (e t a o i n s r h l d c u m f p g w y b v k x j q z).
This initial guess will probably not be exactly right (it is likely to be 'more right' for longer messages), but it should be fairly close, and the process described in steps 2, 3 and 4 above will get the computer even closer still - and often to exactly the right answer.
This website makes it easy for you to see the letter frequencies for any message that you enter. Navgiate to either the Create Codes or Crack Codes pages, enter a message in the Input Box and then either press TAB or click somewhere else on the screen. The letter frequencies will appear automatically in the sidebar on the left of the screen.
For further information about letter frequenices, see this article on Wikipedia.
The method described above will only work if the computer can reliably score each possible message. As the computer makes a small adjustment to its current 'bestAnswer' by randomly swapping two letters, it will lead to a slightly different message. Our scoring system needs to let the computer know whether this is a slight change for the better or not.
For the reasons given above, the ETAOIN scoring system will not work. A scoring system based on loading the words from a dictionary will not work much better either. This would award points if a whole word is identified, but this might happen accidentally with incorrect solutions and, if our initial guess is some way from the actual solution, then it may not be able to tell us whether the computer is heading in the right direction or not, and this is because any adjustments might not increase or decrease the number of words in the solution. The computer, therefore, will be making small changes blindly, unaware of whether it is homing in on the right solution or not. If the cipher text has also hidden the original word lengths then this will also add to the complexity of implementing this method.
Fortunately, there is another way we can use and it is, again, based on letter frequencies. Rather than thinking about individual letter frequencies (i.e. e t a o i n s r h l d c u m f p g w y b v k x j q z), let us concentrate on the frequencies of bigrams (two letter pair, e.g. 'DN') in English.
With a 26 letter alphabet, there are 26 x 26 = 676 bigrams in total: AA, AB, AC ... AZ, BA, BB, BC ... BZ ... ZX, ZY, ZZ. Imagine that a computer has been fed hundreds of thousands of pages of English, with all the spaces and punctuation between the words removed, and it was told to count up how often each of these 676 bigrams appeared. We could generate a list that started with the most frequent ones: TH, HE, IN, ER, AN ..., and finished with the rarest ones (e.g. ZJ, QX).
The computer can use this list to work out a score for each possible message by working through each bigram in the message and giving it points based on its tally in the bigram list.
Because the bigram list has been generated based on a large of sample of text without any spaces, we can use it on an encoded message which has had its word spacings removed. Also, because it is focused on small chunks of text, rather than whole words, it should be sensitive enough to work out whether even a single swap of two letters is a step in the right direction or not: if the swap leads to more 'English' bigrams in the solution, then this is likely to be a step in the right direction.
Instead of using bigrams, you could also use trigrams (3 letter sequences). There are obviously many more of these: 26 x 26 x 26 = 17,576. You could even try quadgrams (4 letter sequences): there are 26 x 26 x 26 x 26 = 456,976 possible quadgrams.
Whether bigrams, trigrams or quadgrams are used, some thought will need to be given as to how to optimise your code so that it can calculate these scores efficiently. Whenever a computer is used to crack a cipher, it will work through a huge number of possible solutions and, for each of these, it will work out the score, so an efficient scoring algorithm is vital.
For example, let's say the cipher text is 401 characters long. This message contains 400 bigrams: the first one starts at the first character, the second one (overlapping with the first one) starts at the second character, the third at the third character, and so on, all the way up to the penultimate character. The only place where a bigram does not start is at the very end, with the last character. For each of these 400 bigrams, the list of 676 bigram scores will need to be consulted to find the points for that bigram, and then the points for all 400 bigrams need to be added together to generate the overall score. Let's say that your computer will need to check c.10,000 possible solutions to crack a certain cipher. This will mean that it will need to find out the points for 400 x 10,000 = 4,000,000 bigrams! Hence the need for a speedy scoring algorithm ...
The scoring process that I have used in the automatic code-cracking section of this website is based on the method described on this page. It is not perfect, and this is because of the 'hill-climbing' problem described above (although in the background I have implemented some other code that sometimes gets around this problem), but, if you experiment with it, you will see that it is often very successful, especially once the messages are a paragraph or longer in length.