On the previous page we saw that a substitution cipher could be very easy to crack, especially if punctuation and the original word sizes are retained. The latter in particular is helpful as it lets you spot single letter words (which are probably 'A' or 'I' in the plain text), or the cipher text version of 'THE', the most common 3-letter word in English.
There is an obvious defence against such tactics: remove all spaces between words, and punctuation as well. Once the spaces between words have been removed the message can be broken up into chunks of 4 or 5 letters for ease of communication (see this example from Bletchley Park). Let's see if this really does make things more difficult.
At first glance, this method seems to have worked: gone are the easily identifiable single letter words or possible candidates for 'THE'.
We will certainly have to put to one side a search for the single letter words, but this method will not prevent us looking for 'THE's. The reason for this is that, even with all spaces removed between words, the most common 3-letter sequence or trigram in English is still 'THE'.
Again, this website will do this automatically for you. Copy the message above, click here and paste it into the top box, then click somewhere else on the screen. The side panel on the left will then show the statistics for the most frequently occurring letters, including the most frequently occurring trigram.
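If you would like to see roughly what such a count involves, here is a minimal sketch in Python. It is not the code behind this website: the function name `ngram_counts` is made up for illustration, and the sample text is a short plain-English stand-in (not the message above), chosen only to show that 'THE' still stands out once the spaces are gone.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count overlapping n-letter sequences, ignoring spaces and punctuation."""
    joined = "".join(c for c in text.upper() if c.isalpha())
    return Counter(joined[i:i + n] for i in range(len(joined) - n + 1))

# Hypothetical sample: "THE CAT SAT ON THE MAT BY THE DOOR",
# with spaces removed and regrouped into chunks of 5 letters.
sample = "THECA TSATO NTHEM ATBYT HEDOO R"

trigrams = ngram_counts(sample, 3)
print(trigrams.most_common(1))  # [('THE', 3)]
```

Calling `ngram_counts(sample, 2)` gives the bigram counts in the same way.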
If you do this, you will see that the most frequently occurring trigram is 'ZHW' (14 appearances in the message), followed by 'KQM' and 'UIW' (6 appearances each). On the previous page, we saw that, on average, 'THE' typically appears 2.5 times more frequently than the next most frequent trigram in English. 14 is very close to 2.5 × 6 = 15, so it looks like we have already worked out 3 letters: ZHW >>> THE.
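The same comparison can be written as a short calculation. This is only a sketch: the three counts are the ones quoted above, the figure of 2.5 is the one from the previous page, and the helper name `dominance_ratio` is invented for this example.

```python
def dominance_ratio(counts):
    """Ratio of the most frequent count to the second most frequent count."""
    ranked = sorted(counts.values(), reverse=True)
    return ranked[0] / ranked[1]

# Trigram counts reported for the message on this page.
observed = {"ZHW": 14, "KQM": 6, "UIW": 6}

ratio = dominance_ratio(observed)
print(f"{ratio:.2f}")  # 2.33, close to the typical 2.5 for 'THE'
```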
The next stage is the most difficult and will probably require some trial and error. The code-cracking part of this website makes it easy for you to experiment in this way. Copy the message from the top of this page into the top box on the code-cracking page, select 'Crack substitution cipher' underneath and then enter letters in the boxes that appear below (starting with ZHW >>> THE). You can add some new ones and easily remove any that do not seem to work.
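Behind the scenes, this kind of experimenting amounts to applying a partial key and leaving everything else blank. The sketch below is one way it might be done, not this website's actual code: the function name `apply_partial_key` is an assumption, and the six-letter fragment is just an illustration. Known letters are shown in lower case and unknown ones as '-'.

```python
def apply_partial_key(ciphertext, key):
    """Replace known cipher letters with lower-case plain letters, others with '-'."""
    out = []
    for ch in ciphertext.upper():
        if ch.isalpha():
            out.append(key.get(ch, "-").lower())
        else:
            out.append(ch)  # keep spaces between the chunks as they are
    return "".join(out)

key = {"Z": "T", "H": "H", "W": "E"}      # ZHW >>> THE
fragment = "KZHZHK"                       # illustrative fragment only

print(apply_partial_key(fragment, key))   # -thth-

# Trying a guess is just a matter of adding it to the key ...
key["K"] = "I"
print(apply_partial_key(fragment, key))   # iththi

# ... and removing it again if it does not seem to work.
del key["K"]
```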
I tried a few letter substitutions, some of which did not work, before focusing on the top row in the screenshot above (about a third of the way through the message), particularly the sequence of letters '-thth-'. The only way to make sense of this sequence is to assume that there should be a space in the middle of it: ...th th... . Notice also that the cipher text letter immediately before and after it is 'K'. There are only a few possible plain text letters that it could be: the remaining vowels (A, I, O, U) or 'R' (some words end in 'rth', e.g. worth). 'U' seems unlikely (how many words end in '...uth', apart from 'truth'?).
It does not take long to see what appears when you try the others (see info. box above for instructions about how to do so on this website). A bit of guesswork or experimentation at this stage is fine, but we can actually assist our guesswork through further study of letter frequencies. We have looked at trigrams, but bigrams (sequences of 2 letters) are also very useful.
We are trying to work out the plain text letter for 'K' in the cipher text. If you look at the screenshot listing the bigrams and trigrams from the message generated by this website, then you will see that the fourth most frequent bigram in this message is 'KQ'. We do not know what 'Q' is, although we know that it is not any letter from THE. So 'KQ' = 'one of A, I, O or R' followed by a letter that is not from THE. If you look down the chart of the most frequently occurring bigrams on Wikipedia (or, better still, look here), then the first bigram that satisfies these restrictions is 'IN', so let's try KQ >>> IN. When we do, we get this near the end of the message:
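The filtering step we just did by eye can be sketched in a few lines of Python. Note the assumption: the list below is only the first ten entries of one commonly quoted ordering of English bigrams, and the exact order varies a little from one source to another, so treat it as approximate.

```python
# Approximate ranking of common English bigrams (assumed, varies by source).
COMMON_BIGRAMS = ["TH", "HE", "IN", "ER", "AN", "RE", "ON", "AT", "EN", "ND"]

first_options = set("AIOR")   # what 'K' could be
already_used = set("THE")     # what 'Q' cannot be

candidates = [
    bigram for bigram in COMMON_BIGRAMS
    if bigram[0] in first_options and bigram[1] not in already_used
]

print(candidates)  # ['IN', 'AN', 'ON']
```

'IN' comes out first, which is why KQ >>> IN is the natural thing to try. If it had not worked, 'AN' and 'ON' would have been the next guesses.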
The phrase 'the Internet' is almost complete, so now we have G >>> R.
You may want to try finishing off this message. A clue for what to do next: we still haven't found the plain text letters 'A' and 'O', which are usually the third and fourth most frequently occurring letters in English. Enter all the letters that we have discovered so far, and then use the letter statistics in the sidebar panel to identify the most likely cipher text letters that might be 'A' or 'O'. As it happens, one of them is in the same place as for typical English, but the other is not ... Experiment!
Once you have got these, you will notice that we have found ETAOIN, the six most frequently occurring letters in typical English, but that the second most frequently occurring letter in this ciphertext ('B') has still not been identified. What could that be? A second clue: the top 10 most frequently occurring letters in typical English are ETAOINSHRD ... . Once you have identified 'B' you should be able to crack the rest of the message, or you could cheat and press the 'Crack substitution cipher' button!
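If you want to organise this last piece of guesswork, one possible approach is to list the cipher text letters you have not yet identified, most frequent first, next to the common plain text letters you have not yet used. The sketch below assumes the six substitutions found so far on this page; the helper names and the tiny sample text are made up, so paste in the real message to get real counts.

```python
from collections import Counter

ENGLISH_ORDER = "ETAOINSHRD"  # top 10 letters in typical English

def unresolved_by_frequency(ciphertext, key):
    """Cipher letters not yet in the key, most frequent first, with counts."""
    counts = Counter(c for c in ciphertext.upper() if c.isalpha())
    return [(c, n) for c, n in counts.most_common() if c not in key]

def unassigned_plain(key, order=ENGLISH_ORDER):
    """Common plain text letters that no cipher letter maps to yet."""
    used = set(key.values())
    return [p for p in order if p not in used]

# Substitutions worked out so far on this page.
key = {"Z": "T", "H": "H", "W": "E", "K": "I", "Q": "N", "G": "R"}

print(unassigned_plain(key))  # ['A', 'O', 'S', 'D']

# Hypothetical sample, NOT the real message.
sample = "BBBXX WZ"
print(unresolved_by_frequency(sample, key))  # [('B', 3), ('X', 2)]
```

The two lists will not line up exactly, as the clue above warns, but they narrow down what is worth trying.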
There is real satisfaction to be gained from manually cracking a substitution cipher. Tools like the ones on this website take away the chore of having to actually count letter frequencies and let you concentrate on the actual deciphering of the message. But what if we wanted to get a computer to crack a substitution cipher for us? How would we program it to do that?