CipherTools

Caesar Cipher – How to automatically crack it

A problem: distinguishing English from nonsense

The first stage of getting a computer to automatically crack a Caesar cipher is obvious: get it to produce 26 messages, based on the 26 different shift sizes.

The second stage seems harder: how do we get the computer to work out which of the 26 messages is the correct English version? 25 of them will obviously be nonsense, but how does a computer know which ones are nonsense?

One answer immediately presents itself: give the computer access to a large dictionary file, i.e. a file containing thousands of English words. The computer can then check each of the 26 possible messages to see which ones contains the most English words. There are two (not insurmountable) problems with this:

  1. Getting hold of a suitable dictionary file and then loading it into our program;
  2. Working out how many English words are in each of the messages, especially, as sometimes happens with coded messages, all the spaces have been removed between words in order to disguise the original message more completely.

The first problem can be solved by browsing the Internet - such files are fairly easy to come across. The second problem can be solved through some fiddly coding. Fortunately, there is a quicker and more elegant way, and it relies on letter frequencies.

Letter frequencies

In any human language, there are certain letters that appear more frequently than others. In English, for example, the letter 'E' is, on average, the most commonly used letter, whereas 'Z' is typically the least commonly used letter.

In fact, in English, on average, the six most commonly used letters are ETAOIN, and the six least commonly used letters are VKJXQZ

This website makes it easy for you to see the letter frequencies for any message that you enter. Navgiate to either the Create Codes or Crack Codes pages, enter a message in the Input Box and then either press TAB or click somewhere else on the screen. The letter frequencies will appear automatically in the sidebar on the left of the screen.

For further information about letter frequenices, see this article on Wikipedia.

In Cracking Codes with Python by Al Sweigart, this fact about letter frquencies in English is used to generate a simple way of getting a computer to recognise English text:

  1. Generate all 26 messages based on the 26 different shift sizes
  2. For each message, work out a score as follows:
    • Every time one of the ETAOIN letters occurs in the output text, add 1 to the score
    • Every time one of the VKJXQZ letters occurs in the output text, subtract 1 from the score
  3. The message with the highest score is the one that is most likely to be the message that is in English (i.e. not nonsense)

The beauty of this method is that it does not require a dictionary or even spaces between the letters to help identify the actual words. It just relies on the fact that, in any given chunk of English text, the most commonly occurring letters tend to be ETAOIN, whereas the least commonly occurring letters tend to be VKJXQZ.

It is not guaranteed to work every time, especially if the original message avoids the ETAOIN letters, e.g. "Fuzzy jug", but once your message is more than a few words long, it will have a very high success rate.

Python version

LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def ApplyCaesarCipher(message, shiftSize):

outputText = ""
# Convert message to uppercase.
message = message.upper()
# Work through each character in the message.
for position in range(0, len(message)):

letter = message[position]
# Ignore all characters that are not letters
if letter in LETTERS:

# Turn letter into number, and subtract 65 (A = 65 in ASCII, Z = 90).
letterValue = ord(letter) - 65
# Add shift to letterValue and apply modular division
letterValue = letterValue + shiftSize
letterValue = letterValue % 26
# Convert result back to a letter, remembering to add 65 back on again first
newLetter = chr(letterValue + 65)
# Add new letter to output text.
outputText = outputText + newLetter

else:

# Non-letters can be added to output text without any changes being made.
outputText = outputText + letter

return outputText

def FindScore(message):

FREQUENTLETTERS = "ETAOIN"
INFREQUENTLETTERS = "VKJXQZ"
score = 0
for letter in message:

if letter in FREQUENTLETTERS:

score = score + 1

elif letter in INFREQUENTLETTERS:

score = score - 1

return score

message = ""
# Keep prompting the user until they enter a message
while message == "":

message = input("\n\nPlease enter the message to crack:\n")

shiftSize = 0
bestScore = FindScore(message)
bestShift = 0

# Loop through all 26 shifts and find score for each message
while shiftSize < 26:

outputText = ApplyCaesarCipher(message, shiftSize)
score = FindScore(outputText)
if score > bestScore:

bestScore = score
bestShift = shiftSize

shiftSize = shiftSize + 1

outputText = ApplyCaesarCipher(message, bestShift)

print("Shift size:\t",bestShift, "\nOutput:\n" + outputText)

Cipher Challenge competition    Leave feedback