Cryptography

INTRODUCTION

The word cryptography means "hidden writing". Cryptography is one part of cryptology, the study of 'that which is hidden'. The other part is cryptanalysis, the deciphering of hidden writing.

Cryptography has probably been in use for as long as writing itself, certainly for many thousands of years, whenever one party wanted to send another party a message that no one else could read.

Until recently, secret messages were mostly the province of the government and the military. However, with the advent of computers and particularly the internet, cryptography is now widely used for all types of transactions that each of us performs daily. And it is not confined to written messages. It is used in mobile telephone conversations, in electronic financial transactions and in many other facets of life.

In this brief discussion we will confine ourselves to messages that originate from some intelligence and that use a known language such as English. Languages that don't use the Roman alphabet can of course be encoded, but they are first reduced from language symbols into numbers, and in the end usually binary numbers involving only two symbols, one and zero, that can be represented in a computer memory or transmission medium as an off or on state.

ENCODING I

Consider the writing below:

Some might say that it looks all Greek to them, and indeed the text is formed from the letters of the Greek alphabet. However, the language is not Greek. It is in fact English, encoded in the Greek alphabet. It is a short passage from the novel "The Stars, Like Dust" by Isaac Asimov.

This is one of the easiest ways to encode a text message. All you have to do is set the font in your Word document to 'Symbol' and the message you had typed or imported into the word processor will magically be encoded with the Greek letters. Now that you know how the encoding was accomplished you should have no difficulty in decoding the message. In fact, even without this knowledge some people might be able to decode the message. It is certainly not a secure way of encoding plain text.

SUBSTITUTION

Probably the most widely used method of encoding a message is that of substitution. For each letter a symbol is used. That symbol may be a different letter of the alphabet, or it may be a symbol that has nothing to do with the alphabet.

The most common alphabetic substitution is to replace a letter with another letter several letters later in the alphabetic sequence. If we choose a spacing of 5 then 'a' becomes 'f', 'b' becomes 'g', 'c' becomes 'h' and so on. We regard the alphabetic sequence as a loop and so 'v' becomes 'a', 'w' becomes 'b', 'x' becomes 'c', 'y' becomes 'd' and 'z' becomes 'e'. And so the plain text:

this is not a secure method of encoding

becomes

ymnx nx sty f xjhzwj rjymti tk jshtinsl

You might eliminate the spaces and send the encoded message as:

ymnxnxstyfxjhzwjrjymtitkjshtinsl

but you still do not have a secure method of encoding, for a reason we will mention a little later.

The decoding of the latter message is done simply by replacing each letter in the code with a letter 5 back in the alphabet. This will produce the 'plain' text:

thisisnotasecuremethodofencoding

which relies on the decoder to insert spaces in the correct locations. Some untrained people might have a problem doing this, but with practice it becomes very easy.
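The shift substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name shift_text is our own, and the sketch assumes lowercase text, leaving spaces and any other characters unchanged.

```python
import string

ALPHABET = string.ascii_lowercase  # 'a'..'z', treated as a loop


def shift_text(text: str, shift: int) -> str:
    """Replace each lowercase letter with the one `shift` places later."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            # The modulo wraps 'v'..'z' back round to 'a'..'e' for a shift of 5.
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)  # leave spaces and other characters untouched
    return "".join(out)


plain = "this is not a secure method of encoding"
code = shift_text(plain, 5)
print(code)                  # the encoded message
print(shift_text(code, -5))  # shifting back by 5 recovers the plain text
```

Decoding is the same operation with the shift reversed, which is why only one function is needed.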

The next step is to eliminate use of alphabetic symbols entirely, and replace each letter with a 'symbol':

Our plain text without spaces then becomes:

which again, to an untrained person seems absolutely indecipherable. However, a trained person can probably decipher it without any aids.

CRYPTANALYSIS

Before we talk any more about encryption we need to discuss some decryption techniques. For texts that have been encoded (or encrypted) by a simple symbol substitution (as above), the trick lies in the occurrence frequency of letters in the alphabet. Each letter of the alphabet appears a certain percentage of times in a large volume of text, and these percentages are fairly constant. The same phenomenon occurs in all 'natural' languages, only the percentages vary. The table below gives these percentages for English language text.

Looking through this table we can see that 'e' is the most frequently used letter by far, followed by 't' and then 'a'. Continuing in this vein we can order the letters by frequency of occurrence and come up with:

e t a o i n s h r d l - c u w m f g y p b v k j x z q

Only the first seven letters are significant in decoding medium size messages, and for short messages the table has to be used with care (for instance, in our message above 'e' and 'o' occur equally often, four times each, while 'i', 'n', 's' and 't' each occur three times). The letters following the hyphen have significantly smaller frequencies and are rarely of use in decoding.
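Counting letter frequencies is the first thing a cryptanalyst's computer would do. The following is a minimal sketch using Python's standard Counter; the function name letter_counts is our own choice. Running it on our short message shows how ties between letters make the frequency table unreliable for small samples.

```python
from collections import Counter


def letter_counts(text: str) -> list[tuple[str, int]]:
    """Return (letter, count) pairs, most frequent first."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    return Counter(letters).most_common()


message = "thisisnotasecuremethodofencoding"
for letter, count in letter_counts(message):
    print(letter, count)
```

The same function works unchanged on an encoded message, since it simply counts whatever symbols it is given.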

Fortunately the English language is redundant enough that we can usually guess each word after we have replaced each of the seven most frequent symbols with its alphabetic letter from the frequency table. And if we have to swap vowels, like 'e' and 'i' or 'a' and 'o', that is usually not a problem. Notice that of the seven most frequent letters, four are vowels. This might suggest that deleting vowels in a message before encoding might help to fool this type of decoding technique. However, some people have no difficulty in reading a message with no vowels. See what you think of the following message:

ths s nt scr mthd f ncdng

which is of course our previous message. However, it becomes more difficult if we remove the spaces. As a hint, we know there are no single letter words 's' or 'f'. And it doesn't take much trial and error to get the right vowels.
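Removing the vowels is easy to automate. The sketch below is an illustration only (strip_vowels is a name we have made up). Note that a word consisting only of vowels, such as 'a', disappears altogether.

```python
VOWELS = set("aeiou")


def strip_vowels(text: str) -> str:
    """Delete every vowel; words that become empty are dropped."""
    words = ("".join(ch for ch in word if ch not in VOWELS)
             for word in text.split())
    return " ".join(word for word in words if word)


print(strip_vowels("this is not a secure method of encoding"))
# prints: ths s nt scr mthd f ncdng
```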

ENCODING II

With an idea of how cryptanalysts decode a message that uses substitution, we can come up with some ideas of how to improve our encoding to defeat them.

One obvious way is to remove the frequency bias in the encoded message, and one simple way to do that is to encode the more frequent letters with more than one symbol, randomly chosen. The number of symbols assigned to a letter will be in proportion to the frequency of that letter.

The easiest way to do this is to round the percentage frequencies into integers and assign this number of symbols to the letter concerned. For instance, we assign the letter 'e' 13 symbols which are chosen randomly each time we have to encode an 'e', nine symbols with which to encode 't', eight symbols for 'a', eight symbols for 'o', seven symbols for 'i' and so on.

One disadvantage with this method is that we need 104 symbols to encode our message. However, remember that it doesn't matter what our symbols are. They might just as easily be numbers (from 1 to 104) because when our cryptanalyst starts work on a message with strange symbols the first step is to replace each symbol with a number. This is particularly so when the cryptanalyst is using a computer. And the reason for this is that computers work with numbers. In fact, this text is being typed into a computer and the computer uses what is called ASCII code to change each of my typed letters into a number (ASCII stands for "American Standard Code for Information Interchange"). It is the numbers (in binary) that are stored in the computer file.
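The conversion of letters into ASCII numbers, and of those numbers into binary, can be seen directly in Python. This is a small illustration using the built-in functions ord, chr and format; the word 'code' is just a sample.

```python
text = "code"
numbers = [ord(ch) for ch in text]          # ASCII number of each letter
bits = [format(n, "08b") for n in numbers]  # the same numbers as 8 binary digits
print(numbers)  # [99, 111, 100, 101]
print(bits)     # ['01100011', '01101111', '01100100', '01100101']

# Going the other way turns the stored binary numbers back into letters.
print("".join(chr(int(b, 2)) for b in bits))  # code
```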

To encrypt a text message with our 104 symbol key we first need to jumble (or randomise) our key. We might do this by first using an array to represent each letter:

a(8), b(2), c(3), d(4), e(13), f(2), g(2), h(6), i(7), j(1), . . . .

Note that letters with a frequency of less than 1 are assigned only one symbol. Next we assign a unique random number between 1 and 104 to each symbol. For example we might choose a(1)=64, a(2)=43, a(3)=102, ..., b(1)=2, b(2)=75, c(1)=99 and so on. This of course would be done by a computer with a good random number generator.

We then produce the encoded message by also using a random number generator. For instance, when we wish to encode an 'a' we ask the computer for a random number between 1 and 8. Let us assume this is a 3. Then we look up the array element a(3) and find our encoded number is 102. This forms the first symbol of our encoded message. The next time we come to encode an 'a' our random number (between 1 and 8) might be 2, in which case our code symbol is 43. For the letter 'e' we need to ask the random number generator to produce a random number between 1 and 13. For letters with frequencies of 1 or less we do not need a random number. We simply use the sole number in the array (which was previously chosen randomly). Note that this method of encoding does not make the encoded text any longer than the plain text. It simply makes the key longer.
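The whole procedure can be sketched as follows. The symbol counts for a to j, t and o are those given in the text; the counts for the remaining letters are our own assumption, chosen so that the total comes to 104. The function names are also ours, and random.SystemRandom is used simply as one readily available source of random numbers.

```python
import random

# Symbols per letter. Values for a-j, t and o follow the text; the others
# are ASSUMED for this sketch so that the total is 104.
SYMBOLS_PER_LETTER = {
    "a": 8, "b": 2, "c": 3, "d": 4, "e": 13, "f": 2, "g": 2, "h": 6, "i": 7,
    "j": 1, "k": 1, "l": 4, "m": 2, "n": 7, "o": 8, "p": 2, "q": 1, "r": 6,
    "s": 6, "t": 9, "u": 3, "v": 1, "w": 2, "x": 1, "y": 2, "z": 1,
}


def make_key(rng: random.Random) -> dict[str, list[int]]:
    """Jumble the numbers 1..104 and deal them out to the letters."""
    total = sum(SYMBOLS_PER_LETTER.values())
    numbers = list(range(1, total + 1))
    rng.shuffle(numbers)
    key, start = {}, 0
    for letter, n in SYMBOLS_PER_LETTER.items():
        key[letter] = numbers[start:start + n]
        start += n
    return key


def encode(text: str, key: dict[str, list[int]], rng: random.Random) -> list[int]:
    """For every letter, pick one of its symbols at random."""
    return [rng.choice(key[ch]) for ch in text]


def decode(code: list[int], key: dict[str, list[int]]) -> str:
    """Look each number up in the reversed key."""
    reverse = {num: letter for letter, nums in key.items() for num in nums}
    return "".join(reverse[num] for num in code)


rng = random.SystemRandom()  # draws on the operating system's randomness
key = make_key(rng)
code = encode("thisisnotasecuremethodofencoding", key, rng)
print(code)               # 32 numbers, different on every run
print(decode(code, key))  # the original 32 letters
```

Because the choice among a letter's symbols is random, encoding the same message twice with the same key will usually give two different codes, both of which decode correctly.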

We can send out our coded message with the spaces in the same places as in the plain text, or we can incorporate them into our encoding. We do this by noting that the average size of a word in English is 5.5 letters, so on average one character in every 6.5 is a space. Spaces are thus equivalent to a letter with a frequency of about 15% (100 / 6.5). We therefore need to increase our encoding symbol space to ~120 and work out a new frequency table that includes spaces. Note that the addition of spaces will change the frequencies of the other letters.

Unless you know the method of encryption (ie the size of the encryption key and the way it is used), the only way to decode a message which shows a flat frequency spectrum is by brute force. That means trial and error, assigning each possible letter combination until meaningful text appears. For a key of only 120 symbols this is quite simple for a supercomputer (which is now what all good (or bad) cryptanalysts use). You might ask how the computer knows it has decoded to a meaningful text. It simply uses a look-up dictionary of a few thousand meaningful words.
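The idea of brute force checked against a dictionary can be illustrated on a much smaller problem than the 120 symbol key: the simple alphabetic shift, which has only 26 possible keys. In the sketch below the tiny word list stands in for the look-up dictionary, and all names are our own.

```python
import string

ALPHABET = string.ascii_lowercase
# A tiny stand-in for a look-up dictionary of a few thousand words (assumed).
WORDS = {"this", "is", "not", "a", "secure", "method", "of", "encoding",
         "the", "and", "to", "in"}


def shift_text(text: str, shift: int) -> str:
    """Shift every lowercase letter by `shift` places around the alphabet."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text
    )


def score(candidate: str) -> int:
    """Count how many words of the candidate appear in the dictionary."""
    return sum(word in WORDS for word in candidate.split())


def brute_force(code: str) -> str:
    """Try all 26 shifts and keep the one that yields the most real words."""
    candidates = [shift_text(code, -shift) for shift in range(26)]
    return max(candidates, key=score)


secret = shift_text("this is not a secure method of encoding", 5)
print(brute_force(secret))
```

The computer never needs to 'understand' the message. It only needs a score that is high for real text and low for gibberish.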

TRANSPOSITION

Substitution is one method of encryption, and it can be done in many ways. The second main method is transposition. This involves dividing a plain text message up into blocks with a certain number of letters per block. For instance, if we use blocks of 16, then we might think of each block as a 4x4 square array. We then number each cell of the array from 1 to 16 in order as shown below. We then randomly jumble the position of the numbers in the bottom left array.

As our plain text consists of 32 letters (not counting spaces) we will need two blocks to encode the message. It is first written in the two upper right blocks, proceeding in order from left to right and then top to bottom. We then use the encoding positions to position the letters in the two lower right arrays. We can then write our message by reading these two lower blocks left to right and top to bottom. The coded text we get is:

euihnsaretostsicgioeohnnomfdectd

To decode the message we simply divide the code into blocks of 16 and then proceed to rearrange the letters in each block by moving cells back into the positions stated in the lower left numbered array. Note that because we did not make a substitution, the letter frequencies will not be altered and will conform to the frequencies given previously. Because of this it is usual for 'secure' cryptosystems to use both substitution and transposition.
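A block transposition of this kind can be sketched as below. The jumbled numbering ORDER is a hypothetical one chosen for illustration and is not the numbering of the array referred to above, so the coded text it produces will differ from the one quoted. The function names are also our own.

```python
BLOCK = 16
# HYPOTHETICAL jumbled numbering of the 16 cells (a permutation of 1..16).
ORDER = [7, 12, 3, 16, 1, 10, 5, 14, 9, 2, 15, 8, 13, 4, 11, 6]


def transpose(text: str) -> str:
    """Move the letter in cell i of each block to cell ORDER[i]."""
    if len(text) % BLOCK != 0:
        raise ValueError("pad the text to a multiple of 16 letters first")
    out = []
    for start in range(0, len(text), BLOCK):
        block = text[start:start + BLOCK]
        cells = [""] * BLOCK
        for i, ch in enumerate(block):
            cells[ORDER[i] - 1] = ch  # ORDER counts from 1, lists from 0
        out.append("".join(cells))
    return "".join(out)


def untranspose(code: str) -> str:
    """Move each letter back from cell ORDER[i] to cell i."""
    out = []
    for start in range(0, len(code), BLOCK):
        block = code[start:start + BLOCK]
        out.append("".join(block[ORDER[i] - 1] for i in range(BLOCK)))
    return "".join(out)


plain = "thisisnotasecuremethodofencoding"
code = transpose(plain)
print(code)
print(untranspose(code))
```

Each coded block contains exactly the same letters as the plain block, only in a different order, which is why the letter frequencies are unchanged.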

XOR

The basic operations available to a computer are the arithmetic ones of addition, subtraction, multiplication, division, and exponentiation. Computers also find it easy (ie fast) to use the boolean logic operations of AND, OR, NAND, NOR and XOR (exclusive OR). These operations are done on ones and zeroes, or bits, and are defined in the table below:

A  B    AND   OR   NAND   NOR   XOR
0  0     0    0     1      1     0
0  1     0    1     1      0     1
1  0     0    1     1      0     1
1  1     1    1     0      0     0

Of these operators, XOR is the most useful to cryptographers. This is because XOR is a symmetrical operation. You can use it to encode and then to decode. That is, applied once with a key bit it will give a code, and applied again with the same key bit it will give back the original logic 'number'. That is

(0 xor 0) xor 0 = 0 xor 0 = 0
(0 xor 1) xor 1 = 1 xor 1 = 0
(1 xor 0) xor 0 = 1 xor 0 = 1
(1 xor 1) xor 1 = 0 xor 1 = 1

The first and the last symbols are always the same after two operations using the XOR operator. That is, a double application of XOR returns the original value. This can be done on bits but it can also be done on letters or symbols after they have been turned into binary numbers. The XOR operator is thus a very powerful tool for encoding. If the text and the key are the same length the XOR operator can be used without modification of either. That is

(text1) XOR (key) = (code1)
(code1) XOR (key) = (text1)
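In Python the XOR operator is written ^ and works on whole numbers, so it can be applied to the ASCII number of each letter. The following is a minimal sketch: the function name xor_bytes is ours, and the key shown is a placeholder for illustration, not a secure key.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of `data` with the matching byte of `key`."""
    if len(data) != len(key):
        raise ValueError("text and key must be the same length")
    return bytes(d ^ k for d, k in zip(data, key))


text1 = b"thisisnotasecuremethodofencoding"
key = bytes(range(1, len(text1) + 1))  # placeholder key, NOT a secure one
code1 = xor_bytes(text1, key)
print(code1.hex())            # the code, shown in hexadecimal
print(xor_bytes(code1, key))  # applying XOR again restores text1
```

The same function both encodes and decodes, which is the symmetry that makes XOR so convenient.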

This is a very fast way for a computer to code and decode messages. As long as the sender and recipient and no one else have the key, the message is secure. If a third party knows anything about the key (like its length and the method (or random number generator) used to generate it) the message will not be secure. If this is not to happen no details must be made available. The only secure method of exchanging the key is for the two parties (ie individuals) to meet in secret and exchange the key. If more than two individuals are involved the system is compromised. That is because one of the most common methods of breaking a system is for someone to release the details of the method and/or the key. This can be done deliberately (eg say for profit) or it can happen inadvertently (eg when an outsider poses as someone who is on the inside).

As this is not possible in most instances, there will always be a possibility of unintended decoding. Even in systems called public key cryptosystems there will always be a possibility of decryption: a break-in of the computer where the 'secret' password is kept may strike gold, even if the password is encrypted (for example, when the encryption algorithm is also found).

It has been said that a one-time-pad provides the only unbreakable code. This is the same as the last method described above when the key is the same length as the text and a new key is used for every message generated. However, for either method to work, there must be two identical one-time-pads (or two identical keys). If these are not to be compromised in their exchange, only two individuals must be involved: the sender and the recipient. And if they don't physically meet for each exchange they must exchange multiple keys during a single meeting, keys which are then susceptible to discovery by a third party.
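A one-time-pad can be sketched with the XOR operation and a fresh random key for each message. Here Python's secrets module supplies the random bytes; the function names new_pad and apply_pad are our own, and the sketch says nothing about how the two copies of the pad are exchanged, which is the hard part.

```python
import secrets


def new_pad(length: int) -> bytes:
    """Generate a fresh random pad as long as the message."""
    return secrets.token_bytes(length)


def apply_pad(data: bytes, pad: bytes) -> bytes:
    """XOR the data with the pad; the same call encodes and decodes."""
    if len(data) != len(pad):
        raise ValueError("pad must be the same length as the message")
    return bytes(d ^ p for d, p in zip(data, pad))


message = b"this is not a secure method of encoding"
pad = new_pad(len(message))  # used for this one message only
code = apply_pad(message, pad)
print(code.hex())
print(apply_pad(code, pad))  # the recipient's identical pad recovers the message
```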

There have been very many books written about cryptology, a very few of which are suggested in the reference list below.

REFERENCES

Helen Gaines, Cryptanalysis, Dover (1956)

David Kahn, The Codebreakers, Scribner (1996)

Simon Singh, The Code Book, Fourth Estate (1999)

Sarah Flannery & David Flannery, In Code - A Mathematical Journey, Profile Books (2000)

Bruce Schneier, Applied Cryptography, Wiley (1996)

Christopher Swenson, Modern Cryptanalysis - Techniques for Advanced Code Breaking, Wiley (2008)

Kevin D Mitnick & William L Simon, The Art of Deception, Wiley (2002)

ASA Australian Space Academy