Notes on the D'Agapeyeff Cipher

Robert A. J. Matthews
r.matthews(AT)physics.org


1. Origin
The cipher appears in  Codes and Ciphers by Alexander D'Agapeyeff (Ist Edition Oxford University Press 1939,
p 158), an elementary textbook on cryptology covering most classical encryption methods. It is set as an
exercise for the reader on p 158, the final page of the final chapter, which covers methods of decipherment. It
simply says: "Here is a cryptogram upon which the reader is invited to test his skill". The cipher was deleted
from later editions of the book.

2. Form of cipher
As published, it appears as 79 five-digit groups, (395 digits in all), beginning 75628 28591 62916...and ending
.....58162 92000 (see Appendix). The last three digits appear to be filler nulls, so that there are in fact 392
digits. There are absolutely no clues to its construction (substitution, transposition, both, nihilist, Vigenere etc).
Nor are there any clues to its contents (e.g. possible words).

3. Current status
It remains unbroken, and is arguably the single most famous unbroken public-domain cipher not considered to
be a hoax (such as the Beale cipher).

4. Past attempts to break the cipher
Very few articles have appeared on attempts to break the cipher: two in The Cryptogram magazine 1952 and
1959, and one in Cryptologia in 1978. The latter makes some useful observations:

(a). If the last three digits are nulls, then the 392 remaining digits divide in 192 two-digit pairs, and 192 is 14 x
14, strongly suggesting some form of square-matrix transposition has been imposed.
(b). The 196 two-digit pairs are grouped such that the first digit is 6,7,8,9 or 0, while the second digit is
1,2,3,4,5 - e.g. in the extract given above, we have 75 62 82 85 91 etc. This allows 25 different combinations,
strongly suggestive of a substitution of two-digit numbers for each letter of the alphabet, with (say) ij combined
(as is often done). The encipherment would then be as a 5 x 5 matrix, with the letters written into it as the
elements, and the digits read off as their coordinates, for example:

1         2         3         4        5
6         P         R         F         N         C          so that 61 = "P" etc
7         Y         B (etc)
8
9
0

(c). My own work strengthens the evidence for this "two digits = one letter" rule:
the 9 most frequent two-digit pairs make up 74 per cent of the total in the D'Agapeyeff cipher, compared to an
expected value of 70 per cent from English texts. Some mathematical analysis I've done also suggests that the
substitution is a simple one (i.e. not a Vigenere etc). The relative flatness of the frequency distribution of the
digits suggests, however, that some unusual words have been used in the cipher, with perhaps even complete
suppression of all "Es". This would certainly help explain the difficulty the cipher has presented amateurs.
(d). In the search for any other clues, I tracked down the son of D'Agapeyeff (also named Alexander, and a
former president of the British Computer Society). He does recall many amateurs contacting his father showing
him their (unsuccessful) efforts, and his father's embarrassment by the fuss caused. However, D'Agapeyeff
junior does not believe it was a hoax. That said, he could offer absolutely no help on its construction or
contents.
(e) It seems advisable to begin any attack by considering initially only the cipher systems mentioned in
D'Agapeyeff's book, before proceeding to others (eg Mirabeau, suggested by some).

5. Possible computational attacks
To fix ideas, assume that the D'Agapeyeff really is a substitution followed by a 14 x 14 transposition. There are
then 14! = 9 x 10^10 possible arrangements of the columns of the ciphertext. Similarly, as there are 18 different
digit pairs in the D'Agapeyeff, then assuming they represent 18 of the 25 effective letters of the alphabet, we
have a search space for the substitution of order
25!
----------  =     5 x 10^5
18! x 7!
The total solution space is thus of the order

(9 x 10^10) x (5 x 10^5) = 5 x 10^16

(a) Brute force attack
This total solution spaces is a challenge even for distributed computing techniques (Paul Leyland, Microsoft
Research, Cambridge, Private Comm 4/9/03), as searching (half of) 5 x 10^16 keys at 1 s per computation =
800 machine-years of computation. This is a major time investment for a trivial problem, especially given that
the assumptions underlying the keyspace size calculation may not be correct.  
The transposition element is brute-forcible (searching half of 9 x 10^10 keys at 1 mu-s per computation = 1
machine-day of computation) but the problem here is that there is no clearly unambiguous way to distinguish
the right sequence from the wrong one without detecting English in the plaintext - and that, of course, requires
that the substitution element has been previously broken. That said, it may be possible to shrink the
transposition problem quite considerably by using the Turingesque approach of terminating the search of a
certain part of the transposition keyspace as soon as an inconsistency emerges. In the case of transpositions,
this could be triple repeats - that is, occurrences of digit pairs such as 81 81 81. Triple repeats of letters are
extremely rare in English, and none would be expected in a text of the length of the D'Agapeyeff. (It may be
worth calculating how many triple repeats one would expect in the ciphertext by chance, to gauge the
effectiveness of this strategy - a high probability of accidental triple repeats would make it very worth while).
Another possibility, suggested by Lee Bohan, formerly of Aston University, is to use a dictionary-based attack
on the transposition, by generating a set of possible transposition sequences on the assumption that they were
generated from real words in English. Such a set could be rapidly generated from an electronic dictionary. To
cover the possibility that the transposition may be 7 x 28 rather than 14 x 14, he suggests including words of 7
letters and longer in the search. Even so, this technique shares the handicap of all brute-forcing attacks, apart
from the triple-letter method : both elements of the encipherment - the transposition and the substitution - have
to be solved simultaneously.

(b) Search algorithms
A more sophisticated approach is, of course, to use search algorithms such as genetic algorithms.
Such an attack requires a fitness function, and this would come from an analysis of the language structure of
each attempted solution. As transpositions scramble columns of text, single-letter frequencies are not sufficient:
bigram frequencies have to be used. The fitness function would then consist of counting the number of pairs of
letters of the form TH, AN, ER, RE, NE etc created, and comparing those with the number expected from a text
of the length of the D'Agapeyeff. Following from the argument above, it would also be worth incorporating
punishment for solutions leading to the creation of triple repeats.
A GA-based attack has recently been attempted by Antrobus at Keele University, and a paper is being
prepared by Antrobus, Rugg and myself reporting on the outcome.  I hope to include more details on this
shortly.

6. Concluding remarks
The D'Agapeyeff cipher represents an interesting challenge for search algorithms and/or distributed computing
methods, which should surely make short work of the likely solution space.


APPENDIX

The D'Agapeyeff Cipher
75628 28591 62916 48164 91748 58464 74748 28483 81638 18174
74826 26475 83828 49175 74658 37575 75936 36565 81638 17585
75756 46282 92857 46382 75748 38165 81848 56485 64858 56382
72628 36281 81728 16463 75828 16483 63828 58163 63630 47481
91918 46385 84656 48565 62946 26285 91859 17491 72756 46575
71658 36264 74818 28462 82649 18193 65626 48484 91838 57491
81657 27483 83858 28364 62726 26562 83759 27263 82827 27283
82858 47582 81837 28462 82837 58164 75748 58162 92000



Frequency distribution of number pairs

Pair                 81 62 75 82 85 64 83 74 63 91 65 84 72 92 93 71 94 04
No.   
              20 17 17 17 17 16 15 14 12 12 11 11   9   3   2   1   1   1
Freq (%)         10  9    9   9   9   8   8   7   6   6   6   6   5 1.5  1  .5  .5  .5