Notes on the D'Agapeyeff CipherRobert A. J. Matthewsr.matthews(AT)physics.org1. OriginThe cipher appears in Codes and Ciphers by Alexander D'Agapeyeff (Ist Edition Oxford University Press 1939, p 158), an elementary textbook on cryptology covering most classical encryption methods. It is set as an exercise for the reader on p 158, the final page of the final chapter, which covers methods of decipherment. It simply says: "Here is a cryptogram upon which the reader is invited to test his skill". The cipher was deleted from later editions of the book. 2. Form of cipherAs published, it appears as 79 five-digit groups, (395 digits in all), beginning 75628 28591 62916...and ending .....58162 92000 (see Appendix). The last three digits appear to be filler nulls, so that there are in fact 392 digits. There are absolutely no clues to its construction (substitution, transposition, both, nihilist, Vigenere etc). Nor are there any clues to its contents (e.g. possible words). 3. Current statusIt remains unbroken, and is arguably the single most famous unbroken public-domain cipher not considered to be a hoax (such as the Beale cipher). 4. Past attempts to break the cipherVery few articles have appeared on attempts to break the cipher: two in The Cryptogram magazine 1952 and 1959, and one in Cryptologia in 1978. The latter makes some useful observations: If the last three digits are nulls, then the 392 remaining digits divide in 192 two-digit pairs, and 192 is 14 x (a).14, strongly suggesting some form of square-matrix transposition has been imposed. The 196 two-digit pairs are grouped such that the first digit is 6,7,8,9 or 0, while the second digit is (b).1,2,3,4,5 - e.g. in the extract given above, we have 75 62 82 85 91 etc. This allows 25 different combinations, strongly suggestive of a substitution of two-digit numbers for each letter of the alphabet, with (say) ij combined (as is often done). The encipherment would then be as a 5 x 5 matrix, with the letters written into it as the elements, and the digits read off as their coordinates, for example: 1 2 3 4 5 6 P R F N C so that 61 = "P" etc 7 Y B (etc) 8 9 0 My own work strengthens the evidence for this "two digits = one letter" rule:(c).the 9 most frequent two-digit pairs make up 74 per cent of the total in the D'Agapeyeff cipher, compared to an expected value of 70 per cent from English texts. Some mathematical analysis I've done also suggests that the substitution is a simple one (i.e. not a Vigenere etc). The relative flatness of the frequency distribution of the digits suggests, however, that some unusual words have been used in the cipher, with perhaps even complete suppression of all "Es". This would certainly help explain the difficulty the cipher has presented amateurs. In the search for any other clues, I tracked down the son of D'Agapeyeff (also named Alexander, and a (d).former president of the British Computer Society). He does recall many amateurs contacting his father showing him their (unsuccessful) efforts, and his father's embarrassment by the fuss caused. However, D'Agapeyeff junior does not believe it was a hoax. That said, he could offer absolutely no help on its construction or contents. It seems advisable to begin any attack by considering initially only the cipher systems mentioned in (e)D'Agapeyeff's book, before proceeding to others (eg Mirabeau, suggested by some). 5. Possible computational attacksTo fix ideas, assume that the D'Agapeyeff really is a substitution followed by a 14 x 14 transposition. There are then 14! = 9 x 10^10 possible arrangements of the columns of the ciphertext. Similarly, as there are 18 different digit pairs in the D'Agapeyeff, then assuming they represent 18 of the 25 effective letters of the alphabet, we have a search space for the substitution of order 25! ---------- = 5 x 10^5 18! x 7! The total solution space is thus of the order (9 x 10^10) x (5 x 10^5) = 5 x 10^16 This total solution spaces is a challenge even for distributed computing techniques (Paul Leyland, Microsoft (a) Brute force attackResearch, Cambridge, Private Comm 4/9/03), as searching (half of) 5 x 10^16 keys at 1 s per computation = 800 machine-years of computation. This is a major time investment for a trivial problem, especially given that the assumptions underlying the keyspace size calculation may not be correct. The transposition element is brute-forcible (searching half of 9 x 10^10 keys at 1 mu-s per computation = 1 machine-day of computation) but the problem here is that there is no clearly unambiguous way to distinguish the right sequence from the wrong one without detecting English in the plaintext - and that, of course, requires that the substitution element has been previously broken. That said, it may be possible to shrink the transposition problem quite considerably by using the Turingesque approach of terminating the search of a certain part of the transposition keyspace as soon as an inconsistency emerges. In the case of transpositions, this could be triple repeats - that is, occurrences of digit pairs such as 81 81 81. Triple repeats of letters are extremely rare in English, and none would be expected in a text of the length of the D'Agapeyeff. (It may be worth calculating how many triple repeats one would expect in the ciphertext by chance, to gauge the effectiveness of this strategy - a high probability of accidental triple repeats would make it very worth while). Another possibility, suggested by Lee Bohan, formerly of Aston University, is to use a dictionary-based attack on the transposition, by generating a set of possible transposition sequences on the assumption that they were generated from real words in English. Such a set could be rapidly generated from an electronic dictionary. To cover the possibility that the transposition may be 7 x 28 rather than 14 x 14, he suggests including words of 7 letters and longer in the search. Even so, this technique shares the handicap of all brute-forcing attacks, apart from the triple-letter method : both elements of the encipherment - the transposition and the substitution - have to be solved simultaneously. A more sophisticated approach is, of course, to use search algorithms such as genetic algorithms.(b) Search algorithmsSuch an attack requires a fitness function, and this would come from an analysis of the language structure of each attempted solution. As transpositions scramble columns of text, single-letter frequencies are not sufficient: bigram frequencies have to be used. The fitness function would then consist of counting the number of pairs of letters of the form TH, AN, ER, RE, NE etc created, and comparing those with the number expected from a text of the length of the D'Agapeyeff. Following from the argument above, it would also be worth incorporating punishment for solutions leading to the creation of triple repeats. A GA-based attack has recently been attempted by Antrobus at Keele University, and a paper is being prepared by Antrobus, Rugg and myself reporting on the outcome. I hope to include more details on this shortly. 6. Concluding remarksThe D'Agapeyeff cipher represents an interesting challenge for search algorithms and/or distributed computing methods, which should surely make short work of the likely solution space. 75628 28591 62916 48164 91748 58464 74748 28483 81638 18174APPENDIXThe D'Agapeyeff Cipher 74826 26475 83828 49175 74658 37575 75936 36565 81638 17585 75756 46282 92857 46382 75748 38165 81848 56485 64858 56382 72628 36281 81728 16463 75828 16483 63828 58163 63630 47481 91918 46385 84656 48565 62946 26285 91859 17491 72756 46575 71658 36264 74818 28462 82649 18193 65626 48484 91838 57491 81657 27483 83858 28364 62726 26562 83759 27263 82827 27283 82858 47582 81837 28462 82837 58164 75748 58162 92000 Frequency distribution of number pairsPair 81 62 75 82 85 64 83 74 63 91 65 84 72 92 93 71 94 04 20 17 17 17 17 16 15 14 12 12 11 11 9 3 2 1 1 1No. Freq (%) 10 9 9 9 9 8 8 7 6 6 6 6 5 1.5 1 .5 .5 .5 |