This past weekend I dusted off my prototypical Ruby implementation of Markov chains for the purposes of generating sentences that bear striking similarities to a corpus of sample text, but are in fact random nonesense text. My first exposure to this idea was the implementation in Kernigan and Pike’s Practice of Programming, but I’ve run across it a number of times since.
Most Markov text generation schemes I’ve run across are just for fun, like mixing the text characteristics of the Bible and Dr. Seuss or whatever. My idea was to use Markov text generation to generate memorable, secure passphrases which resemble a familiar text. I figured Markov chains would generate structurally sound sentences which would enable users to remember the sentence in terms of its structure rather than a random sequence of words, which is a common cognitive trick to remember long strings. I’m not the first to have this idea; at least passkool implements a variation of the same idea.
Markov chains aren’t hard to implement, and after a few hours I had a working implementation and some unit tests. However, I wanted something a little different: I wanted to compute the information theoretic entropy of each generated string, so users could ensure the strength of their passphrase was commensurate with the key or data being protected.
Shannon’s theory of information established a formalized definition of information entropy, which allows us to determine exactly how many bits of information are encoded in a particular variable given the probabilities of each of the variable’s possible values. This adapts very nicely to Markov chains, which are themselves really just states linked by state transitions of various probabilities. Using this formalization, I can use Markov chains to generate some text, and determine how many bits of information are encoded in the text.
The reason this is cool is that it relates directly to cryptography. Since humans tend to be unable to reliably memorize long binary cryptographic keys, we’ve taken to using passwords (or, hopefully, passphrases) which can be cryptographically converted into long binary cryptographic keys and are usually easier for humans to remember. The problem with this approach is that, if you’re not careful, you’ll kneecap your encryption algorithm by using a week passphrase.
For example, let’s say you need a 128-bit AES key, and rather than remember the key (or even worse, write it down!) you derive it from a passphrase which you can remember. Once derived, you use the key confident that even the US government probably can’t break your 128-bit encryption. However, it’s quite possible you’re actually using what is effectively 32-bit encryption, or possibly less, depending upon your passphrase. Is your passphrase something obvious like your name, the word “secret”, or a dictionary word? Then it’s not as secure as the 128-bit key it’s being used to derive. A random 128-bit key has 2^128 (a huge number, believe me!) possible values, but your shitty passphrase could be guessed within maybe a few million tries, which is easy to brute-force with modern computers.
Security professionals who understand this problem give us rules of thumb to relate the length and composition of a passphrase with a corresponding key strength (for example, the 1.2 bits per character rule), but this is only a rough approximation of security. If you pick a quote from Roget’s, the quote might be 100 characters long, and thus 120 bits strong, but that’s only true for an adversary who will try to guess your passphrase at random based on English letter order probabilities. If the adversary knows or has reason to guess you took the quote from a quote book, the security of the key is considerably lower. If the book has 10,000 quotes in it and you picked one at random, that’s -(1/10000 * LOG(1/10000, 2)) * 10000 bits of entropy, or about 14 bits. Any attacker worth his salt can try all quotes from the quote book in seconds.
This is where Markov text generation comes in. Since the text is generated from a series of state transition probabilities, it’s easy to compute the exact entropy level (that is, key strength) of each generated string. That’s what the measure_entropy_for_tokens method of my Markov class does. With this measurement, you can be confident that an attacker who can precisely duplicate your training corpus and Markov chain parameters nonetheless must brute-force the phrase with a level of difficulty comparable to brute-forcing a cryptographic key with similar entropy.
Once I had this implemented, I started to generate sentences from all sorts of sample text from the collected works of Rudyard Kipling to a wide assortment of sci-fi. Whether I generated vaguely-pronounceable nonsense words or whole sentences, the length of text required to reach 128 bits of entropy was alot more than I expected. This was made even worse by the occasional word strings which had zero entropy (meaning there was no chance any other word would be selected) due unique combinations of words in the source corpus.
Here are some examples I generated using a collection of children’s books from Project Gutenberg. I find kids books have simpler sentence structure and a smaller vocabulary, so they make for easier to remember passphrases:
Bill got hurt in their banishment (16.01220550168 bits)
But if you wish, distribute this etext electronically, or by disk, book or any little girls their lessons, and then ventured to move some time before morning the good Saint come to you,’ said Fergus, ‘with greetings from Concobar the King likewise to Fergus, and he wasn’t black (84.2709242256745 bits)
I should use nought save a half-dozen jealously guarded little precincts of good cheer (25.6622696871773 bits)
I did seek it (17.3989261460847 bits)
You see Lightfoot has no hair on him (24.9873805812463 bits)
Shadow is the child, most fair (20.460032160771 bits)
Christmas is going to Johnny, rubbed her head on one of your making such a hard white crust on the shore (51.4191467788532 bits)
The white snow fell softly, softly, and then he sometimes does great damage (36.4897175557749 bits)
WHO IS THERE? she said she’d marry me (11.7074031814505 bits)
Here’s Martha, mother! cried the two big caterpillars, a lizard, a small gold ring began to fly at intervals, like a drill, or as if you have already enjoyed them—without knowing or wondering why (61.6564060837357 bits)
Yes, Mammy, said Epaminondas (12.5248260288151 bits)
*These Etexts Prepared By Hundreds of Volunteers and financial support to provide volunteers with the Mouse family (20.1629955228745 bits)
Martha didn’t like to feel just as useful as you can, and very pretty song (36.138788709601 bits)
Note the entropy measurements next to each phrase. These entropy figures are specific to the exact corpus I used to train the model.
Imagine you need 128 bits of entropy; you’d need to combine at least two and possibly six or more of these phrases depending on the strength of each one. I’m not sure I could reliably remember such a passphrase, and I certainly couldn’t accommodate a great number of them for various accounts.
I think the lesson here is that Markov text generation is a good approach to passphrase generation, and that 128 bits of entropy is alot of information for the human brain to contain. I suspect there are some optimizations to be had to pack more entropy into a more memorable package, but the real limitation here is human memory capacity.
The code for the Markov implementation, tests, and generator tool is on my SVN repository here. I didn’t upload my corpus to SVN since it includes some copyrighted works; I suggest Project Gutenberg as a great source of public domain text files.
I am planning to downloadyour
I am planning to downloadyour generator tool and am getting hang with it yet. Anyway I do agre that Markov text generation is a good approach to passphrase generation but its bit complicated.