Fun With Markov

As I was reading a block of consultant-speak produced by my employer, I found myself thinking how amusing it would be to combine the pretentious, inscrutable MBA-speak style one reads in analyst reports and New Economy jerk-off mags, with content that is jarringly different, such as the Bible, erotic fiction, or Dr. Seuss.
To explore this idea further, I turned to Markov text generation. Using Markov models to synthesize almost-legible texts which eerily preserve the tone and style of a given input text is a well-established technique; I simply want to experiment with it a bit. A Google search for ‘markov text’ yields all the relevant background.
In particular, I’m interested in combining multiple texts to produce a synthesis of their styles. Combining Alice in Wonderland with Revelations has already been done, with semi-comical results, but I wonder if there isn’t more to be done.
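For reference, here is a minimal sketch of the basic technique, assuming a word-level chain where the state is the previous `order` words. The function names `build_chain` and `generate` are my own for illustration and don't come from any particular library.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the list of words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        # Repeated followers stay in the list, so frequent ones get picked more often.
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=30, seed=None):
    """Walk the chain from a random starting state, up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state only appeared at the very end of the text
        word = rng.choice(followers)
        output.append(word)
        state = state[1:] + (word,)
    return " ".join(output)
```

A larger `order` gives more legible output that copies longer runs of the source; a smaller one gives more nonsense.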
Here are some notes I took whilst at the office reflecting on this subject:
Experiment with:

  • Providing a corpus of text containing baseline verbiage. The probabilities from this text would be subtracted from the sample text probabilities, so the distinguishing words and phrases of the sample text are exposed.
  • Web-based tool that accepts text sources via URL or a text area
  • Combine multiple sources, and assign weights to each source to boost the probability of words/phrases from one source over the other.
  • Find some way to combine texts in an interesting way, such as Dr. Seuss’ novel structure w/ analyst whitepapers’ inscrutable MBA-speak.
  • Markov text from logged IM conversations, recreating the IM style of both participants
  • Markov text from IRC transcripts
  • Implementation that distinguishes headers from body text and applies separate text generation logic to each
  • Amazon Top 100 list title and author names re-generated with markov text. Perhaps reviews as well.
  • Blog post generator; given an RSS feed URL, use markov text to generate another post. Perhaps include mix-ins from other blogs.
  • Markov text from spam/phishing scams
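The idea of weighting sources against each other could be sketched roughly as follows. This is only one way to do it: it assumes a first-order (single previous word) model, and it normalizes each source's follower counts before applying the weight so that a long source doesn't drown out a short one. The names `transition_counts`, `blend`, and `next_word` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def transition_counts(text):
    """Count, for each word, how often each other word follows it."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def blend(sources):
    """Merge (text, weight) pairs into one table: word -> {follower: weighted score}."""
    blended = defaultdict(lambda: defaultdict(float))
    for text, weight in sources:
        for state, followers in transition_counts(text).items():
            total = sum(followers.values())
            for word, n in followers.items():
                # n / total is this source's own probability; weight boosts or damps it.
                blended[state][word] += weight * n / total
    return {state: dict(f) for state, f in blended.items()}


def next_word(blended, state, rng=random):
    """Draw a follower of `state` in proportion to its blended score, or None at a dead end."""
    followers = blended.get(state)
    if not followers:
        return None
    words = list(followers)
    return rng.choices(words, weights=[followers[w] for w in words])[0]
```

With `blend([(seuss_text, 1.0), (whitepaper_text, 3.0)])`, wherever both sources offer a follower for the same word, the whitepaper's choices are favored three to one.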

Update:
I’ve refined my thinking on this subject. I think Markov text generation would be more intelligent, and would produce more amusing combinations of literary styles, if it included information about parts of speech. For example, knowing that the word ‘ham’ comes after ‘green eggs and’ with a high probability is useful, but knowing that ‘ham’ is a noun makes it more likely that other nouns can be substituted for ‘ham’ without impacting the readability of the resulting text.
This matters to me because I want to be able to combine some aspects of one corpus (perhaps sentence structure or compositional style) with other aspects of another corpus (vocabulary specifically). By encoding part-of-speech information, the distinctive vocabulary words from one corpus can be overlaid atop the compositional style of another corpus, to generate phrases like ‘green eggs and optionality’ or ‘And God sent down a foglifter from heaven’, etc.
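A rough sketch of the overlay idea, under heavy assumptions: the tiny `POS` lexicon below is hand-written purely for illustration (a real one would be far larger), and the `overlay` function and its `rate` parameter are my own invention. It keeps the template's word order and swaps in same-tagged words from the other vocabulary.

```python
import random

# Hypothetical toy lexicon, written by hand for this example only.
POS = {
    "green": "adj", "scalable": "adj",
    "eggs": "noun", "ham": "noun", "optionality": "noun", "foglifter": "noun",
    "and": "conj",
}

# Only swap content words; leave conjunctions and the like alone.
SWAPPABLE = {"noun", "adj", "verb"}


def overlay(template, vocabulary, pos=POS, rate=0.5, seed=None):
    """Replace swappable words in `template` with same-tagged words from `vocabulary`."""
    rng = random.Random(seed)
    by_tag = {}
    for word in vocabulary:
        tag = pos.get(word)
        if tag:
            by_tag.setdefault(tag, []).append(word)
    out = []
    for word in template.split():
        tag = pos.get(word.lower())
        candidates = by_tag.get(tag, [])
        # Each eligible word is swapped with probability `rate`.
        if tag in SWAPPABLE and candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)
```

Calling `overlay("green eggs and ham", ["optionality"])` leaves ‘green’ and ‘and’ alone and may replace either noun, so ‘green eggs and optionality’ is one of the possible results.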
The Moby project has an extensive database of English words encoded by part of speech, and a Bayesian predictor could be used to guess the part of speech of unknown words based on context. Perhaps additional rules could infer part of speech from prefixes or suffixes, which is particularly important in a language like English, where nouns can become verbs so easily. Of course, irregular verbs make this more difficult than it needs to be.
Once this information is captured, the distinctive characteristics of particular documents could be identified by comparing them with a large corpus of ‘plain’ writing. Unique words, phrases, and structures would emerge if the traits of the baseline corpus were subtracted, leaving only the differential elements.
By quantitatively identifying what distinguishes Dr. Seuss from a New York Times article (apart from the superior cogent, unbiased depth and breadth of Seuss), it should be possible to introduce Seussian elements into a New York Times article (thereby making it better).
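The lookup-then-guess approach might look something like this. The suffix table is a made-up handful of rules for illustration, not taken from any real tagger, and `lexicon` stands in for whatever word list is actually loaded; the context-based Bayesian step is left out entirely.

```python
# Hypothetical suffix rules, checked in order; first match wins.
SUFFIX_RULES = [
    ("ness", "noun"),
    ("tion", "noun"),
    ("ity", "noun"),
    ("ize", "verb"),
    ("ous", "adj"),
    ("ing", "verb"),
    ("ly", "adv"),
    ("ed", "verb"),
]


def guess_pos(word, lexicon):
    """Return the lexicon's tag if the word is known, else guess from its suffix."""
    w = word.lower()
    if w in lexicon:
        return lexicon[w]
    for suffix, tag in SUFFIX_RULES:
        if w.endswith(suffix):
            return tag
    return "unknown"
```

Rules this crude misfire often (‘family’ comes out as an adverb), which is why the lexicon is consulted first and context would still be needed to do the job properly.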
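At the level of single words, the subtraction could be sketched like this, assuming plain relative frequencies and whitespace tokenization (both are simplifications, and the function names are hypothetical). Words that are more common in the sample than in the baseline float to the top.

```python
from collections import Counter


def relative_freq(text):
    """Fraction of the text's words accounted for by each distinct word."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: n / total for w, n in counts.items()}


def distinctive(sample, baseline, top=5):
    """Words over-represented in `sample` relative to `baseline`, strongest first."""
    s = relative_freq(sample)
    b = relative_freq(baseline)
    # Subtract the baseline's frequency; words absent from the baseline keep their full score.
    diff = {w: p - b.get(w, 0.0) for w, p in s.items()}
    positive = [w for w in diff if diff[w] > 0]
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(positive, key=lambda w: (-diff[w], w))[:top]
```

The same subtraction would apply to phrase or part-of-speech sequence frequencies, not just single words.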
Another update:
Apparently part of what I’m trying to do (ascertain the part of speech a word belongs to) is a problem well known to computational linguistics. In CL, algorithms that determine the part of speech of a token are called taggers, as discussed in Tagging the Teleman Corpus. Apparently there are a number of approaches, including a variation of the Hidden Markov Model I was proposing, Bayesian classification, statistical approaches, and neural networks.
Typically, when the academicians apply themselves to a problem, it becomes inaccessible to the laypeople. This is probably no exception, but maybe I’ll be able to extract a few nuggets of wisdom I can use.
