# A Day In The Lyf

…the lyf so short, the craft so longe to lerne

## TDD’ing a Markov Chain

In *The Practice of Programming*, Brian Kernighan and Rob Pike develop a simple algorithm that scrambles text while keeping the output from degenerating into complete gibberish. The idea is simple to understand:

- Parse the input into a list of prefixes. A prefix two words long seems to work well.
- For each prefix in the input text, keep a list of suffixes, where a suffix is the word immediately following the prefix. If the same suffix follows a prefix multiple times in the input, it should be listed multiple times in the suffix list.
- To create the output text, start with an initial prefix and randomly select a suffix for that prefix. Then slide your prefix over one word, so that it now consists of the last word of the previous prefix plus the new suffix. Rinse, lather, repeat.

Kernighan and Pike suggest using sentinel values to help start and end the process. The algorithm is a simple example of a Markov chain algorithm.
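To make the data structure concrete, here is a minimal standalone sketch of such a prefix-to-suffix table for a tiny input (the input string and variable names are illustrative, not from the book):

```ruby
# Build a prefix -> suffixes table with two-word prefixes.
# Duplicate suffixes are deliberately kept: they provide the weighting.
words = %w[the quick brown fox sees the quick red fox]

table = Hash.new { |hash, key| hash[key] = [] }
words.each_cons(3) do |first, second, suffix|
  table[[first, second]] << suffix
end

table[%w[the quick]]  # => ["brown", "red"]
```

Generation then repeatedly picks a random element from the suffix list of the current prefix, exactly as the last bullet above describes.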

I decided to give a Ruby implementation a go in a TDD fashion to see if I could learn something. I had the advantage of having already seen a few implementations by Kernighan and Pike, but to keep it somewhat fair, I didn’t consult the book during the implementation.

If you want to follow along with the code, you can download it here. To give you a sense of where we’re going, here’s the first stanza of Edgar Allan Poe’s *The Raven*, after scrambling:

Once upon a bust of Pallas just above my chamber door; –
This it is, and this mystery explore; –
‘Tis the wind and nothing more.

As always, the hardest part is just getting started. On string algorithms, I find starting with an empty string keeps me in my comfort zone:

```ruby
class EmptyStringMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new ''
  end

  def test_speak
    assert_equal '', @markov.speak
  end
end
```

`Test::Unit`, like most members of the xUnit family that work in reflection-equipped languages, relies on a simple naming convention—all methods starting with ‘test_’ are executed. It fails, of course, because `Markov` doesn’t exist yet. This should do the trick:

```ruby
class Markov
  def initialize(text)
  end

  def speak
    ''
  end
end
```

In my experience, developers who struggle to understand TDD, or don’t want to understand TDD, look at the above code and wonder what the point is. I haven’t really tested anything, because I haven’t really implemented anything. However, I’ve designed the simplest thing I can think of that gets me what I want. Design is hard, and simple design is even harder (don’t confuse simple with easy).

The code above isn’t hard at all. This is the whole point of TDD, but many people struggle to grasp it. Some people, including those in the BDD camp, point out that many people misunderstand TDD because it has the word ‘test’ in it. Were I writing a TDD logo, it’d look something like this: test DRIVEN development.

I know I need to parse the input into a list of prefixes. I’ll try that next.

```ruby
def test_prefixes
  assert_equal ["#{Markov::SENTINEL} #{Markov::SENTINEL}"],
               @markov.prefixes
end
```

Surrounding the original text by sentinel values is one of those tricks that, once you see it, makes perfect sense. Of course, having read Kernighan & Pike’s examples, I’d already seen the trick. Since I’m in a dynamic language, I can run the test right now and see it fail with a `NoMethodError`. However, I think it’s more valuable to go ahead and add the missing constant and method, and make it fail because the implementation is wrong, not just because it’s missing. I added the following to `Markov`:

```ruby
attr_reader :prefixes
SENTINEL = '<SENTINEL>'  # any marker that cannot occur in the input text

def initialize(text)
  @prefixes = []
end
```

If you’re not a Ruby programmer, the `attr_reader` method dynamically creates a `prefixes` getter method that returns the instance variable of the same name (the @ sign indicates an instance variable, and the [] indicates an empty array). This fails, of course, but it’s easy to fix using Ruby’s string interpolation:

```ruby
def initialize(text)
  @prefixes = ["#{SENTINEL} #{SENTINEL}"]
end
```

Yes, I know, I’m cheating. However, my tests pass, and I’d prefer having the tests force me to move away from hard-coding results in the system under test. To do that, I probably need to move away from the empty string:

```ruby
class OneWordMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new 'word'
  end

  def test_speak
    assert_equal 'word', @markov.speak
  end
end
```

I gave this test its own fixture. This is something Dave Astels and the BDD community recommend: if your tests require different state than the tests you’ve already written, don’t hesitate to create a new class for them. Sticking to one test class per production class is a smell.

After giving it a moment’s thought, I realize I can keep cheating. Maybe this wasn’t such a great test after all. That’s OK; realizing that development is a herky-jerky activity is part of the strength of TDD. Here’s the changed `Markov` class:

```ruby
def initialize(text)
  @original_text = text
  @prefixes = ["#{SENTINEL} #{SENTINEL}"]
end

def speak
  @original_text
end
```

Similarly, for the prefixes:

```ruby
def test_prefixes
  assert_equal ["#{Markov::SENTINEL} word ", "word #{Markov::SENTINEL}"],
               @markov.prefixes
end
```

Notice that I’m including trailing whitespace as part of the prefix. I also intend to include trailing punctuation, so I don’t have to be too clever about parsing.

Finally my hard-coding ways have caught up with me. Now, for the first time, I actually have to think a little about implementation. I think that’s A Good Thing—interface first, implementation second. Here’s the changed code:

```ruby
class Markov

  SENTINEL = '<SENTINEL>'  # any marker that cannot occur in the input text
  WORDS_IN_PREFIX = 2

  def initialize(text)
    @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
    build_prefixes
  end

  def speak
    @original_text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
  end

  private

  def build_prefixes
    @prefixes = []
    words = @original_text.scan(/\S+\s*/)
    max_index = words.length - WORDS_IN_PREFIX
    (0..max_index).each do |i|
      prefix = words[i..(i + WORDS_IN_PREFIX - 1)].join
      @prefixes << prefix unless @prefixes.include?(prefix)
    end
  end
end
```

I added the sentinels around the input text and parsed it into a list of two-word prefixes (The `scan` method, which you can read about in the Pickaxe book, returns an array of all non-overlapping substrings matching the regular expression. Regular expressions are delimited by slashes). The `speak` method now has to remove the sentinels. This fixes my new test, but breaks my empty string prefix test. An empty string will leave two spaces between the sentinels (look at the constructor above—there’s a sentinel+space before the text, and a space+sentinel after the text). I fix the test and get back to green.
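As a standalone illustration of that parsing step (plain strings, outside the class):

```ruby
# Each scanned word keeps its trailing whitespace, so a prefix can be
# rebuilt later with a simple join.
words = 'the quick  brown fox'.scan(/\S+\s*/)
# words == ["the ", "quick  ", "brown ", "fox"]

prefix = words[0..1].join
# prefix == "the quick  "
```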

It occurs to me that I probably overdid `build_prefixes`. While I think it’s right, there is some code in it, like the check for duplicate prefixes, that isn’t strictly dictated by the tests yet. TDD takes discipline: don’t add code that isn’t demanded by failing tests. Figuring out how to do that isn’t always easy, but in this case it shouldn’t have been hard. I just got carried away.

I think it’s still too big of a jump to start expecting random output text according to our Markov chain algorithm, so I start focusing on collecting the suffixes:

```ruby
# In EmptyStringMarkovTest
def test_suffixes
  suffixes = {"#{Markov::SENTINEL}  #{Markov::SENTINEL}" => []}
  assert_equal suffixes, @markov.suffixes
end
```

The `suffixes` variable above represents a hashtable, with an empty array as the value for the prefix key. I can make this one work by hard-coding, and since parsing the suffixes is non-trivial, I think I will:

```ruby
# In Markov

def initialize(text)
  @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
  build_prefixes
  @suffixes = {"#{SENTINEL} #{text} #{SENTINEL}" => []}
end
```

I’m using the same trick here I used for the prefix code above: I’m designing the code around the simplest example I can think of, and hard-coding a result until forced otherwise. That helps me focus on interface over implementation, without getting bogged down in tricky parsing code. Here, the interface is using a hashtable for `suffixes`. Now, let’s force some implementation:

```ruby
# In OneWordMarkovTest
def test_suffixes
  suffixes = {
    "#{Markov::SENTINEL} word " => ["#{Markov::SENTINEL}"],
    "word #{Markov::SENTINEL}"  => []
  }
  assert_equal suffixes, @markov.suffixes
end
```

This one requires some thought:

```ruby
def initialize(text)
  @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
  build_prefixes
  build_suffixes
end

private

def build_suffixes
  @suffixes = {}
  @prefixes.each do |prefix|
    @suffixes[prefix] = find_suffixes_for(prefix)
  end
end
```

You’ll see here the same “fake it until you make it” attitude that I used to hard-code test results. I don’t know how to find the suffixes for a prefix, so I’ll just pretend like the function to do it is already written. That makes the `build_suffixes` easy, even if it doesn’t work yet. Various authors in the Agile community, including Ron Jeffries and Brian Marick, have recommended using this approach, and I can attest to its effectiveness. Really, it’s the same principle behind divide and conquer algorithms.

Here’s my first attempt at `find_suffixes_for`:

```ruby
def find_suffixes_for(prefix)
  pattern = Regexp.new("#{prefix}(\\S+\\s*)")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.begin(0)
    text = text[(offset + 1)..-1]
  end
  suffixes
end
```

This gets the test to pass. The `MatchData` object returned by matching a regular expression gives you the entire matched string as `match[0]`, the first captured group (delimited by parentheses in the regular expression) as `match[1]`, and so on. If we find a match, we add the captured suffix to our suffixes array (using the `<<` append operator) and slice `text` to start right after the first character of the previous match.
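If the `MatchData` API is unfamiliar, here is its behavior in miniature (a standalone example, not the article’s code):

```ruby
match = /qu(ick)/.match('the quick fox')

match[0]        # the entire matched string: "quick"
match[1]        # the first captured group:  "ick"
match.begin(0)  # offset of the whole match: 4
match.begin(1)  # offset of the first group: 6
```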

Up until now, I haven’t had to do any refactoring, but it is the crucial last step of each test-code-refactor iteration. The design benefits of TDD depend on it, as it is what enables the ‘evolution’ in ‘evolutionary design’. Let’s face it: everybody loves to refactor. But refactoring takes discipline. Refactoring without tests is a lot like wearing a chastity belt—expect an initial surge of excitement to be followed by a prolonged period of frustration.

Looking over the code, I’m a bit bothered by the duplicated regular expression used to define the words in the text. Looks like I need to extract a constant:

```ruby
# In Markov
WORD_REGEXP = /\S+\s*/

def build_prefixes
  @prefixes = []
  words = @original_text.scan(WORD_REGEXP)
  max_index = words.length - WORDS_IN_PREFIX
  (0..max_index).each do |i|
    prefix = words[i..(i + WORDS_IN_PREFIX - 1)].join
    @prefixes << prefix unless @prefixes.include?(prefix)
  end
end

def find_suffixes_for(prefix)
  pattern = Regexp.new("#{prefix}(#{WORD_REGEXP.source})")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.begin(0)
    text = text[(offset + 1)..-1]
  end
  suffixes
end
```

I’ve pretty much exhausted my options with empty strings and one word strings. I have a good start on parsing the prefixes and suffixes, so maybe it’s time to start with the Markov chain. First, I need some input that forces the result of the `speak` method to be different than the original text. Since suffixes are randomly chosen, using a small string may not provide enough data to statistically guarantee a different string.
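Note that the weighting comes for free from the data structure: because duplicate suffixes are stored multiple times, a uniform random pick over the list is automatically frequency-weighted. A quick standalone check of that idea (the seed and counts are illustrative):

```ruby
srand(1)  # fixed seed so this sketch is reproducible
suffixes = ['c '] * 10 + ['d ']   # 'c ' stored 10 times, 'd ' once
draws = Array.new(11_000) { suffixes[rand(suffixes.length)] }

ratio = draws.count('c ') / draws.count('d ')
# ratio comes out close to 10
```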

```ruby
class StatisticalMarkovTest < Test::Unit::TestCase
  def setup
    @text = ('a b c ' * 10) + 'a b d '
    @text *= 1000
    @text = @text.strip
    @markov = Markov.new @text
  end

  def test_prefixes
    prefixes = ["#{Markov::SENTINEL} a ", "a b ", "b c ",
                "c a ", "b d ", "d a ", "d #{Markov::SENTINEL}"]
    assert_equal prefixes, @markov.prefixes
  end

  def test_suffixes
    assert_equal 10000, suffixes_with("c ")
    assert_equal 1000, suffixes_with("d ")
  end

  def suffixes_with(suffix)
    @markov.suffixes["a b "].find_all {|s| s == suffix}.length
  end
end
```

I wrote the prefixes test just to make sure I understood what they were. It passed the first time. I wrote the suffixes method to give me a little more confidence in my parsing. It also passed the first time. I’m not entirely sure either test was really useful, and it’s quite possible that I could have just been procrastinating.

```ruby
def test_weighted_random
  markov_text = @markov.speak
  assert_not_equal @text, markov_text
end
```

I’m not done with this test, but I’ll leave it like this until I get back to green. After quite a bit of experimenting (encompassing more time than I usually like to go between passing tests), I come up with the following:

```ruby
class Markov

  SENTINEL = '<SENTINEL>'  # any marker that cannot occur in the input text
  WORDS_IN_PREFIX = 2
  WORD_REGEXP = /\S+\s*/

  def initialize(text)
    @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
    build_prefixes
    build_suffixes
  end

  def speak
    text = words(@original_text)[0..1].join
    text += random_suffix_for(text) until random_suffix_for(text).nil?
    text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
  end

  private

  def build_prefixes
    @prefixes = []
    words = words(@original_text)
    max_index = words.length - WORDS_IN_PREFIX
    (0..max_index).each do |i|
      prefix = words[i..(i + WORDS_IN_PREFIX - 1)].join
      @prefixes << prefix unless @prefixes.include?(prefix)
    end
  end

  def build_suffixes
    @suffixes = {}
    @prefixes.each do |prefix|
      @suffixes[prefix] = find_suffixes_for(prefix)
    end
  end

  def find_suffixes_for(prefix)
    pattern = Regexp.new("#{prefix}(#{WORD_REGEXP.source})")
    suffixes = []
    text = @original_text

    loop do
      match = pattern.match(text)
      break if match.nil?
      suffixes << match[1]
      offset = match.begin(0)
      text = text[(offset + 1)..-1]
    end
    suffixes
  end

  def random_suffix_for(text)
    words = words(text)
    prefix = words[(words.length - WORDS_IN_PREFIX)..-1].join
    suffixes = @suffixes[prefix]
    return nil if suffixes.nil? || suffixes.length == 0
    suffixes[rand(suffixes.length - 1)]
  end

  def words(text)
    # We're including punctuation as part of the word
    text.scan(WORD_REGEXP)
  end
end
```

I refactored the `scan` call into a `words` method while I was working because I need it in several places. This code works, but slowly. That’s OK—I’ll deal with that later.

This was a big jump, maybe too big. I might have been better off trying to come up with smaller steps. But the deed is done; let’s see if we can clean it up. Gritting my teeth, I decide to ignore the performance issue and go for some refactoring. The first thing I see is that I can make parsing the prefixes a bit more intention-revealing:

```ruby
def speak
  text = first_prefix(@original_text)
  text += random_suffix_for(text) until random_suffix_for(text).nil?
  text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix_for(text)
  prefix = last_prefix(text)
  suffixes = @suffixes[prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length - 1)]
end

def last_prefix(text)
  words = words(text)
  words[(words.length - WORDS_IN_PREFIX)..-1].join
end

def first_prefix(text)
  words(text)[0..1].join
end
```

I make the change and run the tests. 106 seconds later, I see I’m still on green. Fine, I guess it’s time to deal with performance.

It’s important to note that this is the first time I’ve really considered performance in the implementation. As Kernighan and Pike say, the first rule of performance tuning is: don’t. Even more caustic is Knuth’s (or Hoare’s?) comment that “premature optimization is the root of all evil.” It’s an evil I’ve committed too often in my career, and every time the software has been the worse for it. Write your code as cleanly and intention-revealingly as possible. If, and only if, there are performance problems do you start optimizing. Remember, about 4% of your program will account for about 50% of the runtime; optimizing just because you can means you’re probably wasting your time in the other 96% of the code.

I break out the profiler and run my slow test (`ruby -r profile statistical_markov_test.rb -n test_weighted_random`). Here’s a sample of the top of the output:

```
1 tests, 1 assertions, 0 failures, 0 errors
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
23.56    12.24     12.24        7  1748.57  2685.71  Kernel.loop
17.99    21.59      9.35     5668     1.65     1.65  String#scan
16.42    30.12      8.53        1  8530.00 18550.00  Range#each
 8.95    34.77      4.65    33001     0.14     0.21  Array#include?
 4.31    37.01      2.24   103016     0.02     0.02  String#==
 3.50    38.83      1.82    77379     0.02     0.02  Array#[]
 3.41    40.60      1.77     5665     0.31     2.44  Markov#random_suffix_for
```

It looks like the `String#scan` method is my main problem (Ruby documenters use the # character to indicate an instance method). That makes sense; it’s used in my `words` method, which is called every time a prefix is requested in `random_suffix_for`.

My first thought is to cache:

```ruby
def words(text)
  # We're including punctuation as part of the word
  # This is the bottleneck of the algorithm, hence the caching
  @words ||= {}
  @words[text] ||= text.scan(WORD_REGEXP)
  @words[text]
end
```

The best thing I can say about this is that it’s not the stupidest thing I’ve ever done. After wrestling with my machine for several minutes to regain control because of the massive pagefile swapping, I’m finally able to kill the ruby process running my tests.

Looking at the code, I’m almost embarrassed to point out that the cache takes up memory but never actually gets used: the `text` passed in as the key to the hashtable is different every time suffixes are requested. I’m shaking my head in shame. I really wish you had stopped me from making such a public fool of myself.
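The failure mode is easy to reproduce in miniature: since each lookup uses a longer string as its key, every lookup misses and the cache only grows (a standalone sketch with made-up names):

```ruby
cache = {}
text = 'a b '

3.times do
  cache[text] ||= text.scan(/\S+\s*/)  # the "expensive" parse being cached
  text += 'x '                         # next key differs: a guaranteed miss
end

cache.size  # => 3 -- three lookups, three misses, three cached entries
```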

Now that I’m seeing things a little bit more clearly, I notice how the text that’s passed in is monotonically increasing in length, adding one suffix each time. Rather than scanning each time, we could scan once, and then add each suffix manually:

```ruby
def speak
  @text = first_prefix(@original_text)
  @words = words(@text)
  while (suffix = random_suffix_for(@text)) != nil
    @text << suffix
    @words << suffix
  end
  @text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix_for(text)
  prefix = last_prefix
  suffixes = @suffixes[prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length - 1)]
end

def last_prefix
  @words[(@words.length - WORDS_IN_PREFIX)..-1].join
end
```

This runs in a respectable 5 seconds, which, if memory serves, puts it roughly in the ballpark Kernighan and Pike had.

Notice how, by keeping more state, I’ve eliminated the parameter to `last_prefix`. I think I can do the same thing for `random_suffix_for`:

```ruby
def speak
  @text = first_prefix(@original_text)
  @words = words(@text)
  while (suffix = random_suffix) != nil
    @text << suffix
    @words << suffix
  end
  @text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length - 1)]
end
```

Scanning the rest of the code, I notice an egregious bit of duplication that I had somehow missed before:

```ruby
def first_prefix(text)
  words(text)[0..1].join
end
```

Where did the 1 come from in the array slice? It’s really this:

```ruby
def first_prefix(text)
  words(text)[0..(WORDS_IN_PREFIX - 1)].join
end
```

I feel pretty close now, and I think it’s time to do some real testing. One of the things I remember Kernighan and Pike doing is verifying that, if ‘a b c ’ outnumbers ‘a b d ’ 10:1, then we should see 10 times as many ‘c’ suffixes as ‘d’ suffixes. This is where I was headed with the `test_weighted_random` method. However, I’ll need to count the number of occurrences of a substring, and I don’t see a library method to help me. No problem, this is Ruby after all. Here’s the code I ended up with after a couple of TDD cycles, which have been compressed for your viewing convenience:

```ruby
class StringTest < Test::Unit::TestCase
  def test_zero_occurrences
    assert_equal 0, 'the quick brown fox'.occurrences_of('cat')
  end

  def test_occurrences_in_empty_string
    assert_equal 0, ''.occurrences_of('a')
    assert_equal 1, ''.occurrences_of('')
  end

  def test_occurrences_of
    text = ('a b c ') * 10 + 'a b d'
    assert_equal 10, text.occurrences_of('a b c')
    assert_equal 1, text.occurrences_of('a b d')
  end
end

class String
  def occurrences_of(substring)
    count, i = 0, 0
    loop do
      i = self.index(substring, i)
      break if i.nil?
      count += 1
      i += 1
    end
    count
  end
end
```

Adding a method to an already existing class like this is known as aspect-oriented introduction. It’s a difficult, and sometimes impossible, feat in many languages. In Ruby, it’s that easy.
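The underlying mechanism is Ruby’s open classes: reopening a class adds the method for everyone, including objects created before the reopening. A standalone illustration (the `shout` method is made up for the example):

```ruby
greeting = 'hello'   # created before String is reopened

class String
  def shout
    upcase + '!'
  end
end

greeting.shout  # => "HELLO!" -- existing instances see the new method too
```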

Now I can change my statistical test:

```ruby
def test_weighted_random
  markov_text = @markov.speak
  assert_not_equal @text, markov_text
  count_bc = markov_text.occurrences_of('b c')
  count_bd = markov_text.occurrences_of('b d')
  ratio_bc_to_bd = count_bc / count_bd
  assert ratio_bc_to_bd.between?(9, 11),
         "there are #{ratio_bc_to_bd} times as many 'b c' fragments as 'b d' fragments (should be ~10)"
end
```

I’m pleased to see it work the first time. I decide to test with some real input now. I’d like to verify that every three word combination (2-word prefix + suffix) in the output also exists in the source text. To do that, I’ll need access to the `words` method. I decide it probably belongs in the `String` class too, so I move it and make a few tests for it:

```ruby
class String
  WORD_REGEXP = /\S+\s*/

  def words
    # We're including punctuation as part of the word
    scan(WORD_REGEXP)
  end
end

class StringTest < Test::Unit::TestCase
  def test_words_in_empty_string
    assert_equal [], ''.words
  end

  def test_words
    assert_equal ['the ', 'quick ', 'brown ', 'fox'],
                 'the quick brown fox'.words
  end

  def test_words_include_punctuation
    assert_equal ['first, ', 'second!  ', 'third? ', 'fourth...'],
                 'first, second!  third? fourth...'.words
  end
end
```

Now I grab some real input and give it a go:

```ruby
class ComplexMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new the_raven
  end

  def test_speak
    markov_text = @markov.speak
    puts markov_text
    assert_not_equal the_raven, markov_text, "should scramble text"

    # each (prefix + suffix) combo must also be in original text
    words = markov_text.words
    phrase_size = Markov::WORDS_IN_PREFIX + 1
    (0..words.length - phrase_size).each do |i|
      phrase = words[i..(i + phrase_size - 1)].join
      assert the_raven.include?(phrase), "<#{phrase}> not in original"
    end
  end

  def the_raven
    # Returns the entire text of Edgar Allan Poe's "The Raven"
    # download the source code if you want to see the whole thing...
  end
end
```

I’m a bit surprised to see this one fail. After a good bit of playing around, I discover that I had an off-by-one error because I misunderstood how the library `rand` method works. According to the documentation, if the parameter `max` is an integer greater than 0, `rand` returns a random integer in the range 0 to max-1. I was already subtracting one, so I was always leaving off the last suffix.
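In other words, `rand(max)` already excludes `max` itself, so the extra subtraction silently dropped the last element. A quick standalone check:

```ruby
srand(0)  # fixed seed for reproducibility
samples = Array.new(10_000) { rand(3) }

samples.uniq.sort  # => [0, 1, 2] -- rand(3) covers every index of a 3-element array
# With rand(3 - 1) the result could only ever be 0 or 1.
```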

That makes me wonder how it ever worked for prefixes that had only one suffix, which is the way all my earliest tests worked. If `max` is 0, `rand` returns a real number between 0.0 and 1.0. I’m not sure how a float works as an array index, so I try it out in irb:

```
middlestate:~/Development/markov_tdd/markov/test bhbyars$ irb
>> test = [1, 2]
=> [1, 2]
>> test[0.5]
=> 1
>> test[0.3]
=> 1
>> test[0.01]
=> 1
>> test[0.9]
=> 1
```

I just got lucky then (or unlucky); Ruby appears to always truncate the array index, masking my defect. Testers like to distinguish between faults and failures. I had a fault all along in my code, but until now it had not caused an observable failure. It’s an easy fix:

```ruby
def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length)]
end
```

You may have noticed that I added a line in the `speak` test to echo the `markov_text`. This was primarily for my own curiosity; I wanted to see what the text looked like. But I noticed something odd—the text wasn’t ending on ‘nevermore!’, which is the last word of the poem. The sentinel values should force the output to end on the last word of the source text, since the only prefix with an empty suffix list should be the last word plus the sentinel value. Let me test that:

```ruby
def test_all_prefixes_have_suffix_except_last
  keys = @markov.suffixes.keys.find_all {|key| @markov.suffixes[key] == []}
  assert_equal ["#{the_raven.words.last} #{Markov::SENTINEL}"], keys
end
```

Here’s the result:

```
1) Failure:
test_all_prefixes_have_suffix_except_last:10
<...> expected but was
<...>.
```

I spot the error almost immediately. Several of the extra prefixes have a `?` character in them, which has special meaning in regular expressions. I need to escape the prefix before building the `Regexp`:

```ruby
def find_suffixes_for(prefix)
  pattern = Regexp.new("#{Regexp.escape(prefix)}(#{String::WORD_REGEXP.source})")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.begin(0)
    text = text[(offset + 1)..-1]
  end
  suffixes
end
```
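To see why the escape matters: `?` is a regexp metacharacter, so an unescaped prefix like `door? ` makes the `r` optional instead of matching a literal question mark. A standalone illustration:

```ruby
prefix = 'door? '
text   = 'my chamber door? Nothing more'

Regexp.new(prefix).match?(text)                  # false: the '?' quantifies the 'r'
Regexp.new(Regexp.escape(prefix)).match?(text)   # true: matches the literal 'door? '
```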

While thinking about the prefix-suffix relationship, I realized that I had an unnecessary nil check in `random_suffix`; all prefixes should have suffixes, and only the last one should be empty:

```ruby
def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.length == 0
  suffixes[rand(suffixes.length)]
end
```

Now all tests pass, and the output looks right. I note to myself how, while the simple unit tests did help me design the application, it took some real tests with real input before I caught some rather important defects.

And now I admire my creation:

Open here I flung the shutter, when, with many a quaint and curious volume of forgotten lore,
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door –
only this, and nothing more.

Written by Brandon Byars

May 6, 2007 at 12:40 pm

Posted in Ruby, TDD
