A Day In The Lyf

…the lyf so short, the craft so longe to lerne


Big Methods Considered Harmful

Several years back, as a young programmer out of school who thought he understood OOP inside and out, I remember a conversation with a colleague about having to take over somebody else’s code. My colleague was upset because the original programmer used so many small methods that it was hard to figure out what anything was doing. Isn’t it so much easier, he rationalized (and I agreed), to just use a few methods, and a few objects, and make it obvious what you’re doing?

Years later, my colleague having moved on, we’re left with a mess of a system in certain parts—and those big method parts are now the hardest to understand, maintain, and extend. With the accretion of features, fixes, and cruft, some of those methods have morphed to over 1000 lines of code, and appear impervious to refactoring due to our complete inability to understand what the hell the method actually does. It’s a tremendous counter-example to our earlier rationalizations.

I now recognize the “bigger is better” attitude as the mark of an immature object-oriented developer, somebody who hasn’t understood the real power of OO yet. Kent Beck makes the point vividly with his Composed Method pattern in Smalltalk Best Practice Patterns. Large methods do indeed make it easier to follow the flow of control, but they do so at the expense of flexibility and composability.

Small methods allow you to isolate assumptions. Small methods allow you to say things once and only once, leading to code that is DRY and elegant. Small methods let you easily see the big picture without getting lost in the details (our earlier naive fallacy was wanting to see the details up front). Small methods help you discover new responsibility—feature envy stands out more. Small methods help you isolate rates of change, keeping responsibilities that have to change in every subclass tucked away in one set of methods, and those that don’t have to change in another set of methods. Small methods allow you to see everything at the same level of abstraction. Small methods make most comments unnecessary. Small methods make unit testing easier, since the units are smaller. Small methods aid in creating cohesive systems, where each method has only one reason to change. And small methods make performance tuning easier.
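To make that concrete, here is a minimal before/after sketch of my own (receipt, total_of, and format_currency are invented names, not examples from Beck’s book): the “before” version mixes three levels of abstraction, while the composed version reads like a table of contents.

```ruby
# Before: one method mixing iteration, arithmetic, and formatting
def receipt(items)
  total = 0
  items.each { |item| total += item[:price] * item[:quantity] }
  "Total: $%.2f" % total
end

# After: each method reads at a single level of abstraction
def receipt(items)
  format_currency(total_of(items))
end

def total_of(items)
  items.inject(0) { |sum, item| sum + item[:price] * item[:quantity] }
end

def format_currency(amount)
  "Total: $%.2f" % amount
end
```

Notice that the composed version isolates the pricing rule and the formatting rule in separate methods, so either can change without touching the other.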

Yes, small methods can help performance. A lot of people, particularly those from the C or C++ world, seem to have trouble believing that. It’s true that method calls have some overhead to maintain the stack, but for 99.999% of applications that overhead simply isn’t worth worrying about, and if it is worth worrying about then you’re probably not writing in an object-oriented language anyhow. Algorithmic improvements are several orders of magnitude more important than inlining method calls. And small methods, which isolate assumptions so well, make algorithmic improvements easier to spot. Want to use a memoization cache? You’ll likely have to touch only one method.
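As a generic sketch of what I mean (PrimeCounter is an invented example, not code from this post), memoizing an expensive computation is a one-method change when the computation is already isolated:

```ruby
class PrimeCounter
  # The public method adds memoization in one place...
  def primes_below(n)
    @cache ||= {}
    @cache[n] ||= compute_primes_below(n)
  end

  private

  # ...while the expensive algorithm stays isolated in its own small method
  def compute_primes_below(n)
    (2...n).select do |candidate|
      (2..Math.sqrt(candidate)).none? { |i| candidate % i == 0 }
    end
  end
end
```

Nothing else in the class, and no caller, needs to know the cache exists.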

The C language has macros, which are textually substituted in a preprocessing step to simulate function calling without incurring the overhead. Consider the following quote:

There is a tendency among older C programmers to write macros instead of functions for very short computations that will be executed frequently… The reason is performance: a macro avoids the overhead of a function call. This argument was weak even when C was first defined, a time of slow machines and expensive function calls; today it is irrelevant. With modern machines and compilers, the drawbacks of function macros outweigh their benefits.

The author of that quote is Brian Kernighan (The Practice of Programming), who also happens to be the co-author of the first book on the C language.

Many people think (or are even trained) that the only reason to break apart a method is if you want to reuse a part of it. That line of thinking is indefensible. The most difficult part of programming is not maximizing reuse; it’s minimizing complexity. Reuse is just one of the tools we use to minimize complexity; writing clean code that communicates well is another.

Written by Brandon Byars

May 28, 2007 at 10:43 am

Posted in Design, Readability

TDD’ing a Markov Chain

In The Practice of Programming, Brian Kernighan and Rob Pike develop a simple algorithm to scramble text, yet do so in a way that helps prevent turning the output into gibberish. The idea is simple to understand:

  • Parse the input into a list of prefixes. A prefix of two words long seems to work well.
  • For each prefix in the input text, keep a list of suffixes, where a suffix is the word immediately following the prefix. If the same suffix exists in the input multiple times for a prefix, it should be listed multiple times in the suffix list.
  • To create the output text, starting with an initial prefix, randomly select a suffix for that prefix. Then slide your prefix over one word so that it now includes the last word of the previous prefix and the new suffix. Rinse, lather, repeat.

Kernighan and Pike suggest using sentinel values to help start and end the process. The algorithm is a simple example of a Markov chain algorithm.
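The steps above can be sketched in a few lines of Ruby (my own minimal illustration, without the sentinel trick, and using word arrays rather than the string parsing developed below):

```ruby
def markov_generate(words, length)
  # Steps 1-2: map each two-word prefix to its list of suffixes
  suffixes = Hash.new { |hash, key| hash[key] = [] }
  (0...words.length - 2).each do |i|
    suffixes[[words[i], words[i + 1]]] << words[i + 2]
  end

  # Step 3: slide the two-word window, appending random suffixes
  output = words[0, 2]
  while output.length < length
    candidates = suffixes[output.last(2)]
    break if candidates.empty?
    output << candidates[rand(candidates.length)]
  end
  output.join(' ')
end
```

Because suffixes are listed once per occurrence, common continuations are chosen proportionally more often, which is what keeps the output from being pure gibberish.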

I decided to give a Ruby implementation a go in a TDD fashion to see if I could learn something. I had the advantage of having already seen a few implementations by Kernighan and Pike, but to keep it somewhat fair, I didn’t consult the book during the implementation.

If you want to follow along with the code, you can download it here. To give you a sense of where we’re going, here’s the first stanza of Edgar Allan Poe’s The Raven, after scrambling:

Once upon a bust of Pallas just above my chamber door; -
This it is, and this mystery explore; -
‘Tis the wind and nothing more.

As always, the hardest part is just getting started. On string algorithms, I find starting with an empty string keeps me in my comfort zone:

class EmptyStringMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new ''
  end

  def test_speak
    assert_equal '', @markov.speak
  end
end

Test::Unit, like most members of the xUnit family that work in reflection-equipped languages, relies on a simple naming convention—all methods starting with ‘test_’ are executed. It fails, of course, because Markov doesn’t exist yet. This should do the trick:

class Markov
  def initialize(text)
  end

  def speak
    ''
  end
end

In my experience, developers who struggle to understand TDD, or don’t want to understand TDD, look at the above code and wonder what the point is. I haven’t really tested anything, because I haven’t really implemented anything. However, I’ve designed the simplest thing I can think of that gets me what I want. Design is hard, and simple design is even harder (don’t confuse simple with easy).

The code above isn’t hard at all. That’s the whole point of TDD, though many people struggle to grasp it. Some people, including those in the BDD camp, argue that TDD is misunderstood precisely because it has the word ‘test’ in it. Were I designing a TDD logo, it’d look something like this: test DRIVEN development.

I know I need to parse the input into a list of prefixes. I’ll try that next.

def test_prefixes
  assert_equal ["#{Markov::SENTINEL} #{Markov::SENTINEL}"],
    @markov.prefixes
end

Surrounding the original text by sentinel values is one of those tricks that, once you see it, makes perfect sense. Of course, having read Kernighan & Pike’s examples, I’d already seen the trick. Since I’m in a dynamic language, I can run the test right now and see it fail with a NameError, since the SENTINEL constant doesn’t exist yet. However, I think it’s more valuable to go ahead and add the missing constant and method, and make it fail because the implementation is wrong, not just because it’s missing. I added the following to Markov:

attr_reader :prefixes
SENTINEL = ""

def initialize(text)
  @prefixes = []
end

If you’re not a Ruby programmer, the attr_reader method dynamically creates a prefixes getter method that returns the instance variable of the same name (the @ sign indicates an instance variable, and the [] indicates an empty array). This fails, of course, but it’s easy to fix using Ruby’s string interpolation:

def initialize(text)
  @prefixes = ["#{SENTINEL} #{SENTINEL}"]
end

Yes, I know, I’m cheating. However, my tests pass, and I’d prefer having the tests force me to move away from hard-coding results in the system under test. To do that, I probably need to move away from the empty string:

class OneWordMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new 'word'
  end

  def test_speak
    assert_equal 'word', @markov.speak
  end
end

I gave this test its own fixture. This is something Dave Astels and the BDD community recommend: if your tests require different state than the tests you’ve already written, don’t hesitate to create a new class for them. Rigidly sticking to one test class per production class is a smell.

After giving it a moment’s thought, I realize I can keep cheating. Maybe this wasn’t such a great test after all. That’s OK; realizing that development is a herky-jerky activity is part of the strength of TDD. Here’s the changed Markov class:

def initialize(text)
  @original_text = text
  @prefixes = ["#{SENTINEL} #{SENTINEL}"]
end

def speak
  @original_text
end

Similarly, for the prefixes:

def test_prefixes
  assert_equal ["#{Markov::SENTINEL} word ", "word #{Markov::SENTINEL}"],
    @markov.prefixes
end

Notice that I’m including trailing whitespace as part of the prefix. I also intend to include trailing punctuation, so I don’t have to be too clever about parsing.

Finally my hard-coding ways have caught up with me. Now, for the first time, I actually have to think a little about implementation. I think that’s A Good Thing—interface first, implementation second. Here’s the changed code:

class Markov
  attr_reader :prefixes

  SENTINEL = ""
  WORDS_IN_PREFIX = 2

  def initialize(text)
    @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
    build_prefixes
  end

  def speak
    @original_text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
  end

  private
    def build_prefixes
      @prefixes = []
      words = @original_text.scan /\S+\s*/
      max_index = words.length - WORDS_IN_PREFIX
      (0..max_index).each do |i|
        prefix = words[i..(i+WORDS_IN_PREFIX-1)].join
        @prefixes << prefix unless @prefixes.include?(prefix)
      end
    end
end

I added the sentinels around the input text and parsed it into a list of two-word prefixes (The scan method, which you can read about in the Pickaxe book, returns an array of all non-overlapping substrings matching the regular expression. Regular expressions are delimited by slashes). The speak method now has to remove the sentinels. This fixes my new test, but breaks my empty string prefix test. An empty string will leave two spaces between the sentinels (look at the constructor above—there’s a sentinel+space before the text, and a space+sentinel after the text). I fix the test and get back to green.
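If scan is unfamiliar, a quick illustration of its behavior on a simple input (a hypothetical string, not from the post):

```ruby
# scan returns every non-overlapping match, in order;
# \S+ grabs a word, \s* grabs its trailing whitespace
'the quick  brown fox'.scan(/\S+\s*/)
# => ["the ", "quick  ", "brown ", "fox"]
```

Note how internal whitespace travels with the preceding word, which is exactly why joining consecutive entries reconstructs the original spacing.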

It occurs to me that I probably overdid build_prefixes. While I think it’s right, there’s some code in it, like checking for duplicate prefixes, that isn’t strictly dictated by the tests yet. TDD takes discipline to not add code that isn’t demanded by failing tests. Figuring out how to do that isn’t always easy, but in this case it shouldn’t have been hard. I just got carried away.

I think it’s still too big of a jump to start expecting random output text according to our Markov chain algorithm, so I start focusing on collecting the suffixes:

# In EmptyStringMarkovTest
def test_suffixes
  suffixes = {"#{Markov::SENTINEL}  #{Markov::SENTINEL}"=> []}
  assert_equal suffixes, @markov.suffixes
end

The suffixes variable above represents a hashtable, with an empty array as the value for the prefix key. I can make this one work by hard-coding, and since parsing the suffixes is non-trivial, I think I will:

# In Markov
attr_reader :prefixes, :suffixes

def initialize(text)
  @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
  build_prefixes
  @suffixes = {"#{SENTINEL} #{text} #{SENTINEL}" => []}
end

I’m using the same trick here I used for the prefix code above. I’m designing the code around the simplest example I can think of, and hard-coding a result until forced otherwise. That helps me focus on interface over implementation, without getting bogged down in tricky parsing code. Here, the interface is using a hashtable for suffixes. Now, let’s force some implementation:

# In OneWordMarkovTest
def test_suffixes
  suffixes = {
    "#{Markov::SENTINEL} word " => ["#{Markov::SENTINEL}"],
    "word #{Markov::SENTINEL}" => []
  }
  assert_equal suffixes, @markov.suffixes
end

This one requires some thought:

def initialize(text)
  @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
  build_prefixes
  build_suffixes
end

private
def build_suffixes
  @suffixes = {}
  @prefixes.each do |prefix|
    @suffixes[prefix] = find_suffixes_for(prefix)
  end
end

You’ll see here the same “fake it until you make it” attitude that I used to hard-code test results. I don’t know how to find the suffixes for a prefix, so I’ll just pretend the function to do it is already written. That makes build_suffixes easy to write, even if it doesn’t work yet. Various authors in the Agile community, including Ron Jeffries and Brian Marick, have recommended this approach, and I can attest to its effectiveness. Really, it’s the same principle behind divide and conquer algorithms.

Here’s my first attempt at find_suffixes_for:

def find_suffixes_for(prefix)
  pattern = Regexp.new("#{prefix}(\\S+\\s*)")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.offset(1)[0]
    text = text[offset..-1]
  end
  suffixes
end

This gets the test to pass. The match object returned by matching a regular expression gives you the entire matched string as match[0], the first matched group (delimited by parentheses in the regular expression) as match[1], and so on. If we find a match, we add it to our suffixes array (using the << append operator) and slice text so it begins where the matched suffix starts, so the next search resumes from there.
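A quick illustration of the MatchData pieces used above (a toy pattern, not the post’s code):

```ruby
match = /qu(ick)/.match('the quick brown fox')
match[0]         # => "quick" -- the entire match
match[1]         # => "ick"   -- the first parenthesized group
match.offset(1)  # => [6, 9]  -- start and (exclusive) end indices of group 1
```

Slicing the text from match.offset(1)[0] onward is what lets the loop find each subsequent occurrence of the prefix.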

Up until now, I haven’t had to do any refactoring, but it is the crucial last step to each test-code-refactor iteration. The design benefits of TDD depend on it, as it is what enables the ‘evolution’ in ‘evolutionary design’. Let’s face it: everybody loves to refactor. But refactoring takes discipline. Refactoring without tests is a lot like wearing a chastity belt—expect an initial surge of excitement to be followed by a prolonged period of frustration.

Looking over the code, I’m a bit bothered by the duplicated regular expression used to define the words in the text. Looks like I need to extract a constant:

# In Markov
WORD_REGEXP = /\S+\s*/

def build_prefixes
  @prefixes = []
  words = @original_text.scan WORD_REGEXP
  max_index = words.length - WORDS_IN_PREFIX
  (0..max_index).each do |i|
    prefix = words[i..(i+WORDS_IN_PREFIX-1)].join
    @prefixes << prefix unless @prefixes.include?(prefix)
  end
end

def find_suffixes_for(prefix)
  pattern = Regexp.new("#{prefix}(#{WORD_REGEXP.source})")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.offset(1)[0]
    text = text[offset..-1]
  end
  suffixes
end

I’ve pretty much exhausted my options with empty strings and one word strings. I have a good start on parsing the prefixes and suffixes, so maybe it’s time to start with the Markov chain. First, I need some input that forces the result of the speak method to be different than the original text. Since suffixes are randomly chosen, using a small string may not provide enough data to statistically guarantee a different string.

class StatisticalMarkovTest < Test::Unit::TestCase
  def setup
    @text = ('a b c ' * 10) + 'a b d '
    @text *= 1000
    @text = @text.strip
    @markov = Markov.new @text
  end

  def test_prefixes
    prefixes = ["#{Markov::SENTINEL} a ", "a b ", "b c ",
      "c a ", "b d ", "d a ", "d #{Markov::SENTINEL}"]
    assert_equal prefixes, @markov.prefixes
  end

  def test_suffixes
    assert_equal 10000, suffixes_with("c ")
    assert_equal 1000, suffixes_with("d ")
  end

  def suffixes_with(suffix)
    @markov.suffixes["a b "].find_all {|s| s == suffix}.length
  end
end

I wrote the prefixes test just to make sure I understood what they were. It passed the first time. I wrote the suffixes test to give me a little more confidence in my parsing. It also passed the first time. I’m not entirely sure either test was really useful, and it’s quite possible that I was just procrastinating.

def test_weighted_random
  markov_text = @markov.speak
  assert_not_equal @text, markov_text
end

I’m not done with this test, but I’ll leave it like this until I get back to green. After quite a bit of experimenting (encompassing more time than I usually like to go between passing tests), I come up with the following:

class Markov
  attr_reader :prefixes, :suffixes

  SENTINEL = ""
  WORDS_IN_PREFIX = 2
  WORD_REGEXP = /\S+\s*/

  def initialize(text)
    @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
    build_prefixes
    build_suffixes
  end

  def speak
    text = words(@original_text)[0..1].join
    text += random_suffix_for(text) until random_suffix_for(text).nil?
    text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
  end

  private
  def build_prefixes
    @prefixes = []
    words = words(@original_text)
    max_index = words.length - WORDS_IN_PREFIX
    (0..max_index).each do |i|
      prefix = words[i..(i+WORDS_IN_PREFIX-1)].join
      @prefixes << prefix unless @prefixes.include?(prefix)
    end
  end

  def build_suffixes
    @suffixes = {}
    @prefixes.each do |prefix|
      @suffixes[prefix] = find_suffixes_for(prefix)
    end
  end

  def find_suffixes_for(prefix)
    pattern = Regexp.new("#{prefix}(#{WORD_REGEXP.source})")
    suffixes = []
    text = @original_text

    loop do
      match = pattern.match(text)
      break if match.nil?
      suffixes << match[1]
      offset = match.offset(1)[0]
      text = text[offset..-1]
    end
    suffixes
  end

  def random_suffix_for(text)
    words = words(text)
    prefix = words[(words.length-WORDS_IN_PREFIX)..-1].join
    suffixes = @suffixes[prefix]
    return nil if suffixes.nil? || suffixes.length == 0
    suffixes[rand(suffixes.length-1)]
  end

  def words(text)
    # We're including punctuation as part of the word
    text.scan(WORD_REGEXP)
  end
end

I refactored the scan call into a words method while I was working because I need it in several places. This code works, but slowly. That’s OK—I’ll deal with that later.

This was a big jump, maybe too big. I might have been better off trying to come up with smaller steps. But the deed is done; let’s see if we can clean it up. Gritting my teeth, I decide to ignore the performance issue and go for some refactoring. The first thing I see is that I can make parsing the prefixes a bit more intention-revealing:

def speak
  text = first_prefix(@original_text)
  text += random_suffix_for(text) until random_suffix_for(text).nil?
  text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix_for(text)
  prefix = last_prefix(text)
  suffixes = @suffixes[prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length-1)]
end

def last_prefix(text)
  words = words(text)
  words[(words.length-WORDS_IN_PREFIX)..-1].join
end

def first_prefix(text)
  words(text)[0..1].join
end

I make the change and run the tests. 106 seconds later, I see I’m still on green. Fine, I guess it’s time to deal with performance.

It’s important to note that this is the first time I’ve really considered performance in the implementation. As Kernighan and Pike say, the first rule of performance tuning is: don’t. Even more caustic is Knuth’s (or Hoare’s?) comment that “premature optimization is the root of all evil.” It’s an evil I’ve committed too often in my career, and every time the software has been the worse for it. Write your code as cleanly and intention-revealing as possible. If, and only if, there are performance problems do you start optimizing. Remember, about 4% of your program will account for about 50% of the runtime; optimizing just because you can means you’re probably wasting your time in the other 96% of the code.

I break out the profiler and run my slow test (ruby -r profile statistical_markov_test.rb -n weighted_random). Here’s a sample of the top of the output:

1 tests, 1 assertions, 0 failures, 0 errors
%   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
23.56   12.24     12.24        7  1748.57  2685.71  Kernel.loop
17.99   21.59      9.35     5668     1.65     1.65  String#scan
16.42   30.12      8.53        1  8530.00 18550.00  Range#each
8.95    34.77      4.65    33001     0.14     0.21  Array#include?
4.31    37.01      2.24   103016     0.02     0.02  String#==
3.50    38.83      1.82    77379     0.02     0.02  Array#[]
3.41    40.60      1.77     5665     0.31     2.44  Markov#random_suffix_for

It looks like the String#scan method is my main problem (Ruby documenters use the # character to indicate an instance method). That makes sense; it’s used in my words method, which is called every time a prefix is requested in random_suffix_for.

My first thought is to cache:

def words(text)
  # We're including punctuation as part of the word
  # This is the bottleneck of the algorithm, hence the caching
  @words ||= {}
  @words[text] ||= text.scan(WORD_REGEXP)
  @words[text]
end

The best thing I can say about this is that it’s not the stupidest thing I’ve ever done. After wrestling with my machine for several minutes to regain control because of the massive pagefile swapping, I’m finally able to kill the ruby process running my tests.

Looking at the code, I’m almost embarrassed to point out that the caching takes up memory but never produces a single cache hit. The text used as the key into the hashtable is different on every call when getting suffixes, since it grows by one suffix each time. I’m shaking my head in shame. I really wish you had stopped me from making such a public fool of myself.

Now that I’m seeing things a little bit more clearly, I notice how the text that’s passed in is monotonically increasing in length, adding one suffix each time. Rather than scanning each time, we could scan once, and then add each suffix manually:

def speak
  @text = first_prefix(@original_text)
  @words = words(@text)
  while (suffix = random_suffix_for(@text)) != nil
    @text << suffix
    @words << suffix
  end
  @text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix_for(text)
  prefix = last_prefix
  suffixes = @suffixes[prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length-1)]
end

def last_prefix
  @words[(@words.length-WORDS_IN_PREFIX)..-1].join
end

This runs in a respectable 5 seconds, which, if memory serves, puts it roughly in the ballpark Kernighan and Pike had.
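The speedup makes sense: the old version re-scanned the whole (growing) text on every step, which is quadratic in the output length, while the new version scans once and appends. A toy benchmark of the same pattern (my own hypothetical illustration, not the post’s code):

```ruby
require 'benchmark'

words = ['word '] * 2000

# Quadratic: re-split the accumulated text on every step
rescan = Benchmark.realtime do
  text = ''
  words.each { |w| text += w; text.scan(/\S+\s*/) }
end

# Linear: keep the word list alongside the text and append to both
append = Benchmark.realtime do
  text = ''
  list = []
  words.each { |w| text << w; list << w }
end

rescan > append  # the re-scanning version is dramatically slower
```

The extra state (the parallel word list) is the price paid to avoid re-parsing.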

Notice how, by keeping more state, I’ve eliminated the parameter to last_prefix. I think I can do the same thing for random_suffix_for:

def speak
  @text = first_prefix(@original_text)
  @words = words(@text)
  while (suffix = random_suffix) != nil
    @text << suffix
    @words << suffix
  end
  @text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length-1)]
end

Scanning the rest of the code, I notice an egregious bit of duplication that I had somehow missed before:

def first_prefix(text)
  words(text)[0..1].join
end

Where did the 1 come from in the array slice? It’s really this:

def first_prefix(text)
  words(text)[0..(WORDS_IN_PREFIX-1)].join
end

I feel pretty close now, and I think it’s time to do some real testing. One of the things I remember Kernighan and Pike doing is verifying that, if ‘a b c ’ outnumbers ‘a b d ’ 10:1, then we should see 10 times as many ‘c’ suffixes as ‘d’ suffixes. This is where I was headed with the test_weighted_random method. However, I’ll need to count the number of occurrences of a substring, and I don’t see a library method to help me. No problem; this is Ruby, after all. Here’s the code I ended up with after a couple of TDD cycles, compressed here for your viewing convenience:

class StringTest < Test::Unit::TestCase
  def test_zero_occurrences
    assert_equal 0, 'the quick brown fox'.occurrences_of('cat')
  end

  def test_occurrences_in_empty_string
    assert_equal 0, ''.occurrences_of('a')
    assert_equal 1, ''.occurrences_of('')
  end

  def test_occurrences_of
    text = ('a b c ') * 10 + 'a b d'
    assert_equal 10, text.occurrences_of('a b c')
    assert_equal 1, text.occurrences_of('a b d')
  end
end

class String
  def occurrences_of(substring)
    count, i = 0, 0
    loop do
      i = self.index(substring, i)
      break if i.nil?
      count += 1
      i += 1
    end
    count
  end
end

Adding a method to an already existing class like this is sometimes called aspect-oriented introduction; Rubyists know the underlying feature as open classes. It’s a difficult, and sometimes impossible, feat in many languages. In Ruby, it’s that easy.

Now I can change my statistical test:

def test_weighted_random
  markov_text = @markov.speak
  assert_not_equal @text, markov_text
  count_bc = markov_text.occurrences_of('b c')
  count_bd = markov_text.occurrences_of('b d')
  ratio_bc_to_bd = count_bc / count_bd
  assert ratio_bc_to_bd.between?(9, 11),
    "there are #{ratio_bc_to_bd} times as many 'b c' fragments as 'b d' fragments (should be ~10)"
end

I’m pleased to see it work the first time. I decide to test with some real input now. I’d like to verify that every three word combination (2-word prefix + suffix) in the output also exists in the source text. To do that, I’ll need access to the words method. I decide it probably belongs in the String class too, so I move it and make a few tests for it:

class String
  WORD_REGEXP = /\S+\s*/

  def words
    # We're including punctuation as part of the word
    scan(WORD_REGEXP)
  end
end

class StringTest < Test::Unit::TestCase
  def test_words_in_empty_string
    assert_equal [], ''.words
  end

  def test_words
    assert_equal ['the ', 'quick ', 'brown ', 'fox'],
      'the quick brown fox'.words
  end

  def test_words_include_punctuation
    assert_equal ['first, ', 'second!  ', 'third? ', 'fourth...'],
      'first, second!  third? fourth...'.words
  end
end

Now I grab some real input and give it a go:

class ComplexMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new the_raven
  end

  def test_speak
    markov_text = @markov.speak
    puts markov_text
    assert_not_equal the_raven, markov_text, "should scramble text"

    # each (prefix + suffix) combo must also be in original text
    words = markov_text.words
    phrase_size = Markov::WORDS_IN_PREFIX + 1
    (0..words.length-phrase_size).each do |i|
      phrase = words[i..(i+phrase_size-1)].join
      assert the_raven.include?(phrase), "#{phrase} not in original"
    end
  end

  def the_raven
    # Returns the entire text of Edgar Allan Poe's "The Raven"
    # download the source code if you want to see the whole thing...
  end
end

I’m a bit surprised to see this one fail. After a good bit of playing around, I discover that I had an off-by-one error because I misunderstood how the library rand method worked. According to the documentation, if the parameter max is an integer greater than 0, then it returns a random integer in the range [0..max-1]. I was already subtracting one, so I was always leaving off the last suffix.

That makes me wonder how it ever worked for prefixes that had only one suffix, which is the way all my earliest tests worked. If max is 0, rand returns a real number between 0.0 and 1.0. I’m not sure how this works for array indexes, so I try it out in irb:

middlestate:~/Development/markov_tdd/markov/test bhbyars$ irb
>> test = [1]
=> [1]
>> test[0.5]
=> 1
>> test[0.3]
=> 1
>> test[0.01]
=> 1
>> test[0.9]
=> 1

I just got lucky then (or unlucky); Ruby appears to always truncate the array index, masking my defect. Testers like to distinguish between faults and failures. I had a fault all along in my code, but until now it had not caused an observable failure. It’s an easy fix:

def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length)]
end
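The off-by-one is easy to demonstrate directly: rand(max) already excludes max, so subtracting one silently makes the last element unreachable.

```ruby
suffixes = ['a ', 'b ', 'c ']

# Buggy: rand(2) only ever returns 0 or 1, so 'c ' is unreachable
1000.times { raise 'unreachable' if suffixes[rand(suffixes.length - 1)] == 'c ' }

# Fixed: rand(3) returns 0, 1, or 2, covering the whole array
picks = (1..1000).map { suffixes[rand(suffixes.length)] }
picks.uniq.sort  # almost certainly ["a ", "b ", "c "]
```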

You may have noticed that I added a line in the speak test to echo the markov_text. This was primarily for my own curiosity; I wanted to see what the text looked like. But I noticed something odd—the text wasn’t ending on ‘nevermore!’, which is the last word of the poem. The use of sentinel values should force it to end on the last word of the source text, since the only prefix with an empty suffix list should be the last word plus the sentinel. Let me test that:

def test_all_prefixes_have_suffix_except_last
  keys = @markov.suffixes.keys.find_all {|key| @markov.suffixes[key] == []}
  assert_equal ["#{the_raven.words.last} #{Markov::SENTINEL}"], keys
end

Here’s the result:

1) Failure:
test_all_prefixes_have_suffix_except_last:10
 expected but was
.

I spot the error almost immediately. All the extra prefixes have a ? character in them, which has special meaning in regular expressions. I need to escape the prefix before building the Regexp:

def find_suffixes_for(prefix)
  pattern = Regexp.new("#{Regexp.escape(prefix)}(#{String::WORD_REGEXP.source})")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.offset(1)[0]
    text = text[offset..-1]
  end
  suffixes
end
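The escaping matters because prefixes taken from the poem contain regex metacharacters. For example, with a hypothetical prefix ending in a question mark:

```ruby
# Unescaped, '?' makes the preceding 'e' optional, so this pattern
# happily matches text containing no question mark at all
Regexp.new('more?').match('nothing mor')[0]                   # => "mor"

# Escaped, the pattern matches only the literal text 'more?'
Regexp.new(Regexp.escape('more?')).match('nothing mor')       # => nil
Regexp.new(Regexp.escape('more?')).match('nothing more?')[0]  # => "more?"
```

Unescaped prefixes were matching in places they shouldn’t, which is why extra prefixes ended up with empty suffix lists.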

While thinking about the prefix-suffix relationship, I realized that I had an unnecessary nil check in random_suffix; all prefixes should have suffixes, and only the last one should be empty:

def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.length == 0
  suffixes[rand(suffixes.length)]
end

Now all tests pass, and the output looks right. I note to myself how, while the simple unit tests did help me design the application, it took some real tests with real input before I caught some rather important defects.

And now I admire my creation:

Open here I flung the shutter, when, with many a quaint and curious volume of forgotten lore,
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door -
only this, and nothing more.

Written by Brandon Byars

May 6, 2007 at 12:40 pm

Posted in Ruby, TDD
