A Day In The Lyf

…the lyf so short, the craft so longe to lerne

Archive for the ‘Ruby’ Category

Code Generation and Metaprogramming

I wanted to expand upon an idea that I first talked about in my previous post on Common Lisp. There is a common pattern between syntactic macros, runtime metaprogramming, and static code generation.

Runtime metaprogramming is code-generation. Just like C macros. Just like CL macros.

Ok, that’s a bit of an overstatement. Those three things aren’t really just like each other. But they are definitely related—they all write code that you’d rather not write yourself. Because it’s boring. And repetitious. And ugly.

In general, there are three points at which you can generate code in the development process, although the terminology leaves something to be desired: before compilation, during compilation (or interpretation), and during runtime. In the software development vernacular, only the first option is typically called code-generation (I’ll call it static code generation to avoid confusion). Code generation during compilation goes under the moniker of a ‘syntactic macro,’ and I’m calling runtime code generation ‘runtime metaprogramming.’

Since the “meta” in metaprogramming implies writing code that writes code, all three forms of code generation can be considered metaprogramming, which is why I snuck the “runtime” prefix into the third option above. Just in case you were wondering…

Static Code Generation

Static code generation is the easiest to understand and the weakest of the three options, but it’s often your only option due to language limitations. C macros are an example of static code generation, and they are the only metaprogramming option available in C out of the box.

To take an example, on a previous project I generated code for lazy loading proxies in C#. A proxy, one of the standard GoF design patterns, sits in between a client and an object and intercepts messages that the client sends to the object. For lazy loading, this means that we can instantiate a proxy in place of a database-loaded object, and the client can use it without even knowing that it’s using a proxy. For performance reasons, the actual database object will only be loaded on first access of the proxy. Here’s a truncated example:

public class OrderProxy : IOrder
{
    private IOrder proxiedOrder = null;
    private long id;
    private bool isLoaded = false;

    public OrderProxy(long id)
    {
        this.id = id;
    }

    private void Load()
    {
        if (!isLoaded)
        {
           proxiedOrder = Find();
           isLoaded = true;
        }
    }

    private IOrder Find()
    {
        return FinderRegistry.OrderFinder.Find(id);
    }

    public string OrderNumber
    {
        get
        {
           Load();
           return proxiedOrder.OrderNumber;
        }
        set
        {
           Load();
           proxiedOrder.OrderNumber = value;
        }
    }

    public DateTime DateSubmitted
    {
        get
        {
           Load();
           return proxiedOrder.DateSubmitted;
        }
    }
}

This code is boring to write and boring to maintain. Every time the interface changes, a very repetitious change has to be made in the proxy. To make it worse, we have to do this for every database entity we’ll want to load (at least those we’re worried about lazy-loading). All I’d really like to say is “make this class implement the appropriate interface, and make it a lazy-loading proxy.” Fortunately, since the proxy is supposed to be a drop-in replacement for any other class implementing the same interface, we can use reflection to query the interface and statically generate the proxy.

There’s an important limitation to generating this code statically. Because we’re doing this before compilation, this approach requires a separated interfaces approach, where the binary containing the interfaces is separate from the assembly we’re generating the proxies for. We’ll have to compile the interfaces, use reflection on the compiled assembly to generate the source code for the proxies, and compile the newly generated source code.

But it’s do-able. Simply load the interface using reflection:

public static Type GetType(string name, string nameSpace, string assemblyFileName)
{
    if (!File.Exists(assemblyFileName))
        throw new IOException("No such file");

    Assembly assembly = Assembly.LoadFile(Path.GetFullPath(assemblyFileName));
    string qualifiedName = string.Format("{0}.{1}", nameSpace, name);
    return assembly.GetType(qualifiedName, true, true);
}

From there it’s pretty trivial to loop through the properties and methods and recreate the source code for them on the proxy, with a call to Load before delegating to the proxied object.
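To make that concrete, here’s a minimal Ruby sketch of the idea (the helper and its signature are illustrative, not the original generator): given a property’s type, name, and writability, already pulled from the interface via reflection, emit the C# source for the proxy property.

```ruby
# Hypothetical sketch: emit the C# proxy property source for one property.
# The call to Load() before delegating is what makes the proxy lazy.
def proxy_property(type, name, writable)
  source = <<EOT
    public #{type} #{name}
    {
        get { Load(); return proxiedOrder.#{name}; }
EOT
  source << "        set { Load(); proxiedOrder.#{name} = value; }\n" if writable
  source << "    }\n"
end

puts proxy_property('string', 'OrderNumber', true)
```

Looping over the interface’s properties and concatenating these snippets into a class body is all that remains.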

Runtime Metaprogramming

Now it turns out that when I wrote the code generation code above, there weren’t very many mature object-relational mappers in the .NET space. Fortunately, that’s changed, and the code above is no longer necessary. NHibernate will lazy-load for you, using a proxy approach similar to the one I used above. Except NHibernate writes the proxy code at runtime.

The mechanics of how this works are encapsulated in a nice little library called Castle.DynamicProxy. NHibernate uses reflection to read interfaces (or virtual classes) and calls DynamicProxy to generate code at runtime using the Reflection.Emit namespace. In C#, that’s a difficult thing to do, which is why I wouldn’t recommend doing it without DynamicProxy.

This is a much more powerful technique than static code generation. For starters, you no longer need two assemblies, one for the interfaces, and one for the proxies. But the power of runtime metaprogramming extends well beyond saving you a simple .NET assembly.

Ruby makes metaprogramming much easier than C#. The standard Rails object-relational mapper also uses proxies to manage associations, but the metaprogramming applies even to the model classes themselves (which are equivalent to the classes that implement our .NET interfaces). The truncated IOrder implementation above showed 3 properties: Id, OrderNumber, and DateSubmitted. Assuming we have those columns in our orders table in the database, then the following Ruby class completely implements the same interface:

class Order < ActiveRecord::Base
end

At runtime, the ActiveRecord::Base superclass will load the schema of the orders table and, for each column, add a property of the same name to the Order class. Now we really see the power of metaprogramming: it helps us keep our code DRY. If it’s already specified in the database schema, why should we have to specify it in our application code as well?
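The mechanics are easy to sketch in plain Ruby. This is not ActiveRecord’s actual implementation, just an illustration of the technique: a base class whose inherited hook reads the column names (stubbed here; ActiveRecord queries the schema) and defines an accessor pair on the subclass for each one.

```ruby
class RecordBase
  def self.columns
    # Stand-in for a real schema query; ActiveRecord asks the database.
    %w[id order_number date_submitted]
  end

  # Called when a class inherits from RecordBase; define one getter and
  # one setter per column on the new subclass.
  def self.inherited(subclass)
    subclass.columns.each do |column|
      subclass.send(:define_method, column) { @attributes[column] }
      subclass.send(:define_method, "#{column}=") {|value| @attributes[column] = value}
    end
  end

  def initialize
    @attributes = {}
  end
end

class Order < RecordBase
end

order = Order.new
order.order_number = 'A-100'
puts order.order_number  # => A-100
```

The Order class body stays empty, just like the ActiveRecord example, because the superclass writes the methods for it.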

Syntactic Macros

It probably wouldn’t make much sense to generate lazy-loading proxies at compile time, but that doesn’t mean syntactic macros don’t have their place. Used appropriately, they can DRY up your code in ways that even runtime metaprogramming cannot.

Peter Seibel gives a good example of building a unit test framework in Common Lisp. The idea is that we’d like to assert certain code is true, but also show the asserted code in our report. For example:

pass ... (= (+ 1 2) 3)
pass ... (= (+ 1 2 3) 6)
pass ... (= (+ -1 -3) -4)

The code to make this work, assuming report-result is implemented correctly, looks like this:

(defun test-+ ()
  (report-result (= (+ 1 2) 3) '(= (+ 1 2) 3))
  (report-result (= (+ 1 2 3) 6) '(= (+ 1 2 3) 6))
  (report-result (= (+ -1 -3) -4) '(= (+ -1 -3) -4)))

Notice the ugly duplication in each call to report-result. We have the code that’s actually executed (the first parameter), and the quoted list to report (the second parameter). Runtime metaprogramming could not solve the problem because the first parameter would be evaluated before being passed to report-result. Static code generation could remove the duplication, but it would be ugly. We could DRY up the code at compile time, if only we had access to the abstract syntax tree. Fortunately, in CL, the source code is little more than a textual representation of the AST.

Here’s the macro that Seibel comes up with:

(defmacro check (&body forms)
  `(progn
    ,@(loop for f in forms collect `(report-result ,f ',f))))

Notice how the source code within the list (represented as the loop variable f) is both executed and quoted. The test now becomes much simpler:

(defun test-+ ()
  (check (= (+ 1 2) 3))
  (check (= (+ 1 2 3) 6))
  (check (= (+ -1 -3) -4)))

Summary

Finding ways to eliminate duplication is always A Good Thing. For a long time, if you were programming in a mainstream language, then static code generation was your only option when code generation was needed. Things changed with the advent of reflection based languages, particularly when Java and C# joined the list of mainstream languages. Even though their metaprogramming capability isn’t as powerful as languages like Smalltalk and Ruby, they at least introduced metaprogramming techniques to the masses.

Of course, Lisp has been around since, say, the 1950’s (I’m not sure how long macros have been around, however). Syntactic macros provide a very powerful way of generating code, even letting you change the language. But until more languages implement them, they will never become as popular as they should be.


Written by Brandon Byars

March 29, 2008 at 6:00 pm

Configuring ActiveRecord to work with SQL Server 2005

As much as possible, I like a zero-install configuration. In other words, I want to simply check out a codebase, run an automated build process, and start working. Ideally, I’d like to be able to do that on a clean machine.

It doesn’t always work, of course. For instance, even though most of our production code is written in .NET, we use Ruby extensively for automation. Installing Ruby is one of those dependencies that we live with. But installing Ruby isn’t enough; we also need Rails (or at least ActiveRecord) for some data management scripts we have, Watir, and some fixes for ActiveRecord to work with SQL Server.

All of that can be done fairly easily by somebody who knows what they’re doing, but new developers often don’t know what they’re doing, and I strive to be dispensable. So we wrote a script that configures Ruby and the necessary libraries for our development environment.

First, we needed to install the necessary gems. This is quite easy to do on the command line, but it took me a little digging before I figured out how to do it in code:

needed_gems = ["rails"]

require "rubygems"
Gem.manage_gems

needed_gems.each do |gem|
  puts "Installing gem #{gem}..."
  Gem::GemRunner.new.run ["install", gem, "--include-dependencies"]
end

SQL Server requires an ADO adapter that doesn’t ship with Ruby. You can read about it here. All that’s needed is to download the ADO.rb file (which we have versioned next to our setup script) and copy it to the right place, creating the directory if needed:

setup_dir = File.dirname(__FILE__)
ruby_dir = "C:/ruby/lib/ruby"
ruby_version = 1.8

# The ADO adapter needed for using dbi (the standard db access)
# with SQL Server, which does not come with ruby by default.
# See http://wiki.rubyonrails.org/rails/pages/HowtoConnectToMicrosoftSQLServer.

puts "creating ADO adapter..."
ado_dir = "#{ruby_dir}/site_ruby/#{ruby_version}/DBD/ADO"
system "if not exist #{ado_dir} mkdir #{ado_dir}"
system "copy /Y #{setup_dir}/ADO.rb #{ado_dir}"

Finally, we use SQL Server 2005, and we want to use Windows Authentication for all of our Ruby scripts. Neither SQL Server 2005 nor Windows Authentication is supported by rails out of the box. The problem, described on the SQL Server wiki for rails, is the connection string rails builds. At first, we were taking the suggestions of some of the comments on the wiki and changing the sqlserver_adapter.rb file that ships with rails. This obviously isn’t ideal, so now we monkey-patch it in our code that accesses the database:

module DBI
  # We have to monkey patch this because the SQL Server adapter that comes
  # with rails (active_record\connection_adapters\sqlserver_adapter.rb)
  # doesn't work with SQL Server 2005 or with Integrated Security yet.
  class << self
    alias_method :original_connect, :connect

    def connect(driver_url, username, password)
      # Change to SQL 2005
      driver_url.sub! "SQLOLEDB", "SQLNCLI"

      # Change to Windows Authentication
      driver_url.sub! /User Id=[^;]*;Password=[^;]*;/, "Integrated Security=SSPI;"

      original_connect(driver_url, username, password)
    end
  end
end

And that’s it. You still can’t check out the codebase and start working on a clean machine, but it’s not bad. Install Ruby, run setup.rb. All database code has been patched to deal with our environment.

Written by Brandon Byars

October 23, 2007 at 7:13 pm

Auto-merging fixes

Paul Gross recently blogged about a rake task to automatically merge changes to the trunk if the change was made in a branch. This seemed like a useful feature, even though we don’t use rake.

Fixing production bugs and merging is no fun, but why not take some of the pain out of the process? Depending on the developer and their mood, we either fix the bug in the trunk and merge to the branch, or fix it in the branch and merge to the trunk. Where it really gets ugly is when we have to merge two release branches back, because we make our release branch a few days before actually pushing it to production. Any urgent bug fix requiring a merge during that time has to be merged both to the new release branch as well as to the release branch currently in production.

Using Paul’s code as a base, I wrote automerge.rb, which, by default, either merges to the trunk (if you’re in a branch), or merges to the latest release branch (if you’re in the trunk). Alternatively, you can pass a release number, and automerge.rb will merge to that release branch. In all cases, you have to have the working copy checked out on your machine, and, if you’re on Windows, you need to make sure to put patch in your path.

The script assumes that your directory structure looks something like the following:

/trunk
 /branches
   /Release-2.8
   /Release-2.9

The Release-major-minor format is just our branch naming standard; it’s easy to change.

if !ARGV[0].nil?
  dest_dir = "branches/Release-#{ARGV[0]}"
  replace_pattern = /(trunk|branches).*$/i
elsif %x[svn info].include? "branches"
  dest_dir = "trunk"
  replace_pattern = /branches.*$/i
elsif %x[svn info].include? "trunk"
  pattern = /^Release-(\d+)\.(\d+)\/$/

  branches = %x[svn list http://devtools01:8080/svn/lazarus/branches]
  releases = branches.split.find_all {|branch| branch =~ pattern}

  # sort them in reverse order by major, minor
  releases.sort! do |first, second|
    first_major, first_minor = pattern.match(first)[1..2]
    second_major, second_minor = pattern.match(second)[1..2]

    if first_major != second_major
      second_major.to_i <=> first_major.to_i
    else
      second_minor.to_i <=> first_minor.to_i
    end
  end

  dest_dir = "branches/#{releases[0]}"
  replace_pattern = /trunk/
end

puts "Merging changes into #{dest_dir}.  Don't forget to check these in."
dest_path = Dir.pwd.gsub(replace_pattern, dest_dir)
puts dest_path
system "svn diff | patch -p0 -d #{dest_path}"

The only tricky part here is figuring out the latest release branch, which is done by using the svn list command followed by a custom sort.
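To see why the numeric comparison matters: a plain string sort would put Release-2.10 before Release-2.9. Here’s the same comparator from the script run against some made-up branch names:

```ruby
pattern = /^Release-(\d+)\.(\d+)\/$/
releases = ["Release-2.9/", "Release-2.10/", "Release-1.12/"]

# Sort in reverse order by major, then minor, comparing as integers so
# that 10 sorts after 9 (a lexical sort would get this backwards).
releases.sort! do |first, second|
  first_major, first_minor = pattern.match(first)[1..2]
  second_major, second_minor = pattern.match(second)[1..2]

  if first_major != second_major
    second_major.to_i <=> first_major.to_i
  else
    second_minor.to_i <=> first_minor.to_i
  end
end

puts releases.first  # => Release-2.10/
```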

Written by Brandon Byars

October 22, 2007 at 8:06 pm

Posted in Configuration Management, Ruby


C# Enum Generation

Ayende recently asked on the ALT.NET mailing list about the various methods developers use to provide lookup values, with the question framed as one between lookup tables and enums. My own preference is to use both, but keep it DRY with code generation.

To demonstrate the idea, I wrote a Ruby script that generates a C# enum file from some metadata. I much prefer Ruby to pure .NET solutions like CodeSmith—I find it easier and more powerful (I do think CodeSmith is excellent if there is no Ruby expertise on the team, however). The full source for this example can be grabbed here.

The idea is simple. I want a straightforward and extensible way to provide metadata for lookup values, following the Ruby Way of convention over configuration. XML is very popular in the .NET world, but the Ruby world views it as overly verbose, and prefers lighter markup languages like YAML. For my purposes, I decided not to mess with markup at all (although I’m still considering switching to YAML—the hash of hashes approach describes what I want well). Here’s some example metadata:

enums = {
  'OrderType' => {},
  'MethodOfPayment' => {:table => 'PaymentMethod',},
  'StateProvince' => {:table => 'StateProvinces',
                      :name_column => 'Abbreviation',
                      :id_column => 'StateProvinceId',
                      :transformer => lambda {|value| value.upcase},
                      :filter => lambda {|value| !value.empty?}}
}

That hash, which is valid Ruby code, describes three enums, which will be named OrderType, MethodOfPayment, and StateProvince. The intention is that, where you followed your database standards, you should usually be able to get by without adding any extra metadata, as the OrderType example shows. The code generator will get the ids and enum names from the OrderType table (expecting the columns to be named OrderTypeId and Description) and create the enum from those values. As StateProvince shows, the table name and both column names can be overridden.

More interestingly, you can both transform and filter the enum names by passing lambdas (which are like anonymous delegates in C#). The ‘StateProvince’ example above will filter out any states that, after cleaning up any illegal characters, equal an empty string, and then it will upper case the name.
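Here’s a sketch of how such lambdas would be applied to raw name-column values (the data is made up, and the cleaning step is simplified from the generator’s clean method):

```ruby
transformer = lambda {|value| value.upcase}
filter = lambda {|value| !value.empty?}

# Illustrative raw rows: id => name-column value
raw_values = {1 => 'tx', 2 => '--', 3 => 'ny'}

values = {}
raw_values.each do |id, name|
  cleaned = name.gsub(/[^\w]/, '')  # strip illegal characters first
  values[id] = transformer.call(cleaned) if filter.call(cleaned)
end

# values now holds {1 => "TX", 3 => "NY"}; row 2 cleaned to an empty
# string and was filtered out.
```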

We use a pre-build event in our project to build the enum file. However, if you simply overwrite the file every time you build, you may slow down the build process considerably. MSBuild (used by Visual Studio) evidently sees that the timestamp has been updated, so it rebuilds the project, forcing a rebuild of all downstream dependent projects. A better solution is to only overwrite the file if there are changes:

require File.dirname(__FILE__) + '/enum_generator'

gen = EnumGenerator.new('localhost', 'database-name')
source = gen.generate_all('Namespace', enums)

filename = File.join(File.dirname(__FILE__), 'Enums.cs')
if Dir[filename].empty? || source != IO.read(filename)
  File.open(filename, 'w') {|file| file << source}
end

I define the basic templates straight in the EnumGenerator class, but allow them to be swapped out. In theory, the default name column and the default lambda for generating the id column name given the table name (or enum name) could be handled the same way. Below is the EnumGenerator code:

class EnumGenerator
  FILE_TEMPLATE = <<EOT
//------------------------------------------------------------------------------
// <auto-generated>
//     This code was generated by a tool from <%= catalog %> on <%= server %>.
//
//     Changes to this file may cause incorrect behavior and will be lost if
//     the code is regenerated.
// </auto-generated>
//------------------------------------------------------------------------------

namespace <%= namespace %>
{
    <%= enums %>
}
EOT

  ENUM_TEMPLATE = <<EOT
public enum <%= enum_name %>
{
<% values.keys.sort.each_with_index do |id, i| -%>
    <%= values[id] %> = <%= id %><%= ',' unless i == values.length - 1 %>
<% end -%>
}

EOT

  # Change the templates by calling these setters
  attr_accessor :enum_template, :file_template

  attr_reader :server, :catalog

  def initialize(server, catalog)
    @server, @catalog = server, catalog
    @enum_template, @file_template = ENUM_TEMPLATE, FILE_TEMPLATE
  end
end

The code generation uses erb, the standard Ruby templating language:

def transform(template, template_binding)
  erb = ERB.new(template, nil, '-')
  erb.result template_binding
end

template_binding describes the variables available to the template in much the same way that Castle MonoRail’s PropertyBag describes the variables available to the views. The difference is that, because Ruby is dynamic, you don’t have to explicitly add values to the binding. The rest of the code is shown below:
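A minimal illustration of how erb picks up local variables through a binding (the template here is made up for the example):

```ruby
require 'erb'

# Any local variable visible in the binding we pass is available
# inside <%= %> expressions in the template.
enum_name = 'OrderType'
values = {1 => 'Standard', 2 => 'Expedited'}

template = 'public enum <%= enum_name %> // <%= values.length %> values'
puts ERB.new(template).result(binding)
# => public enum OrderType // 2 values
```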

def generate(enum_name, attributes)
  table = attributes[:table] || enum_name
  filter = attributes[:filter] || lambda {|value| true}
  values = enum_values(table, attributes)
  values.delete_if {|key, value| !filter.call(value)}
  transform enum_template, binding
end

def generate_all(namespace, metadata)
  enums = ''
  metadata.keys.sort.each {|enum_name| enums << generate(enum_name, metadata[enum_name])}
  enums = enums.gsub(/\n/m, "\n\t").strip
  transform file_template, binding
end

private
def enum_values(table, attributes)
  sql = get_sql table, attributes
  @dbh ||= DBI.connect("DBI:ADO:Provider=SQLNCLI;server=#{server};database=#{catalog};Integrated Security=SSPI")
  sth = @dbh.execute sql
  values = {}
  sth.each {|row| values[row['Id']] = clean(row['Name'], attributes[:transformer])}
  sth.finish

  values
end

def get_sql(table, attributes)
  id_column = attributes[:id_column] || "#{table}Id"
  name_column = attributes[:name_column] || "Description"
  "SELECT #{id_column} AS Id, #{name_column} AS Name FROM #{table} ORDER BY Id"
end

def clean(enum_value, transformer=nil)
  enum_value = '_' + enum_value if enum_value =~ /^\d/
  enum_value = enum_value.gsub /[^\w]/, ''
  transformer ||= lambda {|value| value}
  transformer.call enum_value
end

Caveat Emptor: I wrote this code from scratch today; it is not the same code we currently use in production. I think it’s better, but if you find a problem with it please let me know.

Written by Brandon Byars

October 21, 2007 at 9:54 pm

Posted in .NET, Code Generation, Ruby


TDD’ing a Markov Chain

In The Practice of Programming, Brian Kernighan and Rob Pike develop a simple algorithm to scramble text, yet do so in a way that helps prevent turning the output into gibberish. The idea is simple to understand:

  • Parse the input into a list of prefixes. A prefix of two words long seems to work well.
  • For each prefix in the input text, keep a list of suffixes, where a suffix is the word immediately following the prefix. If the same suffix exists in the input multiple times for a prefix, it should be listed multiple times in the suffix list.
  • To create the output text, starting with an initial prefix, randomly select a suffix for that prefix. Then slide your prefix over one word so that it now includes the last word of the previous prefix and the new suffix. Rinse, lather, repeat.

Kernighan and Pike suggest using sentinel values to help start and end the process. The algorithm is a simple example of a Markov chain algorithm.
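The steps above can be sketched in a few lines of Ruby (ignoring the sentinel bookkeeping; this is an illustration of the algorithm, not the implementation developed below):

```ruby
def markov(text, length)
  words = text.split
  suffixes = Hash.new {|hash, key| hash[key] = []}

  # Each consecutive pair of words is a prefix; the word after it is a
  # suffix. Duplicates stay in the list, which weights the random choice.
  words.each_cons(3) {|first, second, third| suffixes[[first, second]] << third}

  prefix = words[0, 2]
  output = prefix.dup
  (length - 2).times do
    candidates = suffixes[prefix]
    break if candidates.empty?
    word = candidates[rand(candidates.length)]
    output << word
    prefix = [prefix[1], word]  # slide the prefix over one word
  end
  output.join(' ')
end

puts markov('the quick brown fox jumps over the quick brown dog', 8)
```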

I decided to give a Ruby implementation a go in a TDD fashion to see if I could learn something. I had the advantage of having already seen a few implementations by Kernighan and Pike, but to keep it somewhat fair, I didn’t consult the book during the implementation.

If you want to follow along with the code, you can download it here. To give you a sense of where we’re going, here’s the first stanza of Edgar Allan Poe’s The Raven, after scrambling:

Once upon a bust of Pallas just above my chamber door; –
This it is, and this mystery explore; –
‘Tis the wind and nothing more.

As always, the hardest part is just getting started. On string algorithms, I find starting with an empty string keeps me in my comfort zone:

class EmptyStringMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new ''
  end

  def test_speak
    assert_equal '', @markov.speak
  end
end

Test::Unit, like most members of the xUnit family that work in reflection-equipped languages, relies on a simple naming convention—all methods starting with ‘test_’ are executed. It fails, of course, because Markov doesn’t exist yet. This should do the trick:

class Markov
  def initialize(text)
  end

  def speak
    ''
  end
end

In my experience, developers who struggle to understand TDD, or don’t want to understand TDD, look at the above code and wonder what the point is. I haven’t really tested anything, because I haven’t really implemented anything. However, I’ve designed the simplest thing I can think of that gets me what I want. Design is hard, and simple design is even harder (don’t confuse simple with easy).

The code above isn’t hard at all. This is the whole point of TDD, but many people struggle to grasp it. Some people, including those in the BDD camp, point out that many people misunderstand TDD because it has the word ‘test’ in it. Were I writing a TDD logo, it’d look something like this: test DRIVEN development.

I know I need to parse the input into a list of prefixes. I’ll try that next.

def test_prefixes
  assert_equal ["#{Markov::SENTINEL} #{Markov::SENTINEL}"],
    @markov.prefixes
end

Surrounding the original text by sentinel values is one of those tricks that, once you see it, makes perfect sense. Of course, having read Kernighan & Pike’s examples, I’d already seen the trick. Since I’m in a dynamic language, I can run the test right now and see it fail with a NoMethodError. However, I think it’s more valuable to go ahead and add the missing constant and method, and make it fail because the implementation is wrong, not just because it’s missing. I added the following to Markov:

attr_reader :prefixes
SENTINEL = ""

def initialize(text)
  @prefixes = []
end

If you’re not a Ruby programmer, the attr_reader method dynamically creates a prefixes getter method that returns the instance variable of the same name (the @ sign indicates an instance variable, and the [] indicates an empty array). This fails, of course, but it’s easy to fix using Ruby’s string interpolation:

def initialize(text)
  @prefixes = ["#{SENTINEL} #{SENTINEL}"]
end

Yes, I know, I’m cheating. However, my tests pass, and I’d prefer having the tests force me to move away from hard-coding results in the system under test. To do that, I probably need to move away from the empty string:

class OneWordMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new 'word'
  end

  def test_speak
    assert_equal 'word', @markov.speak
  end
end

I gave this test its own fixture. This is something Dave Astels and the BDD community recommend; if your tests require different state than tests you’ve already written, don’t hesitate to create a new class for it. Sticking to one test class per production class is a smell.

After giving it a moment’s thought, I realize I can keep cheating. Maybe this wasn’t such a great test after all. That’s OK; realizing that development is a herky-jerky activity is part of the strength of TDD. Here’s the changed Markov class:

def initialize(text)
  @original_text = text
  @prefixes = ["#{SENTINEL} #{SENTINEL}"]
end

def speak
  @original_text
end

Similarly, for the prefixes:

def test_prefixes
  assert_equal ["#{Markov::SENTINEL} word ", "word #{Markov::SENTINEL}"],
    @markov.prefixes
end

Notice that I’m including trailing whitespace as part of the prefix. I also intend to include trailing punctuation, so I don’t have to be too clever about parsing.

Finally my hard-coding ways have caught up with me. Now, for the first time, I actually have to think a little about implementation. I think that’s A Good Thing—interface first, implementation second. Here’s the changed code:

class Markov
  attr_reader :prefixes

  SENTINEL = ""
  WORDS_IN_PREFIX = 2

  def initialize(text)
    @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
    build_prefixes
  end

  def speak
    @original_text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
  end

  private
    def build_prefixes
      @prefixes = []
      words = @original_text.scan /\S+\s*/
      max_index = words.length - WORDS_IN_PREFIX
      (0..max_index).each do |i|
        prefix = words[i..(i+WORDS_IN_PREFIX-1)].join
        @prefixes << prefix unless @prefixes.include?(prefix)
      end
    end
end

I added the sentinels around the input text and parsed it into a list of two-word prefixes (The scan method, which you can read about in the Pickaxe book, returns an array of all non-overlapping substrings matching the regular expression. Regular expressions are delimited by slashes). The speak method now has to remove the sentinels. This fixes my new test, but breaks my empty string prefix test. An empty string will leave two spaces between the sentinels (look at the constructor above—there’s a sentinel+space before the text, and a space+sentinel after the text). I fix the test and get back to green.
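A quick illustration of what scan returns with that pattern:

```ruby
# Each match is one run of non-whitespace plus whatever whitespace
# follows it, so every word keeps its trailing space (except the last).
words = 'the lyf so short'.scan(/\S+\s*/)
puts words.inspect  # => ["the ", "lyf ", "so ", "short"]
```

Keeping the trailing whitespace in each word is what lets build_prefixes join adjacent words back together without re-inserting spaces.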

It occurs to me that I probably overdid build_prefixes. While I think it’s right, there’s some code in it, like checking for duplicate prefixes, that isn’t strictly dictated by the tests yet. TDD takes discipline to not add code that isn’t demanded by failing tests. Figuring out how to do that isn’t always easy, but in this case it shouldn’t be hard. I just got carried away.

I think it’s still too big of a jump to start expecting random output text according to our Markov chain algorithm, so I start focusing on collecting the suffixes:

# In EmptyStringMarkovTest
def test_suffixes
  suffixes = {"#{Markov::SENTINEL}  #{Markov::SENTINEL}" => []}
  assert_equal suffixes, @markov.suffixes
end

The suffixes variable above represents a hashtable, with an empty array as the value for the prefix key. I can make this one work by hard-coding, and since parsing the suffixes is non-trivial, I think I will:

# In Markov
attr_reader :prefixes, :suffixes

def initialize(text)
  @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
  build_prefixes
  @suffixes = {"#{SENTINEL} #{text} #{SENTINEL}" => []}
end

I’m using the same trick here I used for the prefix code above. I’m designing the code around the simplest example I can think of, and hard-coding a result until forced otherwise. That helps me focus on interface over implementation, without getting bogged down in tricky parsing code. Here, the interface is using a hashtable for suffixes. Now, let’s force some implementation:

# In OneWordMarkovTest
def test_suffixes
  suffixes = {
    "#{Markov::SENTINEL} word " => ["#{Markov::SENTINEL}"],
    "word #{Markov::SENTINEL}" => []
  }
  assert_equal suffixes, @markov.suffixes
end

This one requires some thought:

def initialize(text)
  @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
  build_prefixes
  build_suffixes
end

private
def build_suffixes
  @suffixes = {}
  @prefixes.each do |prefix|
    @suffixes[prefix] = find_suffixes_for(prefix)
  end
end

You’ll see here the same “fake it until you make it” attitude that I used to hard-code test results. I don’t know how to find the suffixes for a prefix, so I’ll just pretend like the function to do it is already written. That makes the build_suffixes easy, even if it doesn’t work yet. Various authors in the Agile community, including Ron Jeffries and Brian Marick, have recommended using this approach, and I can attest to its effectiveness. Really, it’s the same principle behind divide and conquer algorithms.

Here’s my first attempt at find_suffixes_for:

def find_suffixes_for(prefix)
  pattern = Regexp.new("#{prefix}(\\S+\\s*)")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.offset(1)[0]
    text = text[offset..-1]
  end
  suffixes
end

This gets the test to pass. The match object returned by matching a regular expression returns the entire matched string as match[0], the first matched group (delimited by parentheses in the regular expression) as match[1], and so on. If we find a match, we add it to our suffixes array (using the << append operator) and slice text so that the next search starts at the beginning of the matched suffix, since that suffix is part of the next prefix.
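A quick illustration of the MatchData behavior described above:

```ruby
match = /so (\S+)/.match('lyf so short')
puts match[0]                 # => so short   (the entire match)
puts match[1]                 # => short      (the first parenthesized group)
puts match.offset(1).inspect  # => [7, 12]    (begin/end offsets of group 1)
```

Slicing the text at `match.offset(1)[0]` is what restarts the search at the matched suffix.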

Up until now, I haven’t had to do any refactoring, but it is the crucial last step to each test-code-refactor iteration. The design benefits of TDD depend on it, as it is what enables the ‘evolution’ in ‘evolutionary design’. Let’s face it: everybody loves to refactor. But refactoring takes discipline. Refactoring without tests is a lot like wearing a chastity belt—expect an initial surge of excitement to be followed by a prolonged period of frustration.

Looking over the code, I’m a bit bothered by the duplicated regular expression used to define the words in the text. Looks like I need to extract a constant:

# In Markov
WORD_REGEXP = /\S+\s*/

def build_prefixes
  @prefixes = []
  words = @original_text.scan WORD_REGEXP
  max_index = words.length - WORDS_IN_PREFIX
  (0..max_index).each do |i|
    prefix = words[i..(i+WORDS_IN_PREFIX-1)].join
    @prefixes << prefix unless @prefixes.include?(prefix)
  end
end

def find_suffixes_for(prefix)
  pattern = Regexp.new("#{prefix}(#{WORD_REGEXP.source})")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.offset(1)[0]
    text = text[offset..-1]
  end
  suffixes
end

I’ve pretty much exhausted my options with empty strings and one-word strings. I have a good start on parsing the prefixes and suffixes, so maybe it’s time to start on the Markov chain itself. First, I need some input that forces the result of the speak method to be different from the original text. Since suffixes are randomly chosen, a small string may not provide enough data to statistically guarantee a different result.

class StatisticalMarkovTest < Test::Unit::TestCase
  def setup
    @text = ('a b c ' * 10) + 'a b d '
    @text *= 1000
    @text = @text.strip
    @markov = Markov.new @text
  end

  def test_prefixes
    prefixes = ["#{Markov::SENTINEL} a ", "a b ", "b c ",
      "c a ", "b d ", "d a ", "d #{Markov::SENTINEL}"]
    assert_equal prefixes, @markov.prefixes
  end

  def test_suffixes
    assert_equal 10000, suffixes_with("c ")
    assert_equal 1000, suffixes_with("d ")
  end

  def suffixes_with(suffix)
    @markov.suffixes["a b "].find_all {|s| s == suffix}.length
  end
end

I wrote the prefixes test just to make sure I understood what the prefixes were. It passed the first time. I wrote the suffixes test to give me a little more confidence in my parsing. It also passed the first time. I’m not entirely sure either test was really useful, and it’s quite possible that I could have just been procrastinating.

def test_weighted_random
  markov_text = @markov.speak
  assert_not_equal @text, markov_text
end

I’m not done with this test, but I’ll leave it like this until I get back to green. After quite a bit of experimenting (encompassing more time than I usually like to go between passing tests), I come up with the following:

class Markov
  attr_reader :prefixes, :suffixes

  SENTINEL = ""
  WORDS_IN_PREFIX = 2
  WORD_REGEXP = /\S+\s*/

  def initialize(text)
    @original_text = "#{SENTINEL} #{text} #{SENTINEL}"
    build_prefixes
    build_suffixes
  end

  def speak
    text = words(@original_text)[0..1].join
    text += random_suffix_for(text) until random_suffix_for(text).nil?
    text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
  end

  private
  def build_prefixes
    @prefixes = []
    words = words(@original_text)
    max_index = words.length - WORDS_IN_PREFIX
    (0..max_index).each do |i|
      prefix = words[i..(i+WORDS_IN_PREFIX-1)].join
      @prefixes << prefix unless @prefixes.include?(prefix)
    end
  end

  def build_suffixes
    @suffixes = {}
    @prefixes.each do |prefix|
      @suffixes[prefix] = find_suffixes_for(prefix)
    end
  end

  def find_suffixes_for(prefix)
    pattern = Regexp.new("#{prefix}(#{WORD_REGEXP.source})")
    suffixes = []
    text = @original_text

    loop do
      match = pattern.match(text)
      break if match.nil?
      suffixes << match[1]
      offset = match.offset(1)[0]
      text = text[offset..-1]
    end
    suffixes
  end

  def random_suffix_for(text)
    words = words(text)
    prefix = words[(words.length-WORDS_IN_PREFIX)..-1].join
    suffixes = @suffixes[prefix]
    return nil if suffixes.nil? || suffixes.length == 0
    suffixes[rand(suffixes.length-1)]
  end

  def words(text)
    # We're including punctuation as part of the word
    text.scan(WORD_REGEXP)
  end
end

I refactored the scan call into a words method while I was working because I need it in several places. This code works, but slowly. That’s OK—I’ll deal with that later.

This was a big jump, maybe too big. I might have been better off trying to come up with smaller steps. But the deed is done; let’s see if we can clean it up. Gritting my teeth, I decide to ignore the performance issue and go for some refactoring. The first thing I see is that I can make parsing the prefixes a bit more intention-revealing:

def speak
  text = first_prefix(@original_text)
  text += random_suffix_for(text) until random_suffix_for(text).nil?
  text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix_for(text)
  prefix = last_prefix(text)
  suffixes = @suffixes[prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length-1)]
end

def last_prefix(text)
  words = words(text)
  words[(words.length-WORDS_IN_PREFIX)..-1].join
end

def first_prefix(text)
  words(text)[0..1].join
end

I make the change and run the tests. 106 seconds later, I see I’m still on green. Fine, I guess it’s time to deal with performance.

It’s important to note that this is the first time I’ve really considered performance in the implementation. As Kernighan and Pike say, the first rule of performance tuning is: don’t. Even more caustic is Knuth’s (or Hoare’s?) comment that “premature optimization is the root of all evil.” It’s an evil I’ve committed too often in my career, and every time the software has been the worse for it. Write your code as cleanly and intention-revealing as possible. If, and only if, there are performance problems should you start optimizing. Remember, about 4% of your program will account for about 50% of the runtime; optimizing just because you can means you’re probably wasting your time in the other 96% of the code.

I break out the profiler and run my slow test (ruby -r profile statistical_markov_test.rb -n weighted_random). Here’s a sample of the top of the output:

1 tests, 1 assertions, 0 failures, 0 errors
%   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
23.56   12.24     12.24        7  1748.57  2685.71  Kernel.loop
17.99   21.59      9.35     5668     1.65     1.65  String#scan
16.42   30.12      8.53        1  8530.00 18550.00  Range#each
8.95    34.77      4.65    33001     0.14     0.21  Array#include?
4.31    37.01      2.24   103016     0.02     0.02  String#==
3.50    38.83      1.82    77379     0.02     0.02  Array#[]
3.41    40.60      1.77     5665     0.31     2.44  Markov#random_suffix_for

It looks like the String#scan method is my main problem (Ruby documenters use the # character to indicate an instance method). That makes sense; it’s used in my words method, which is called every time a prefix is requested in random_suffix_for.

My first thought is to cache:

def words(text)
  # We're including punctuation as part of the word
  # This is the bottleneck of the algorithm, hence the caching
  @words ||= {}
  @words[text] ||= text.scan(WORD_REGEXP)
  @words[text]
end

The best thing I can say about this is that it’s not the stupidest thing I’ve ever done. After wrestling with my machine for several minutes to regain control because of the massive pagefile swapping, I’m finally able to kill the ruby process running my tests.

Looking at the code, I’m almost embarrassed to point out that the caching takes up memory but never actually gets used: the text used as the key into the hashtable is different on every suffix lookup. I’m shaking my head in shame. I really wish you had stopped me from making such a public fool of myself.

Now that I’m seeing things a little bit more clearly, I notice how the text that’s passed in is monotonically increasing in length, adding one suffix each time. Rather than scanning each time, we could scan once, and then add each suffix manually:

def speak
  @text = first_prefix(@original_text)
  @words = words(@text)
  while (suffix = random_suffix_for(@text)) != nil
    @text << suffix
    @words << suffix
  end
  @text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix_for(text)
  prefix = last_prefix
  suffixes = @suffixes[prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length-1)]
end

def last_prefix
  @words[(@words.length-WORDS_IN_PREFIX)..-1].join
end

This runs in a respectable 5 seconds, which, if memory serves, puts it roughly in the ballpark Kernighan and Pike had.

Notice how, by keeping more state, I’ve eliminated the parameter to last_prefix. I think I can do the same thing for random_suffix_for:

def speak
  @text = first_prefix(@original_text)
  @words = words(@text)
  while (suffix = random_suffix) != nil
    @text << suffix
    @words << suffix
  end
  @text.gsub Regexp.new("\\s*#{SENTINEL}\\s*"), ''
end

def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length-1)]
end

Scanning the rest of the code, I notice an egregious bit of duplication that I had somehow missed before:

def first_prefix(text)
  words(text)[0..1].join
end

Where did the 1 come from in the array slice? It’s really this:

def first_prefix(text)
  words(text)[0..(WORDS_IN_PREFIX-1)].join
end

I feel pretty close now, and I think it’s time to do some real testing. One of the things I remember Kernighan and Pike doing is verifying that, if ‘a b c ’ outnumbers ‘a b d ’ 10:1, then we should see 10 times as many ‘c’ suffixes as ‘d’ suffixes. This is where I was headed with the test_weighted_random method. However, I’ll need to count the number of occurrences of a substring, and I don’t see a library method to help me. No problem, this is Ruby after all. Here’s the code I ended up with after a couple TDD cycles, which have been compressed for your viewing convenience:

class StringTest < Test::Unit::TestCase
  def test_zero_occurrences
    assert_equal 0, 'the quick brown fox'.occurrences_of('cat')
  end

  def test_occurrences_in_empty_string
    assert_equal 0, ''.occurrences_of('a')
    assert_equal 1, ''.occurrences_of('')
  end

  def test_occurrences_of
    text = ('a b c ') * 10 + 'a b d'
    assert_equal 10, text.occurrences_of('a b c')
    assert_equal 1, text.occurrences_of('a b d')
  end
end

class String
  def occurrences_of(substring)
    count, i = 0, 0
    loop do
      i = self.index(substring, i)
      break if i.nil?
      count += 1
      i += 1
    end
    count
  end
end

Adding a method to an already existing class like this is known as aspect-oriented introduction. It’s a difficult, and sometimes impossible, feat in many languages. In Ruby, it’s that easy.

Now I can change my statistical test:

def test_weighted_random
  markov_text = @markov.speak
  assert_not_equal @text, markov_text
  count_bc = markov_text.occurrences_of('b c')
  count_bd = markov_text.occurrences_of('b d')
  ratio_bc_to_bd = count_bc / count_bd
  assert ratio_bc_to_bd.between?(9, 11),
    "there are #{ratio_bc_to_bd} times as many 'b c' fragments as 'b d' fragments (should be ~10)"
end

I’m pleased to see it work the first time. I decide to test with some real input now. I’d like to verify that every three word combination (2-word prefix + suffix) in the output also exists in the source text. To do that, I’ll need access to the words method. I decide it probably belongs in the String class too, so I move it and make a few tests for it:

class String
  WORD_REGEXP = /\S+\s*/

  def words
    # We're including punctuation as part of the word
    scan(WORD_REGEXP)
  end
end

class StringTest < Test::Unit::TestCase
  def test_words_in_empty_string
    assert_equal [], ''.words
  end

  def test_words
    assert_equal ['the ', 'quick ', 'brown ', 'fox'],
      'the quick brown fox'.words
  end

  def test_words_include_punctuation
    assert_equal ['first, ', 'second!  ', 'third? ', 'fourth...'],
      'first, second!  third? fourth...'.words
  end
end

Now I grab some real input and give it a go:

class ComplexMarkovTest < Test::Unit::TestCase
  def setup
    @markov = Markov.new the_raven
  end

  def test_speak
    markov_text = @markov.speak
    puts markov_text
    assert_not_equal the_raven, markov_text, "should scramble text"

    # each (prefix + suffix) combo must also be in original text
    words = markov_text.words
    phrase_size = Markov::WORDS_IN_PREFIX + 1
    (0..words.length-phrase_size).each do |i|
      phrase = words[i..(i+phrase_size-1)].join
      assert the_raven.include?(phrase), "#{phrase} not in original"
    end
  end

  def the_raven
    # Returns the entire text of Edgar Allen Poe's "The Raven"
    # download the source code if you want to see the whole thing...
  end
end

I’m a bit surprised to see this one fail. After a good bit of playing around, I discover that I had an off-by-one error because I misunderstood how the library rand method worked. According to the documentation, if the parameter max is an integer greater than 0, then it returns a random integer in the range [0, max-1]. I was already subtracting one, so I was always leaving off the last suffix.

That makes me wonder how it ever worked for prefixes that had only one suffix, which is the way all my earliest tests worked. If max is 0, rand returns a real number between 0.0 and 1.0. I’m not sure how this works for array indexes, so I try it out in irb:

middlestate:~/Development/markov_tdd/markov/test bhbyars$ irb
>> test = [1]
=> [1]
>> test[0.5]
=> 1
>> test[0.3]
=> 1
>> test[0.01]
=> 1
>> test[0.9]
=> 1

I just got lucky then (or unlucky); Ruby appears to always truncate the array index, masking my defect. Testers like to distinguish between faults and failures. I had a fault all along in my code, but until now it had not caused an observable failure. It’s an easy fix:

def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.nil? || suffixes.length == 0
  suffixes[rand(suffixes.length)]
end

You may have noticed that I added a line in the speak test to echo the markov_text. This was primarily for my own curiosity; I wanted to see what the text looked like. But I noticed something odd—the text wasn’t ending on ‘nevermore!’, which is the last word of the poem. The use of the sentinel values should force it to end on the last word in the source text, since the only empty prefix should be the last word plus the sentinel value. Let me test that:

def test_all_prefixes_have_suffix_except_last
  keys = @markov.suffixes.keys.find_all {|key| @markov.suffixes[key] == []}
  assert_equal ["#{the_raven.words.last} #{Markov::SENTINEL}"], keys
end

Here’s the result:

1) Failure:
test_all_prefixes_have_suffix_except_last:10
 expected but was
.

I spot the error almost immediately. All the extra prefixes have a ? character in them, which has meaning in regular expressions. I need to escape the prefix when building the Regexp:

def find_suffixes_for(prefix)
  pattern = Regexp.new("#{Regexp.escape(prefix)}(#{String::WORD_REGEXP.source})")
  suffixes = []
  text = @original_text

  loop do
    match = pattern.match(text)
    break if match.nil?
    suffixes << match[1]
    offset = match.offset(1)[0]
    text = text[offset..-1]
  end
  suffixes
end

While thinking about the prefix-suffix relationship, I realized that I had an unnecessary nil check in random_suffix; all prefixes should have suffixes, and only the last one should be empty:

def random_suffix
  suffixes = @suffixes[last_prefix]
  return nil if suffixes.length == 0
  suffixes[rand(suffixes.length)]
end

Now all tests pass, and the output looks right. I note to myself how, while the simple unit tests did help me design the application, it took some real tests with real input before I caught some rather important defects.

And now I admire my creation:

Open here I flung the shutter, when, with many a quaint and curious volume of forgotten lore,
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door –
only this, and nothing more.

Written by Brandon Byars

May 6, 2007 at 12:40 pm

Posted in Ruby, TDD


Using Rails Fixtures For Storing Test Data

Yesterday I wrote about trying to retrofit agility into a large enterprise database. Even after we got our database scripted so that every developer could use a localhost database, we still had to switch back to an integration database too often for the local setup to be worthwhile. The problem was test data: we simply didn’t have enough to be useful, and what we had was difficult to maintain.

At first we tried SQL scripts, but quickly found them too hard to read. After creating a script to add data for one table, nary another script was written (random rant: why can’t the SQL INSERT statement follow the syntax of the UPDATE statement? With big tables, the positional coupling between the column list and the VALUES list is just too hard to maintain). Next, we tried a CSV file, which worked somewhat nicely because we could edit it in Excel. However, the BULK INSERT command we used to load it caused too many problems. Any time a column was added, even if it was a nullable column, even if it was a computed column, you had to add it to the CSV file. And both the SQL and CSV files lacked the ability to add dynamic data. Many times we wanted to enter yesterday for a date, regardless of what actual date that happened to be. In some cases, we’d simply like to add 1000 rows, without regard to what data those rows contained.

Rails Fixtures

In the end, we hijacked some functionality from the Ruby on Rails testing framework. The Fixtures class allows you to make the following call in your tests to delete and re-add test data before each test:

fixtures :orders, :order_items

That one line of code will, by default, delete the order_items and orders tables (in that order), and then add all data from the orders.yml and order_items.yml files (in that order). YAML is preferred in the Ruby world for its readability; XML, like so many “enterprisey” technologies, is considered bulky and clumsy by the Rails team. Even better, the Rails code first runs your YAML files through ERB (a Ruby templating engine) before sending them to the YAML parser, so you can add dynamic content.
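The ERB-then-YAML pipeline is easy to demonstrate in isolation (a minimal sketch with made-up table and column names; the real Fixtures class does considerably more):

```ruby
require 'erb'
require 'yaml'
require 'date'

# A fixture with dynamic content: "yesterday" is computed when the file
# is loaded, so the test data stays fresh no matter when the tests run.
# (The record and column names here are made up for illustration.)
template = <<~YML
  recent_order:
    OrderId: 1
    OrderDate: "<%= (Date.today - 1).to_s %>"
YML

# Rails expands the ERB first, then hands the result to the YAML parser.
data = YAML.load(ERB.new(template).result)
```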

It turns out that using the fixtures code outside of Rails is really quite easy. First, we need to set up our connection information and define the necessary ActiveRecord classes. We can define all them in one file (I’ll call it driver.rb):

require 'active_record'
require 'active_record/fixtures'

ActiveRecord::Base.establish_connection(
  :adapter  => "sqlserver",
  :database =>  "db_name",
  :username => "db_user",
  :password => "password"
)

ActiveRecord::Base.logger = Logger.new(File.dirname(__FILE__) + "/debug.log")

class DisplayCategory < ActiveRecord::Base
  set_primary_key "DisplayCategoryId"
  set_table_name "DisplayCategories"
end

ActiveRecord is the Rails object-relational mapping framework, and like everything else in Rails, it’s built around the principle of “convention over configuration.” Since we’re dealing with a legacy database, and one not intended to be consumed by a Rails application, we’ll have to settle for a little bit of configuration.

By default, ActiveRecord expects the table name to be the plural version of the class name, with underscores between words (display_categories), and the primary key to be an identity column called id. Our database has a different naming standard, with a healthy dose of standard violations, so we’ll have to add calls to set_primary_key and set_table_name to all of our ActiveRecord classes. DisplayCategories has an identity column, but I’ll show you an example below that does not.

When the script is parsed by Ruby, ActiveRecord will connect to the database and add a property for every column in the table. Metaprogramming is what allows ActiveRecord to be so DRY.
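The technique can be sketched in a few lines of metaprogramming (purely illustrative; ActiveRecord’s real implementation is considerably more sophisticated):

```ruby
# Define an accessor for every column name discovered at runtime --
# roughly how an ORM can turn table columns into object properties.
class Record
  def self.define_columns(names)
    names.each do |name|
      define_method(name) { @attributes[name] }
      define_method("#{name}=") { |value| @attributes[name] = value }
    end
  end

  def initialize
    @attributes = {}
  end
end

Record.define_columns(%w[DisplayName SortIndex])
record = Record.new
record.DisplayName = 'Books'  # a method that didn't exist until runtime
```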

Then, we need the actual data. The Rails convention expects the YAML file to match the table name, so we’ll put the following in data/DisplayCategories.yml:

books:
  DisplayCategoryId: 1
  DisplayName: Books
  SortIndex: 1
fiction:
  DisplayCategoryId: 2
  DisplayName: Fiction
  SortIndex: 2
  ParentDisplayCategoryId: 1

You see how readable YAML is. The keys (books, fiction) are not inserted into the database. We’ll find a good use for them below for testing purposes, but for now we’ll simply use them to help us describe the record.

Let’s run it. The following code should work (in add_test_data.rb):

require 'driver.rb'
Fixtures.create_fixtures("data", [:DisplayCategories])

Unfortunately, we may get an error running add_test_data.rb. By default, the YAML parser simply dumps every record into a hashtable, with the identifier as the key. Since hashtables have no implicit ordering, we could run into a problem since our data requires that it be inserted in order (fiction has a ParentDisplayCategoryId foreign key pointing to books). If order doesn’t matter to you, use the above syntax. When it does matter, it’s a simple change:

--- !omap
- books:
    DisplayCategoryId: 1
    DisplayName: Books
    SortIndex: 1
- fiction:
    DisplayCategoryId: 2
    DisplayName: Fiction
    SortIndex: 2
    ParentDisplayCategoryId: 1

That omap syntax defines an ordered map, which solves our dependency problem. Now, running add_test_data.rb should work.

Dealing with Exceptions

DisplayCategories doesn’t quite meet the Rails conventions, but it’s close. What happens when we’re not close?

Let’s assume the following table definition:

CREATE TABLE PersonName (
  PersonNameId int NOT NULL
    CONSTRAINT PK_PersonName PRIMARY KEY,
  FirstName varchar(100) NOT NULL,
  LastName varchar(100) NOT NULL,
  FullName AS FirstName + ' ' + LastName
)

CREATE TRIGGER Audit_On_Update ON PersonName AFTER UPDATE
AS
INSERT INTO AuditPersonName(DateChanged, PersonNameId, FirstName, LastName)
SELECT GETDATE(), PersonNameId, FirstName, LastName
FROM deleted

Now, our configuration with ActiveRecord becomes more involved. There are four headaches we’ll run into here:

  • We don’t have an identity column
  • We have a computed column
  • Our table name isn’t plural
  • We have a trigger

Let’s tackle them one at a time. Instead of an identity column, let’s assume that we have to call a stored procedure called get_next_id to give us our next id as an output parameter. The following will work:

class PersonName < ActiveRecord::Base
  set_primary_key "PersonNameId"
  set_table_name "PersonName"

  def before_create
    sql = <<-EOT
      DECLARE @id int
      EXEC get_next_id 'PersonName', @id OUTPUT
      SELECT @id
    EOT
    self.id = connection.select_value(sql).to_i
  end
end

We’re using one of the built-in hooks ActiveRecord provides to change how it gets the ids for us. In case you’ve never seen that EOT business, it’s called a here document; you can read about it here.

However, we still won’t be able to use that ActiveRecord implementation because it doesn’t understand computed columns, and will try to add FullName to the insert list. To work around that problem, add the following code to your PersonName class:

class PersonName
  @@computed_columns = ['FullName']

  def initialize(attributes = nil)
    super
    @@computed_columns.each { |column| @attributes.delete(column) }
  end
end

This is a bit hackish, but it gets the job done. I discovered the @attributes instance variable in ActiveRecord::Base when browsing for a solution, and found that I could simply remove the offending columns from it in the constructor.

Speaking of hacks, we can fix the problem of PersonName being singular by changing the way Rails understands the English language. Add the following to driver.rb:

require 'active_support/inflector'

Inflector.inflections do |inflect|
  inflect.uncountable 'PersonName'
end

The uncountable method was intended for words like fish and money. That’s OK; cheating seems like a pragmatic solution here. Chad Fowler mentions this solution in his book Rails Recipes. Other options include:

Inflector.inflections do |inflect|
  inflect.plural /(ox)$/i, '\1en'
  inflect.singular /(ox)en/i, '\1'
  inflect.irregular 'person', 'people'
end

For more details, consult Fowler’s book, or dig through the source code. Now, add some test data and try to run it. We can add it to add_test_data.rb quite easily:

require 'driver.rb'
Fixtures.create_fixtures("data", [:DisplayCategories, :PersonName])

Try running it. If you’re using SQL Server, you should receive an exception that reads “Cannot create new connection because in manual or distributed transaction mode.” That’s because our trigger is getting in the way. Add SET NOCOUNT ON as the first line of your trigger, and try again. Everything should work now.

Using dynamic fixtures

I mentioned above that Fixtures first runs the YAML file through ERB before sending it to the parser. That means we can clean up our data to get rid of most of the noise for columns we don’t care about. The following is a valid YAML fixture:

<%
def ignore_values
  <<EOT
  Column1: 1
  UselessDate: 1/1/2000
  SomeNullColumn:
EOT
end
%>
identifier1:
  ImportantColumn: 1
<%= ignore_values %>
identifier2:
  ImportantColumn: 2
<%= ignore_values %>

Now, at least we’re only writing the columns we care about for each record. Sometimes, however, we can do a lot better. Imagine an address table that stores both a main and a shipping address for each person. The following would be a valid fixture:

<%
@next_id = 0

def addresses_for(name, attributes)
  "#{address_for(name, 'main', attributes)}#{address_for(name, 'shipping', attributes)}"
end

def address_for(name, type, attributes)
  @next_id += 1
  <<EOT
#{name}_#{type}:
  AddressId: #{@next_id}
  PersonId: #{attributes[:PersonId]}
  AddressTypeId: #{type == 'main' ? 1 : 2}
  Geocode: 
  Address1: #{attributes[:Address1]}
  Address2: 
  Address3: 
  City: #{attributes[:City]}
  State: #{attributes[:State]}
  Zip: #{attributes[:Zip]}
  County: #{attributes[:County]}
EOT
end
%>
<%= addresses_for('person1', {
  :PersonId => 1,
  :City => 'Marshall',
  :Address1 => '123 Some Road',
  :State => 'TX',
  :Zip => '75672',
  :County => 'Harrison'
}) %>
<%= addresses_for('person2', {
  :PersonId => 2,
  :City => 'Marshall',
  :Address1 => 'P.O. BOX 123',
  :State => 'TX',
  :Zip => '75672',
  :County => 'Harrison'
}) %>

That fixture will create 4 records, person1_main, person1_shipping, person2_main, and person2_shipping. Since we don’t care about having a separate billing address and shipping address for our test data, we can reduce the amount of information we have to add. And if we ever do care, it’s easy to write it out without calling our addresses_for method.

Testing

The Fixtures class was written for testing, and it has some nice features that give you access to each loaded record, for example, as an instance variable named after the record identifier (@person1, @fiction, etc). I had trouble getting that part to work with SQL Server, but even if you don’t ask Fixtures to load each record for you, you can still access them, which is quite useful if you’re using Ruby for any of your testing (for example, in Watir scripts). Let’s add the following to driver.rb:

require 'test/unit'

Test::Unit::TestCase.fixture_path = File.dirname(__FILE__) + "/data/"
$LOAD_PATH.unshift(Test::Unit::TestCase.fixture_path)

class Test::Unit::TestCase
  self.use_transactional_fixtures = false
  self.use_instantiated_fixtures = :no_instances
end

Now, we should be able to load the fixtures automatically in our test cases:

require 'test/unit'
require 'watir'
require 'driver'

class ProductBrowsingTest < Test::Unit::TestCase
  fixtures :DisplayCategories

  def test_something
    # create a browser instance, and navigate to the right page
    browser.link(:text, @DisplayCategories["fiction"]["DisplayName"]).click
  end
end

Written by Brandon Byars

March 18, 2007 at 6:53 pm

Posted in Database, Ruby


Ruby and SQL DMO

We have a rather large legacy system that we are trying to inject a healthy dose of agility into. One of the biggest challenges has always been versioning the database, which is large, clumsy, and cluttered. We managed to get the rest of the code under a continuous integration scheme, but every time somebody made a schema change, or even updated a stored procedure needed for some tests to work, manual intervention was needed.

Pramod Sadalage and Martin Fowler wrote one of the first articles on iterative database design, and Pramod later teamed up with Scott Ambler to collaborate on Refactoring Databases. The advice, adopted by the Ruby on Rails team, was to create a separate migration file for each schema change, and number them sequentially. For example, the first migration would be 1.sql, then 2.sql, and so on. You could store the latest migration file run on a database in a version table, which would make updating a database as easy as running every migration, in order, whose version number is greater than the one stored in your database table.
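The file-selection logic behind that scheme fits in a few lines (a sketch with an illustrative helper name; the real scripts also updated the version table and executed the SQL):

```ruby
# Given the migration filenames on disk and the version recorded in the
# database, return the files that still need to run, in numeric order.
def pending_migrations(files, current_version)
  files.map { |f| [File.basename(f, '.sql').to_i, f] }
       .select { |version, _| version > current_version }
       .sort_by { |version, _| version }
       .map { |_, f| f }
end

pending_migrations(['1.sql', '3.sql', '2.sql'], 1)  # => ["2.sql", "3.sql"]
```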

We managed to do something similar, although it required a number of changes. But before it could all work, you need a starting point—a base schema to create a new developer database. We use SQL Server 2000, so initially I simply had Enterprise Manager create a SQL script for me. Not only did it not work (I don’t think it sorted the dependencies right), it was a ghastly nightmare to look at.

Why do standard vendor-supplied code generation tools create such ugly code?

I decided to do the code generation myself using SQL DMO (the same COM interfaces that Enterprise Manager was using, just poorly). I’d successfully used Ruby and ERB for code generation before, and discovered I could make very nice looking code (Code Generation in Action by Jack Herrington describes the concept nicely). Within just a couple of hours, I had a SQL script that not only looked much nicer than anything Enterprise Manager spits out, but worked to boot.
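The basic ERB technique is worth seeing on its own (a toy template, not the actual generation script):

```ruby
require 'erb'

# A toy template that renders the clean CREATE TABLE style shown below.
# trim_mode '-' makes lines ending in -%> leave no blank line in the output.
template = ERB.new(<<~SQL, trim_mode: '-')
  CREATE TABLE <%= table %> (
  <% columns.each_with_index do |col, i| -%>
    <%= col %><%= ',' unless i == columns.length - 1 %>
  <% end -%>
  )
SQL

table = 'tablename'
columns = ['id int NOT NULL', 'description varchar(50) NOT NULL']
sql = template.result(binding)
```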

First, I needed to connect to the database I wanted to dump into a SQL file:

require 'win32ole'

class SqlServer
  def initialize(server, catalog)
    sql = WIN32OLE.new("SQLDMO.SQLServer")
    sql.LoginSecure = 1
    sql.Connect(server)
    @db = sql.Databases(catalog)
  end
end

This uses Windows Authentication to connect to the appropriate SQL Server database. The DMO interface, described here, is one of those clumsy, wandering APIs commonly developed when VBScript is your target language (warning, the MSDN website for DMO only seems to work in IE, and poorly at that). I decided to wrap some of the classes to make it easier to use. First, a superclass:

class DBObject
  def initialize(object)
    @object = object
  end

  def name
    return @object.Name unless @object.Name =~ / /
    "[#{@object.Name}]"
  end

  def beautify(text)
    # strip brackets if they're not needed, and remove the .dbo prefix
    text.gsub!(/\[([^ ]*?)\]/, '\1')
    text.gsub(/dbo\./, '')
  end
end

Here I provide one of the keys to generating code that looks appealing. I really don’t like to look at all the noise of bracketing all the tables and column names, just in case the name contains a space in it. The following looks ugly to me:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[tablename] (
    [id] [int] NOT NULL,
    [description] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]

That drives me wild. I don’t ever change any of the settings for ANSI nulls, etc, and I could care less what the collation is so long as it matches the database default. Nor do I care about the file group. It’s hard to see through all the noise.

Here’s what I want:

CREATE TABLE tablename (
    id int NOT NULL,
    description varchar(50) NOT NULL
)

Our name and beautify methods will help us achieve prettier code. Here’s the most important subclass:

class Table < DBObject
  attr_reader :columns, :constraints, :keys, :indexes, :references

  def initialize(table, connection)
    super(table)
    @columns, @constraints, @keys = [], [], []
    table.Columns.each { |col| @columns << Column.new(col) }
    table.Checks.each { |check| @constraints << CheckConstraint.new(check) }
    table.Keys.each { |key| @keys << Key.new(key, connection) }
    get_indexes(table)
    get_references
  end

  private
  def get_references
    @references = []
    @keys.each do |key|
      if "" != key.referenced_table and name != key.referenced_table
        @references << key.referenced_table unless references.include?(key.referenced_table)
      end
    end
  end

  def get_indexes(table)
    @indexes = []
    # the Indexes collection includes objects already in Keys and statistics
    keys = @keys.map { |key| key.name }
    table.Indexes.each do |index|
      if not keys.include?(index.Name)
        if index.Type == 16 || index.Type == 0
          @indexes << Index.new(index)
        end
      end
    end
  end
end

You can find the classes it depends on by downloading all of the code here. Notice, however, that the Key constructor needs a database connection. As far as I could tell, there is no way, using DMO alone, to find out whether a key cascade-deletes; I had to query the INFORMATION_SCHEMA views for that information.
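The cascade lookup is something like the following (the view and column names are SQL Server's standard INFORMATION_SCHEMA; the constraint name here is illustrative):

```sql
-- DELETE_RULE is 'CASCADE' for cascade-deleting keys, 'NO ACTION' otherwise
SELECT DELETE_RULE
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
WHERE CONSTRAINT_NAME = 'FK_Orders_Customers'
```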

For our script to work, we’re going to need to order our dependencies correctly. The SQL script will fail if we try to add a foreign key to a table that doesn’t yet exist. The following should do the trick:

class SqlServer
  # returns a topological sort with parent tables in front of child tables
  def self.topologically_sorted(tables)
    sorted = []

    # We need a hash to navigate the references field, which gives only names
    table_hash = {}
    tables.each { |table| table_hash[table.name] = table }

    # first add all root tables to sorted
    sorted += tables.find_all { |table| 0 == table.references.length }

    while sorted.length < tables.length
      sorted += tables.find_all do |table|
        if sorted.include?(table)
          result = false
        else
          # all dependencies must already be in sorted
          dependencies = table.references.map { |ref| table_hash[ref] }
          result = (nil == dependencies.find { |ref| not sorted.include?(ref) })
        end
        result
      end
    end
    sorted
  end
end
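The same sweep can be sketched standalone, using Structs in place of the DMO-backed Table objects (the table names and references are made up, and, like the original, it assumes the references contain no cycles):

```ruby
# Minimal stand-in for Table: just a name and the names of referenced tables
TableStub = Struct.new(:name, :references)

def topologically_sorted(tables)
  sorted = []
  table_hash = {}
  tables.each { |t| table_hash[t.name] = t }

  # root tables (no foreign keys) go first
  sorted += tables.find_all { |t| t.references.empty? }

  # keep sweeping until every table's parents precede it
  while sorted.length < tables.length
    sorted += tables.find_all do |t|
      next false if sorted.include?(t)
      t.references.all? { |ref| sorted.include?(table_hash[ref]) }
    end
  end
  sorted
end

customers = TableStub.new("customers", [])
products  = TableStub.new("products", [])
orders    = TableStub.new("orders", ["customers"])
items     = TableStub.new("items", ["orders", "products"])

puts topologically_sorted([orders, items, customers, products]).map(&:name).inspect
# → ["customers", "products", "orders", "items"]
```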

Now, our code is as simple as binding to some ERB templates:

require 'erb'
require 'db_object'

class SchemaWriter
  def initialize(server, catalog)
    @db = SqlServer.new server, catalog
  end

  def generate_create_script
    generate_code_for(@db.user_defined_datatypes, "create_udt")
    generate_code_for(@db.rules, "create_rule")
    generate_code_for(SqlServer.topologically_sorted(@db.user_tables), "create_table")
  end

  def generate_code_for(objects, template_name)
    template = File.read(template_name + ".erb")
    objects.each do |object|
      erb = ERB.new(template, nil, '-')
      puts erb.result(binding)
    end
  end
end

if $0 == __FILE__
  writer = SchemaWriter.new(ARGV[0], ARGV[1])
  writer.generate_create_script
end
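The `'-'` argument to ERB.new enables trim mode, which lets the templates use `-%>` to suppress the newline after a line of pure template logic. Here's a self-contained sketch of how a table renders; the column data is made up, and modern Rubies spell the trim-mode argument as a keyword (the positional form above dates from older versions):

```ruby
require 'erb'

# Stand-ins for the DMO-backed wrappers; the data is invented
ColumnStub = Struct.new(:name, :text)
TableStub  = Struct.new(:name, :columns)

object = TableStub.new("tablename",
                       [ColumnStub.new("id", "int NOT NULL"),
                        ColumnStub.new("description", "varchar(50) NOT NULL")])

template = <<~TEMPLATE
  create table <%= object.name %>(
  <% object.columns.each_with_index do |column, i| -%>
      <%= column.name %> <%= column.text %><%= "," unless i == object.columns.length - 1 %>
  <% end -%>
  )
TEMPLATE

# trim_mode: '-' is the modern spelling of ERB.new(template, nil, '-')
puts ERB.new(template, trim_mode: '-').result(binding)
```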

As an example, here’s the create_table.erb template:

create table <%= object.name %>(
<% object.columns.each_with_index do |column, i| -%>
    <%= column.name %> <%= column.text %><%= "," unless i == object.columns.length - 1 %>
<% end -%>
)

<% object.keys.each do |key| -%>
alter table <%= object.name %> add constraint <%= key.name %> <%= key.text %>
<% end -%>
<% object.constraints.each do |constraint| -%>
alter table <%= object.name %> add constraint <%= constraint.name %> <%= constraint.text %>
<% end -%>
<% object.indexes.each do |index| -%>
<%= index.text %>
<% end -%>

Written by Brandon Byars

March 17, 2007 at 3:28 pm

Posted in Code Generation, Database, Ruby
