Overview of rexical part 2

Taken from Jeff Nyman's post - A Tester Learns Rex and Racc, Part 2

Making your lexer testable

In my previous post on this subject I started off on the learning process for building a lexer with the Rex (Rexical) tool. Here I want to update the logic I provided in that post to show how to make it more testable. I then want to expand on the example of using Rex with something a bit more substantive.

If you are following along from the last post, you would have the following in place:

  • A project directory called test_language.
  • A file in that directory called test_language.rex.
  • A file in that directory called test_language.rb.

As I suggested there, that simple test Ruby script was not a good way to test; an actual test file would be better. So here I’ll describe how I put that in place.

Within your project directory create another directory called spec.

  • Move your test_language.rb file into the spec directory.
  • Rename the test_language.rb file to language_lexer_spec.rb.
  • Create a file in your project directory called Rakefile.
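
Once everything is in place (including the lexer.rb file we’ll generate in a moment), the project layout will look roughly like this:

test_language/
├── Rakefile
├── test_language.rex
├── lexer.rb                  (generated by rex)
└── spec/
    └── language_lexer_spec.rb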

Put the following logic in the Rakefile:

require 'rspec/core/rake_task'
 
RSpec::Core::RakeTask.new do |c|
  options = ['--color']
  options += ["--format", "documentation"]
  c.rspec_opts = options
end
 
desc "Generate Lexer"
task :lexer do
  `rex test_language.rex -o lexer.rb`
end

Now you can generate your lexer just by typing:

rake lexer

Let’s modify that language_lexer_spec.rb file so it looks like this:

require './lexer'
 
class TestLanguageTester
  describe 'Testing the Lexer' do
    before do
      @evaluator = TestLanguage.new
    end
    
    it 'tests for a single u' do
      result = @evaluator.tokenize("u")
      result[0].should == "Single u."
    end
  end
end

Note here that I’m requiring the generated lexer.rb file rather than the rex source. Also note that in the test I’m checking an index in the result array. That’s because what gets returned from tokenize is an array of tokens. In order for this test to work, however, you will have to change the logic that gets executed by your rules in the lexer specification. So change the test_language.rex file so that the rules look like this:

class TestLanguage
rule
  uuu { return "Triple u." }
  uu  { return "Double u." }
  u   { return "Single u." }
  .   { return "Could not match." }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

Here I just changed the Ruby logic to return string values rather than simply outputting them with the puts method. You can now run your test as follows (the RSpec rake task defined in the Rakefile gets the default task name spec):

rake spec

Now, of course, you can add as many tests as you want:

require './lexer'
 
class TestLanguageTester
  describe 'Testing the Lexer' do
    before do
      @evaluator = TestLanguage.new
    end
    
    it 'tests for a single u' do
      result = @evaluator.tokenize("u")
      result[0].should == "Single u."
    end
    
    it 'tests for a double u' do
      result = @evaluator.tokenize("uu")
      result[0].should == "Double u."
    end
    
    it 'tests for a triple u' do
      result = @evaluator.tokenize("uuu")
      result[0].should == "Triple u."
    end
    
    it 'tests for a no match' do
      result = @evaluator.tokenize("y")
      result[0].should == "Could not match."
    end
  end
end

Granted, all of this is pretty much Test First in Ruby 101, but since this was something I didn’t do in the first post, I wanted to make sure to cover this here.

Okay, so let’s do something a little different with the lexer. Change the test_language.rex file so that your rules look like this:

class TestLanguage
rule
  \d+   { [:DIGIT, text.to_i] }
  \w+   { [:WORD, text] }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

Now you could replace your previous tests with tests like these:

 ...
    it 'tests for a digit' do
      result = @evaluator.tokenize("12")
      result[0][0].should == :DIGIT
      result[0][1].should == 12
    end
    
    it 'tests for a word' do
      result = @evaluator.tokenize("testing")
      result[0][0].should == :WORD
      result[0][1].should == "testing"
    end
  ...

Notice that with these rules, I’m using regular expressions rather than the mostly literal patterns I used in the “u” example. I’m also specifying an identifier (:DIGIT and :WORD) that indicates what type of match has been made. I could have used any terms I wanted there, but being usefully descriptive is obviously helpful. What you can now do is explore with different tests to see what happens. What you might find interesting, for example, are these two other tests:

  ...
    it 'tests for a digit with text' do
      result = @evaluator.tokenize("12test")
      result[0][0].should == :DIGIT
      result[0][1].should == 12
    end
    
    it 'tests for text with a digit' do
      result = @evaluator.tokenize("test12")
      result[0][0].should == :WORD
      result[0][1].should == "test12"
    end
  ...

Again, play around a bit. This is how I’ve been learning how the lexer part works. In fact, as a further means of playing around, you might want to swap the placement of the rules in the lexer specification and see what happens when you run your tests. Look for what fails and try to determine why.
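
To give part of the answer away: as best I can tell, the generated scanner tries the rules in the order they appear in the specification, and the first pattern that matches wins. So if you list the \w+ rule above the \d+ rule, a test like this should pass in place of the earlier digit test (a sketch, assuming that swapped ordering):

it 'tests that rule order matters' do
  # With \w+ listed first, the input "12" is consumed by \w+
  # (\w matches digits too), so we get a WORD token, not a DIGIT
  result = @evaluator.tokenize("12")
  result[0][0].should == :WORD
  result[0][1].should == "12"
end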

Sometimes you can run into weird stuff that seems to be an artifact of how the logic of the lexer gets generated. For example, let’s say you want to recognize blank space so you add another rule:

class TestLanguage
rule
  [\ \t]+ # no action
  \d+     { [:DIGIT, text.to_i] }
  \w+     { [:WORD, text] }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

You will be able to generate the lexer without a problem, but when you try to run the tests you will be told that the generated lexer.rb file has some errors. I’m not entirely sure why this is at the moment (my unverified guess is that the literal space inside the character class confuses how rex splits a rule line into pattern and action), but it does give me a chance to introduce another section, similar to the inner section but with a different purpose. You can create macros, which are basically identifiers that stand for a pattern. For example, with the above example, I could do this instead:

class TestLanguage
macro
  BLANK   [\ \t]+
 
rule
  {BLANK} # no action
  \d+     { [:DIGIT, text.to_i] }
  \w+     { [:WORD, text] }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

Note that in order to use a macro in a rule, you must enclose the identifier of the macro within curly braces. The macro concept is really just giving a name to a common pattern that you want to use in the rule section. You could add the following test for this:

  ...
    it 'tests for spaces' do
      result = @evaluator.tokenize("   ")
      result[0].should == nil
    end
  ...

Here I’m testing for nil because no action was specified for this rule: the whitespace match produces no token, so the tokens array stays empty and result[0] comes back as nil.
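
If you want more evidence that the blanks are being skipped rather than turned into tokens, a test along these lines should also pass (a sketch, assuming the digit rule from above is still in place):

it 'tests that blanks between tokens are skipped' do
  result = @evaluator.tokenize("12 34")
  result[0].should == [:DIGIT, 12]
  result[1].should == [:DIGIT, 34]
end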

What you should note from this is that one or more sections are included within the lexer class specification. In this example, we have macro, rule, and inner. Of those, only the rule section is required. A rule, at its simplest, is made up of a pattern to look for and the action to take when that pattern is found. The pattern can be a literal string to look for or it can be a regular expression. The action is any valid Ruby code. This code can do any necessary processing and can optionally return a value. (If you are using a parser, like Racc, with your lexer, then that output can be used by the parser.)
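
Put schematically, a rex specification has this overall shape (just a sketch of the layout, not something to feed to rex as-is):

class MyLexer
macro
  NAME     pattern           # named patterns, referenced as {NAME} in rules
rule
  pattern  { Ruby action }   # required section; omit the action to skip a match
inner
  # plain Ruby methods copied into the generated class (like tokenize above)
end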

Now let’s do something that’s a little more exciting or, if not that, at least a little more substantive. This won’t be entirely unique, since just about everyone builds one when learning this stuff, but: let’s build a calculator. We already have the basis for how to do some of this, so this will just be an expansion of what we’ve already done. It will also let us start getting into Racc. I’ll assume by now that you have the basics of how I create these rex files and test files, so I’ll just give file contents rather than details about file names or directory structures.

So let’s say you have a Rex file like this:

class TestLanguage
macro
  DIGIT     \d+
  ADD       \+
  SUBTRACT  \-
  MULTIPLY  \*
  DIVIDE    \/
 
rule
  {DIGIT}     { [:DIGIT, text.to_i] }
  {ADD}       { [:ADD, text] }
  {SUBTRACT}  { [:SUBTRACT, text] }
  {MULTIPLY}  { [:MULTIPLY, text] }
  {DIVIDE}    { [:DIVIDE, text] }
 
inner
  def tokenize(code)
    scan_setup(code)
    tokens = []
    while token = next_token
      tokens << token
    end
    tokens
  end
end

You can perform some simple recognition tests with the following:

it "tests for a digit" do
  result = @evaluator.tokenize("2")
  result[0][0].should == :DIGIT
  result[0][1].should == 2
end
 
it "tests for a symbol" do
  result = @evaluator.tokenize("+")
  result[0][0].should == :ADD
  result[0][1].should == "+"
end

That all seems to work as expected. But what about this test:

it "tests for a calculation" do
  result = @evaluator.tokenize("2+2")
  result[0][0].should == ????
  result[0][1].should == ????
end

That’s what we would want a calculator to be able to handle, right? I want my calculator to actually calculate some value as long as appropriate symbols and numbers are used. So what do you expect to happen? What should you fill in those ???? with in the tests? In fact, what you will find is that you have :DIGIT and 2 as the returned values, because that’s the first match found in the string “2+2”.

That makes sense because, remember, the lexer is just reading the input and trying to find matches. It does find all three matches here, but each one is stored as its own array within the result. So the test should really be this:

it "tests for a calculation" do
  result = @evaluator.tokenize("2+2")
  result[0][0].should == :DIGIT
  result[0][1].should == 2
  result[1][0].should == :ADD
  result[1][1].should == "+"
  result[2][0].should == :DIGIT
  result[2][1].should == 2
end
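
Since each token is itself a little [type, value] array, an equivalent and slightly tighter way to write that expectation is to compare the whole result at once:

it "tests for a calculation" do
  result = @evaluator.tokenize("2+2")
  result.should == [[:DIGIT, 2], [:ADD, "+"], [:DIGIT, 2]]
end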

So that’s great. But how do we actually get a calculation from this? How do we get something that solves 2+2 and returns the value 4? Well, the lexer just recognizes the symbols. What we need to do is parse those tokens and take action based on them. That, finally, brings us to Racc, and that will have to wait for a different post. Again, though, I encourage you to play around with the lexer. The lexer specification is going to be the basis of any language you want to create, so getting comfortable with expressing that language as a rex specification, and with seeing how to pull information from an input, will serve you well as you go into the parsing aspects.
