Skip to content

ammar/meta_re

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MetaRegexp

Note: this project has not been released as a gem, but might be someday.

What is it

A regular expression “pre-processor” that performs the following transformations:

  • Repetition Expansion: expands repeated grouped expressions to enable repeated captures, without manual duplication.

  • Aliases: recognizes and substitutes registered regular expression aliases at any depth (i.e. aliases can contain aliases), including the expansion of any grouped expressions within them.

The following sections cover these transformations in more detail.

Repetition Expansion

Using the built-in Regexp class a regular expression that contains repeated (i.e. quantified) capturing groups only gets its last match captured into the resulting MatchData. For example, given the following pattern and input text:

regex = /([a-z]+\s?){1,3}/
input  = "one two three"

The built-in Regexp class produces:

Regexp.new( regex ).match( input ).captures
#=> ["three"]

With MetaRexgep’s repetition expansion, the same expression captures all the grouped expressions in the MatchData:

MetaRegexp.new( regex ).match( input ).captures
#=> ["one ", "two ", "three"]

The example above is very simple (the same can be done with scan /w+/) but it was purposefully chosen to quickly illustrate the benefit of expansion. A more elaborate example is presented below.

How Expansion Works

Expansion is accomplished by parsing the given expression, string or Regexp object, collecting information about the grouped expressions within it and their minimum and maximu repetitions into a “tree” structure. This is then recompiled into a new string that is used to create a new Regexp object.

So, given the following regular expression:

/([a-z]+){1,3}/

MetaRegexp transforms it into the following:

/([a-z]+)([a-z]+)?([a-z]+)?/

Now the pattern contains actual groups that will be captured during the matching process. Note how the last two instances of the original pattern are marked as optional matches with the zero-or-one meta character ‘?’. The minimum is one (required) and the maximum is 3, i.e the last two matches are optional.

This type of expansion is usually done ‘manually’ by interpolating or typing the desired capture patterns multiple times. In addition to not being DRY, this approach makes the resulting expressions harder to edit and read.

Notes

* The "or more" quantifiers, zero or more (*), and one or more (+) use a
  default value of 12 for the maximum number of repetitions. A different
  value can be specified as the second argument to #new of MetaRegexp.

* To avoid over expansion with * and +, use the interval (a.k.a range)
  quantifier ({min,max}) for repeating groups.

* Only greedy and reluctant (lazy) quantifiers are supported and tested
  so far.

* Possible matches that do not occur appear as nil values in the captures
  array. See the MatchData class in meta_re.rb for more info and ways to
  deal with these.

An Example

Here’s a more elaborate example, this time for matching HTML tags. It is somewhat simplified in terms of the tag and attribute names to make it eaiser to follow.

Given the following input and regular expression:

# input text
tag = '<input type="text" value="hello" id="body" onclick="click_text();" />'

# regular expression
rex = /<[a-z]+\s+(([a-z]+)="([^"]+)"\s*){1,8}\s*\/>/

With Regexp:

Regexp.new( rex ).match( tag ).captures

# captures
["onclick=\"click_text();\" ", "onclick", "click_text();"]

Now with MetaRegexp:

MetaRegexp.new( rex ).match( tag ).captures

# captures, formatted for readability:
[
  "type=\"text\" ", "type", "text",
  "value=\"hello\" ", "value", "hello",
  "id=\"body\" ", "id", "body",
  "onclick=\"click_text();\" ", "onclick", "click_text();",
  nil, nil, nil,
  nil, nil, nil,
  nil, nil, nil,
  nil, nil, nil
]

The nil values that appear at the end are for possible matches that did not occur. See the notes (above) for more on this.

Aliases

Aliases are like macros, a way to define named regular expression patterns that can be used on their own or within larger patterns.

This type of pattern reuse is usually done with constants and simple string concatination, but it is only effective for one level of depth. With MetaRegexp aliases, new aliases can be defined in terms of other aliases. This allows for building more granural “libraries” of reusable patterns.

Aliasing Example

Aliasing is off by default. It can be enabled with:

MetaRegexp.aliasing true

Next define some aliases, smaller patterns are used to build larger patterns:

MetaRegexp.alias :name,      /[a-z][-a-z0-9_]+/
MetaRegexp.alias :scheme,    /https?|ftps?/
MetaRegexp.alias :tld,       'com|net|org|edu'
MetaRegexp.alias :domain,    /((?:(@name)\.){1,4}(@tld))/
MetaRegexp.alias :uri,       '@scheme://@domain'

To use an alias, prepend its name with an at sign (@):

>> MetaRegexp.new /@domain/
=> /((?:([a-z][-a-z0-9_]+)\.)(?:([a-z][-a-z0-9_]+)\.)?(?:([a-z][-a-z0-9_]+)
   \.)?(?:([a-z][-a-z0-9_]+)\.)?(com|net|org|edu))/

>> MetaRegexp.new '@uri' 
=> /(?:https?|ftps?):\/\/(?:((?:([a-z][-a-z0-9_]+)\.)(?:([a-z][-a-z0-9_]+)
   \.)?(?:([a-z][-a-z0-9_]+)\.)?(?:([a-z][-a-z0-9_]+)\.)?(com|net|org|edu)))/

Naming Alias Captures

In addition to the normal alias resolution, where aliases are replaced with their defined values as is, they can also be wrapped with named groups that use the same name as the alias. To enable named alias groups use:

MetaRegexp.name_groups true

>> MetaRegexp.new '@scheme'
=> /(?<scheme>https?|ftps?)/

Now matches for that alias can be accessed by name from the match data.

Known Probelm

Aliases that contain repetitions will result in multiple groups having the same name. Obviously these matches will overwrite one another on the captures list. This will be corrected in a future version.

Copyright © 2010 Ammar Ali. See LICENSE file for details.

About

An experimental regular expression "preprocessor"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages