Skip to content

Latest commit

 

History

History
475 lines (365 loc) · 18.9 KB

README.md

File metadata and controls

475 lines (365 loc) · 18.9 KB

Pattern

Pattern lets you transform data structures in amazing ways.

Start with the intro

The focus is on simplicity and expressivity. It works on immutable persistent data, greatly simplifying development and debugging of transformations.

Current release on Clojars

pangloss/pattern {:git/url "https://github.com/pangloss/pattern"
                  :sha "<use the current commit hash>"}

Video

See my Nov 2022 talk for London Clojurians

How can this be used?

Here are a few examples of how it's been used already:

  • Create simple and clear macros
  • Create an infix math macro in about 30 LOC
  • Define the simplification rules of a computer algebra system (see a very similar engine in use in SICMUtils)
  • Create a Python to Clojure source-to-source converter in under 400 LOC
  • Compile Scheme to X86 assembly in about 1500 LOC

The primary tools in Pattern are the rule and rule combinator. To understand how to use those, it's important to first introduce matcher patterns and substitution.

How does it work?

Pattern is an collection of tools for pattern matching and substitution. Those tools can be extended and combined in flexible ways.

The match pattern looks similar to the data structure it's matching on. That enables a very intuitive understanding of how the pattern will apply to the data being matched.

A transformation is a match pattern with a substitution pattern. Combined, that is called a rule.

The substitution pattern uses exactly the same syntax as the match pattern. Substitution supports all of core functionality of the matcher. That makes it much simpler to understand the behavior of a rule.

In the spirit of Clojure, pattern maintains a high degree of simplicity in the interface.

Pattern Matching

(require '[pattern :refer [compile-pattern]])

The core of Pattern is the pattern matcher, which can be used to match any kind of Clojure data structure. To use the pattern matcher, call (compile-pattern my-pattern). That returns a function that you can call with the data you want to extract matches from. The return value is a map from match names to matching data.

Here is a simple example that matches the first and last elements of a vector if the central elements are 1 2 3.

(def m (compile-pattern '[?before 1 2 3 ?after])

(m [0 1 2 3 4]) ; => {:before 0 :after 4}

Even a simple pattern encodes sophisticated matching logic. For instance you could replicate the above matcher with this code. As you can see, it's more complex, less clear in intention, and easier to make a mistake:

(def m #(when (vector? %)
          (if (and (= 5 (count %))
                (= [1 2 3] (subvec % 1 4)))
            {:before (first %)
             :after (last %)})))

(m [0 1 2 3 4]) ; => {:before 0 :after 4}

The advantages of these pattern matchers become even more apparent as the complexity of pattern increases.

Unification Within a Pattern

If multiple matchers in a pattern have the same name, the values they match must unify.

Unification increases the sophistication of patterns that can be defined.

(def m (compile-pattern '[?fn-name (fn ?fn-name [??args] ??body)]))

(m ['abc '(fn abc [x] x)]) ; => {:fn-name 'abc ...}
(m ['xyz '(fn abc [x] x)]) ; => nil

Unification works across different matcher types. The pattern [?list [??list 3]], could match [[1 2] [1 2 3]] or [[:x] [:x 3]], etc.

Pattern Matchers Available

Above I showed 2 matchers: match-list and match-element matching against a fixed-length vector. Much more flexibility is available.

The pattern could be changed to [??before 1 2 3 ??after], for instance. Now it would match a vector of any length as long as it contains the sequence 1 2 3.

(def m* (compile-pattern '[??before 1 2 3 ??after]))

(m* [1 2 3])       ; => {:before [] :after []}
(m* [0 1 2 3 4 5]) ; => {:before [0] :after [4 5]}

Patterns also do not need to be flat. For instance, I can match against Clojure S-Expressions. Here I'll introduce a three more matcher types:

(def let-bindings (compile-pattern '(let [(?:* (? binding* symbol?) ?_)] ??_)))

(let-bindings '(let [datum (first data)
                     c (count datum)]
                 ...))
;; => {:binding* [datum c]}

?_ and ??_ are special cases of unnamed matchers. These matchers do not unify with each other, and the value matched by them is not returned in the match results. They are useful for ignoring parts of the input data that are not relevant to what you are trying to do. Anywhere you can provide a matcher name, using _ as the name has the same behavior.

The ?:* matcher is match-many. It matches a repeated subsequence within the sequence it is placed in. In this case, that allows it to handle the characteristic flattened Clojure binding pattern.

The (? ...) matcher is the full form of match-element that we saw earlier. In this case, it is equivalent to ?binding*, except that here I gave it a predicate function. The matcher will only succeed if the predicate returns a truthy value when called with the matched value.

That means the following does not match:

(let-bindings '(let [42 (first data)
                     c (count datum)]
                 ...))
;; => nil

There are a lot more matchers and the list is gradually expanding with ever more weird and wonderful behaviors.

Each matcher in the list has a matcher implementation function with detailed documentation in the pattern.matchers namespace.

(better docs coming eventually?!)

Matcher Implementation Notes
? match-element Match a single element
?? match-segment Match 0 or more elements
??! Same as ??, but greedy
{} ?:map match-map Match a map with specific keys
#{} ?:set match-*set Match items in a set
?:*map/?:map* match-*map Match each key-value pair in a map
?:+map/?:map+ match-+map Like ?:*map, but require at least one match
?:as match-as Capture an entire sub pattern if the contained pattern matches
?:? match-optional Match 0 or 1 instance
?:1 match-one Match exactly 1 instance
?:* match-many Match any number of instances
?:+ match-at-least-one Match at least one instance
?:n match-n-times Match the given sequence exactly n times.
?:nth
?:item Match the first matching item from a sequence
?:chain match-chain Alternate between patterns and functions on the matched data
??:chain Same as ?:chain but capture sequence data
| match-or Match alternative patterns on the same data
& match-and Match all patterns on the same data
?:= match-literal Match a literal value (needed if the form looks like a matcher pattern)
?:not match-not Match only if the contained pattern does not
?:when match-when If the predicate, match the pattern
?:if match-if If the predicate, match the then pattern, otherwise the else pattern
?:letrec match-letrec Create named subpatterns that can be used later
$ match-ref Use a named subpattern defined in ?:letrec
?:fresh match-fresh Match a new copy of the given name
?:all-fresh match-all-fresh Don't unify with captures outside the block
?:restartable match-restartable If the pattern doesn't match, raise a condition that can provide a value
?:re-matches match-regex Match with a regex, binding captures
?:re-seq match-regex Multiple matches with a regex, binding captures
?:filter match-filter Match only the items in the list where the predicate returns true
??:filter Inline sequence version of ?:filter
?:remove Match only the items in the list where the predicate returns false
??:remove Inline sequence version of ?:remove

A little about regex matchers

The ?:re-matches pattern takes a second form which is a pattern that matches the structure returned by clojure.core/re-matches:

(let [matcher (pattern/compile-pattern
                '(?:re-matches #"(\d+) (.+)" ?re-matches))]
  (matcher "123 [a b c]"))

That returned structure can itself be matched:

(let [matcher (pattern/compile-pattern
                '(?:re-matches #"(\d+) (.+)"
                   [?full_match ?group1 ?group2]))]
  (matcher "123 [a b c]"))

There is no way that I know of to discover group names, so here foo will not be included in the bound names matched by the compiled pattern. Here, ?foo will bind to 999 just fine, because the named group in the regex is not exposed outside of the regex itself.

(let [matcher (pattern/compile-pattern
                '[(?:re-matches #"(?<foo>\d+) (.+)"
                    [?full_match ?group1 ?group2])
                  ?foo])]
  (matcher ["123 [a b c]" "999"]))

Within the pattern, I can specify the name with ?foo, though. If I do then the two instances must unify. This now fails to match:

(let [matcher (pattern/compile-pattern
                '[(?:re-matches #"(?<foo>\d+) (.+)"
                    [?full_match ?foo ?group2])
                  ?foo])]
  (matcher ["123 [a b c]" "999"]))

Named matches can still be useful in the regex and they together with all features are perfectly acceptable to use. Here the number at the beginning and the end of the string must match for the regex to match.

(let [matcher (pattern/compile-pattern
                '(?:re-matches #"(?<foo>\d+) (.+) \k<foo>"
                   [?full_match ?group2 ?group1]))]
  (matcher "123 [a b c] 123"))

The captured data may have further matchers applied to it. Here I use the ?:chain matcher to parse the second group with read-string, then match within the parsed form.

(let [matcher (pattern/compile-pattern
                '(?:re-matches #"(?<foo>\d+) (.+) \k<foo>"
                   [?full_match
                    ?group2
                    (?:chain ?group1 read-string [?x ?y ?z])]))]
  (matcher "123 [a b c] 123"))

Adding new matchers

The ?:chain rule can often be used to create a custom matcher. This is the implementation of the ?:map one.

This matcher starts with an unnamed matcher that ensures the matched datum is either a map or nil: (? _ (some-fn nil? map?)) Then if that matches, calls seq on that datum, turning the map into a list of [key value] pairs. The result may match either nil if the map is empty, or each key value pair must match the given k and v pattern forms, which are spliced into the matcher here: (| nil ((?:* ~[k v]))).

The generated matcher is then compiled with compile-pattern*, passing along the compilation environment.

Finally, new matchers must be registered with their name and optional aliases.

(defn match-*map
  "Create a ?:*map matcher than can match a key/value pair multiple times."
  [[_ k v] comp-env]
  (compile-pattern* `(~'?:chain
                      (~'? ~'_ ~(some-fn nil? map?))
                      seq
                      (~'| nil ((~'?:* ~[k v]))))
                    comp-env))

(register-matcher '?:*map #'match-*map {:aliases ['?:map*]})

Substitution

The core substitution function is sub. It is actually a macro that expands to be equivalent to just writing the substitution pattern as a literal value.

(let [x 2
      y '[a b c]]
  (sub (* (+ ??y) ?x)))
;; => (* (+ a b c) 2)

Macroexpanded, you can see that the substitution overhead is effectively zero:

(let [x 2
      y '[a b c]] 
  (list '* (list* '+ y) x))

You may notice that instead of passing in a map of values to sub, it effectively looks in the current evaluation context for replacement values. This tends to be very convenient. In practice, 99% of the time I use this method.

For that last 1%, though, a more traditional data-driven substitution method is required. For that, we can use substitute, which interprets and executes the pattern at runtime:

(substitute 
  '(* (+ ??y) ?x)
  {'x 2 'y '[a b c]})
;; => (* (+ a b c) 2)

Rules

Rules are pattern matchers tied to function bodies. The captured data from the matcher becomes the function arguments.

Most frequently the function body is modifying data and thu sub macro is the perfect tool, but any value can be returned by a matching rule.

(rule mult-id
  '(* 1 ?x)
  x)
  
(rule strength-reduce
  '(* 2 ?x)
  (sub (+ ?x ?x)))

Rules are implemented in terms of standard matchers, so any of the above matchers may be used.

A rule or rule combinator can be called like a function.

(def my-rule (rule '(* 2 ?x) (sub (+ ?x ?x))))

(my-rule '(* 2 5)) # => (+ 5 5)

If the rule body returns nil or false, the rule is treated the same as if the pattern didn't match, which lets you do any arbitrary check against what you matched. If you'd really like to to return nil or false, use (success nil) or (success false).

Rule Combinators

Rule combinators allow multiple rules to be used together. Rule combinators can themselves be combined. A rule combinator is itself just a rule. Rules and rule combinators are interchangeable.

The following combinators are documented for now in the pattern.r3.combinators namespace:

Rule combinator Note
rule-list Search the rules from top to bottom. Stop after successfully matching.
in-order Search the rules from top to bottom. Continue with the result of each matching rule.
on-subexpressions Try the given rule on every element in the datastructure in post-walk order
iterated Keep applying the rule until a fixed point is reached
simplifier Like on-subexpressions which iterates at every level
directed Guide the traversal either using the pattern or from within the rule body
scanner Convert any rule combinator to scan through a list or vector

Scanner rules

If you have rules that transform sequential data, especially long ones, their performance can benefit from using the scanner combinator.

Say you're working on data like that (which is the result of processing a string like "Quantities: 5ml or 10gr" using a natural language library):

(def in
  [{:text "Quantities:", :pos :word}
   {:text " ", :pos :ws}
   {:text "5", :pos :number}
   {:text "ml", :pos :unit}
   {:text " ", :pos :ws}
   {:text "or", :pos :word}
   {:text " ", :pos :ws}
   {:text "10", :pos :number}
   {:text "gr", :pos :unit}])

And you'd like to produce group the number and unit tokens into a new map. You may write a rule that looks like so:

(def group-quantities
  (iterated
   (rule group-quantities
     '[??before
       (?:as scalar (?:map :pos :number))
       (?:? (?:map :text ?ws :pos :ws))
       (?:as unit (?:map :pos :unit))
       ??after]
     (vec
      (concat
       before
       [{:pos    :quantity
         :text   (str (:text scalar) ws (:text unit))
         :scalar scalar
         :unit   unit}]
       after)))))

Running it would yield:

> (group-quantities in)
[{:text "Quantities:", :pos :word}
 {:text " ", :pos :ws}
 {:pos :quantity,
  :text "5ml",
  :scalar {:text "5", :pos :number},
  :unit {:text "ml", :pos :unit}}
 {:text " ", :pos :ws}
 {:text "or", :pos :word}
 {:text " ", :pos :ws}
 {:pos :quantity,
  :text "10gr",
  :scalar {:text "10", :pos :number},
  :unit {:text "gr", :pos :unit}}]

The iterated combinator is necessary because without it the rule would match and transform the first pair of number and unit and leave the second untouched. But it also means that after the vector is transformed once, Pattern will restart scanning the elements from the beginning, which is of course wasteful.

For such use cases, the rule can be rewritten with scanner:

(def group-quantities
  (scanner
   (rule group-quantities
     '[(?:as scalar (?:map :pos :number))
       (?:? (?:map :text ?ws :pos :ws))
       (?:as unit (?:map :pos :unit))]
     [{:pos    :quantity
       :text   (str (:text scalar) ws (:text unit))
       :scalar scalar
       :unit   unit}])))

Running this rule results in identical output, but each time the rule matches, Pattern continues on to the rest of the vector instead of restarting from the first element, resulting in better performance. Also, the rule is shorter because you only have to provide matchers for the just the "window" of the sequence you're looking to transform and you don't need to reconstruct the original vector.

Compilation Dialects and Tools

...

Acknowledgements

Pattern is based upon the core ideas described in the excellent book Software Design for Flexibilty by Chris Hanson and Gerald Jay Sussman.

The compilation tools aspect is heavily inspired by the Nanopass Framework which introduces the idea of dialects and was the inspiration for some of the more powerful rule combinators that this library introduces.

License

Copyright © 2022 Darrick Wiebe

Distributed under the Eclipse Public License version 1.0.