rust-lang · alexcrichton · Apr 22, 2014 · Apr 13, 2014 · Apr 13, 2014 · Apr 13, 2014
diff --git a/active/0000-regexps.md b/active/0000-regexps.md
@@ -0,0 +1,279 @@
+- Start Date: 2014-04-12
+- RFC PR #: (leave this empty)
+- Rust Issue #: (leave this empty)
+
+# Summary
+
+Add a `regexp` crate to the Rust distribution in addition to a small 
+`regexp_macros` crate that provides a syntax extension for compiling regular 
+expressions during the compilation of a Rust program.
+
+The implementation that supports this RFC is ready to receive 
+feedback: https://github.com/BurntSushi/regexp
+
+Documentation for the crate can be seen here:
+http://burntsushi.net/rustdoc/regexp/index.html
+
+regex-dna benchmark (vs. Go, Python): 
+https://github.com/BurntSushi/regexp/tree/master/benchmark/regex-dna
+
+Other benchmarks (vs. Go): 
+https://github.com/BurntSushi/regexp/tree/master/benchmark
+
+(Perhaps the links should be removed if the RFC is accepted, since I can't 
+guarantee they will always exist.)
+
+# Motivation
+
+Regular expressions provide a succinct method of matching patterns against 
+search text and are frequently used. For example, many programming languages 
+include some kind of support for regular expressions in its standard library.
+
+The outcome of this RFC is to include a regular expression library in the Rust 
+distribution and resolve issue
+[#3591](https://github.com/mozilla/rust/issues/3591).
+
+# Detailed design
+
+(Note: This is describing an existing design that has been implemented. I have 
+no idea how much of this is appropriate for an RFC.)
+
+The first choice that most regular expression libraries make is whether or not 
+to include backreferences in the supported syntax, as this heavily influences 
+the implementation and the performance characteristics of matching text.
+
+In this RFC, I am proposing a library that closely models Russ Cox's RE2 
+(either its C++ or Go variants). This means that features like backreferences 
+or generalized zero-width assertions are not supported. In return, we get 
+`O(mn)` worst case performance (with `m` being the size of the search text and 
+`n` being the number of instructions in the compiled expression).
+
+My implementation currently simulates an NFA using something resembling the 
+Pike VM. Future work could possibly include adding a DFA. (N.B. RE2/C++ 
+includes both an NFA and a DFA, but RE2/Go only implements an NFA.)
+
+The primary reason why I chose RE2 was that it seemed to be a popular choice in 
+issue [#3591](https://github.com/mozilla/rust/issues/3591), and its worst case 
+performance characteristics seemed appealing. I was also drawn to the limited 
+set of syntax supported by RE2 in comparison to other regexp flavors.
+
+With that out of the way, there are other things that inform the design of a 
+regexp library.
+
+## Unicode
+
+Given the already existing support for Unicode in Rust, this is a no-brainer. 
+Unicode literals should be allowed in expressions and Unicode character classes 
+should be included (e.g., general categories and scripts).
+
+Case folding is also important for case insensitive matching. Currently, this 
+is implemented by converting characters to their uppercase forms and then 
+comparing them. Future work includes applying at least a simple fold, since 
+folding one Unicode character can produce multiple characters.
+
+Normalization is another thing to consider, but like most other regexp 
+libraries, the one I'm proposing here does not do any normalization. (It seems 
+the recommended practice is to do normalization before matching if it's 
+needed.)
+
+A nice implementation strategy to support Unicode is to implement a VM that 
+matches characters instead of bytes. Indeed, my implementation does this.
+However, the public API of a regular expression library should expose *byte 
+indices* corresponding to match locations (which ought to be guaranteed to be 
+UTF8 codepoint boundaries by construction of the VM). My reason for this is 
+that byte indices result in a lower cost abstraction. If character indices are 
+desired, then a mapping can be maintained by the client at their discretion.
+
+Additionally, this makes it consistent with the `std::str` API, which also 
+exposes byte indices.
+
+## Word boundaries, word characters and Unicode
+
+At least Python and D define word characters, word boundaries and space 
+characters with Unicode character classes. My implementation does the same
+by augmenting the standard Perl character classes `\d`, `\s` and `\w` with
+corresponding Unicode categories.
+
+## Leftmost-first
+
+As of now, my implementation finds the leftmost-first match. This is consistent 
+with PCRE style regular expressions.
+
+I've pretty much ignored POSIX, but I think it's very possible to add 
+leftmost-longest semantics to the existing VM. (RE2 supports this as a 
+parameter, but I believe still does not fully comply with POSIX with respect to 
+picking the correct submatches.)
+
+## Public API
+
+There are three main questions that can be asked when searching text:
+
+1. Does the string match this expression?
+2. If so, where?
+3. Where are its submatches?
+
+In principle, an API could provide a function to only answer (3). The answers 
+to (1) and (2) would immediately follow. However, keeping track of submatches 
+is expensive, so it is useful to implement an optimization that doesn't keep 
+track of them if it doesn't have to. For example, submatches do not need to be 
+tracked to answer questions (1) and (2).
+
+The rabbit hole continues: answering (1) can be more efficient than answering 
+(2) because you don't have to keep track of *any* capture groups ((2) requires 
+tracking the position of the full match). More importantly, (1) enables early 
+exit from the VM. As soon as a match is found, the VM can quit instead of 
+continuing to search for greedy expressions.
+
+Therefore, it's worth it to segregate these operations. The performance 
+difference can get even bigger if a DFA were implemented (which can answer (1) 
+and (2) quickly and even help with (3)). Moreover, most other regular 
+expression libraries provide separate facilities for answering these questions
+separately.
+
+Some libraries (like Python's `re` and RE2/C++) distinguish between matching an 
+expression against an entire string and matching an expression against part of 
+the string. My implementation favors simplicity: matching the entirety of a 
+string requires using the `^` and/or `$` anchors. In all cases, an implicit 
+`.*?` is added the beginning and end of each expression evaluated. (Which is 
+optimized out in the presence of anchors.)
+
+Finally, most regexp libraries provide facilities for splitting and replacing 
+text, usually making capture group names available with some sort of `$var` 
+syntax. My implementation provides this too. (These are a perfect fit for 
+Rust's iterators.)
+
+This basically makes up the entirety of the public API, in addition to perhaps 
+a `quote` function that escapes a string so that it may be used as a literal in 
+an expression.
+
+## The `regexp!` macro
+
+With syntax extensions, it's possible to write an `regexp!` macro that compiles 
+an expression when a Rust program is compiled. This includes translating the
+matching algorithm to Rust code specific to the expression given. This "ahead 
+of time" compiling results in a performance increase. Namely, it elides all 
+heap allocation.
+
+I've called these "native" regexps, whereas expressions compiled at runtime are 
+"dynamic" regexps. The public API need not impose this distinction on users, 
+other than requiring the use of a syntax extension to construct a native 
+regexp. For example:
+
+    let re = regexp!("a*");
+
+After construction, `re` is indistinguishable from an expression created 
+dynamically:
+
+    let re = Regexp::new("a*").unwrap();
+
+In particular, both have the same type. This is accomplished with a 
+representation resembling:
+
+    enum MaybeNative {
+        Dynamic(~[Inst]),
+        Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
+    }
+
+This syntax extension requires a second crate, `regexp_macros`, where the 
+`regexp!` macro is defined. Technically, this could be provided in the `regexp` 
+crate, but this would introduce a runtime dependency on `libsyntax` for any use 
+of the `regexp` crate.
+
+[@alexcrichton 
+remarks](https://github.com/rust-lang/rfcs/pull/42#issuecomment-40320112)
+that this state of affairs is a wart that will be corrected in the future.
+
+## Untrusted input
+
+Given worst case `O(mn)` time complexity, I don't think it's worth worrying 
+about unsafe search text.
+
+Untrusted regular expressions are another matter. For example, it's very easy 
+to exhaust a system's resources with nested counted repetitions. For example,
+`((a{100}){100}){100}` tries to create `100^3` instructions. My current 
+implementation does nothing to mitigate against this, but I think a simple hard 
+limit on the number of instructions allowed would work fine. (Should it be 
+configurable?)
+
+## Name
+
+The name of the crate being proposed is `regexp` and the type describing a 
+compiled regular expression is `Regexp`. I think an equally good name would be 
+`regex` (and `Regex`). Either name seems to be frequently used, e.g., "regexes" 
+or "regexps" in colloquial use. I chose `regexp` over `regex` because it 
+matches the name used for the corresponding package in Go's standard library.
+
+Other possible names are `regexpr` (and `Regexpr`) or something with 
+underscores: `reg_exp` (and `RegExp`). However, I perceive these to be more 
+ugly and less commonly used than either `regexp` or `regex`.
+
+Finally, we could use `re` (like Python), but I think the name could be 
+ambiguous since it's so short. `regexp` (or `regex`) unequivocally identifies 
+the crate as providing regular expressions.
+
+For consistency's sake, I propose that the syntax extension provided be named
+the same as the crate. So in this case, `regexp!`.
+
+## Summary
+
+My implementation is pretty much a port of most of RE2. The syntax should be 
+identical or almost identical. I think matching an existing (and popular) 
+library has benefits, since it will make it easier for people to pick it up and 
+start using it. There will also be (hopefully) fewer surprises. There is also 
+plenty of room for performance improvement by implementing a DFA.
+
+# Alternatives
+
+I think the single biggest alternative is to provide a backtracking 
+implementation that supports backreferences and generalized zero-width 
+assertions. I don't think my implementation precludes this possibility. For 
+example, a backtracking approach could be implemented and used only when 
+features like backreferences are invoked in the expression. However, this gives 
+up the blanket guarantee of worst case `O(mn)` time. I don't think I have the 
+wisdom required to voice a strong opinion on whether this is a worthwhile 
+endeavor.
+
+Another alternative is using a binding to an existing regexp library. I think 
+this was discussed in issue 
+[#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people 
+favor a native Rust implementation if it's to be included in the Rust 
+distribution. (Does the `regexp!` macro require it? If so, that's a huge 
+advantage.) Also, a native implementation makes it maximally portable.
+
+Finally, it is always possible to persist without a regexp library.
+
+# Unresolved questions
+
+The public API design is fairly simple and straight-forward with no 
+surprises.  I think most of the unresolved stuff is how the backend is 
+implemented, which should be changeable without changing the public API (sans 
+adding features to the syntax).
+
+I can't remember where I read it, but someone had mentioned defining a *trait* 
+that declared the API of a regexp engine. That way, anyone could write their 
+own backend and use the `regexp` interface. My initial thoughts are 
+YAGNI---since requiring different backends seems like a super specialized 
+case---but I'm just hazarding a guess here. (If we go this route, then we
+might want to expose the regexp parser and AST and possibly the 
+compiler and instruction set to make writing your own backend easier. That 
+sounds restrictive with respect to making performance improvements in the 
+future.)
+
+I personally think there's great value in keeping the standard regexp 
+implementation small, simple and fast. People who have more specialized needs 
+can always pick one of the existing C or C++ libraries.
+
+For now, we could mark the API as `#[unstable]` or `#[experimental]`.
+
+# Future work
+
+I think most of the future work for this crate is to increase the performance, 
+either by implementing different matching algorithms (e.g., a DFA) or by 
+improving the code generator that produces native regexps with `regexp!`.
+
+If and when a DFA is implemented, care must be taken when creating a code 
+generator, as the size of the code required can grow rapidly.
+
+Other future work (that is probably more important) includes more Unicode 
+support, specifically for simple case folding.
+