Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile-time regex for smaller WASM binary size #913

Closed
amatveiakin opened this issue Oct 12, 2022 · 13 comments
Closed

Compile-time regex for smaller WASM binary size #913

amatveiakin opened this issue Oct 12, 2022 · 13 comments
Labels

Comments

@amatveiakin
Copy link

I would like to once again raise the question of compile-time regular expressions. They were once provided by regex-macros crate, but are no longer supported.

If I understand correctly, compiled-time regular expressions were deems unnecessary because their primary use-case was compile-time pattern verification, which can now be achieved by other means, like invalid_regex Clippy lint or lazy-regex crate.

But there is another reason one might want to avoid including a regex compiler into the application, and that is binary size. This becomes especially important with WASM targets. The increase in WASM binary size from the regex crate is very noticeable. In my experiments it varies from ~200 KiB to ~600 KiB depending on the features. I made a small demo, check it out to see for yourself: https://github.com/amatveiakin/regex-wasm-demo

Even 200 KiB is a lot for a web app, and should best be avoided. And I don't think there is a better way to do this than building the regex at compile-time. What do you think?

@BurntSushi
Copy link
Member

BurntSushi commented Oct 12, 2022

compiled-time regular expressions were deems unnecessary because their primary use-case was compile-time pattern verification

Hmmm, I'm not sure that's quite accurate. I deemed them mostly unnecessary for the use case of wanting them to be statically verified, but I don't think I meant to say that the primary use case of compile time regexes themselves was "just" verification. In particular, when I originally added regex!, I was far more interested in its perf advantages. We can go all the way back to 2014 where I posted benchmarks comparing "dynamic" regexes with "natively compiled" or "compile time" regexes: rust-lang/rfcs#42 (comment)

But there is another reason one might want to avoid including a regex compiler into the application, and that is binary size.

I agree that this would substantially reduce binary sizes in a large number of cases. Do note though that it isn't that hard to write a regex that, when compiled, would itself be bigger than the entire regex crate. It doesn't even require writing a pathological regex because Unicode really bloats things up. And that's just a single regex. As soon as you start writing a lot of regexes, it's not hard for those to combine into something quite large.

But the good news is that this is possible to do today, although the flow for doing it has not been streamlined yet. Specifically, see:

bstr generates DFAs and embeds them into the binary. When you build bstr, you only need to bring in the regex runtime, not the compiler. Do note though that bstr uses regex-automata 0.1 and the docs linked above are for regex-automata 0.2. And ucd-generate only works for regex-automata 0.1. So if you want to use regex-automata 0.2, you'll need to write your own little CLI generator, but it shouldn't be too hard to port it from the regex-automata 0.1 generator in ucd-generate.

All of this stuff is a WIP which is why it's a bit of a mess right now. This use case will be fleshed out a bit more with better tooling and less confusion Real Soon Now.

It's also worth pointing out that the compile time regexes that once existed didn't really reduce binary size. In particular, those compile time regexes were "just" a specialization of the PikeVM. It avoided some runtime costs in the "dynamic" PikeVM, but it still relied on the the NFA internals and all that. The way regex-automata works, it builds full DFAs, which are just a chunk of bytes. The regex runtime needed for reading and searching the DFA is indeed quite small.

Another approach would be just compiling the NFA, which would be much much smaller than a DFA in many cases. But that requires more work.

And I don't think there is a better way to do this than building the regex at compile-time. What do you think?

It really really depends. One can build a much smaller regex library by giving up some things, notably, perf and failure modes. For the sake of argument, if I said I could hand you a regex engine that was an order of magnitude smaller (say, 20KB) but was also an order of magnitude slower, would you be happy about that?

@amatveiakin
Copy link
Author

I don't think I meant to say that the primary use case of compile time regexes themselves was "just" verification. In particular, when I originally added regex!, I was far more interested in its perf advantages. We can go all the way back to 2014 where I posted benchmarks comparing "dynamic" regexes with "natively compiled" or "compile time" regexes: rust-lang/rfcs#42 (comment)

I see, thanks for the clarification! The benchmark results look impressive.

Do note though that it isn't that hard to write a regex that, when compiled, would itself be bigger than the entire regex crate. It doesn't even require writing a pathological regex because Unicode really bloats things up.

Interesting. In theory one could always measure, of course. In practice it means we cannot claim that always using compile-time regexes is a safe default :/ Maybe this could be mitigated by a compiler warning “this regex compiled to more than n KiB”?

This use case will be fleshed out a bit more with better tooling and less confusion Real Soon Now.

I guess I'll wait then :)

It's also worth pointing out that the compile time regexes that once existed didn't really reduce binary size. In particular, those compile time regexes were "just" a specialization of the PikeVM.

Oh, I didn't realize compile-time regexes were not compiled to native code. Do you think this would make sense? I guess besides smaller binary size it could also be even faster than the once existing (semi)compiled implementation.

Another approach would be just compiling the NFA

Sorry, I'm not following. We are talking about compiling a regex fully to native code: no regex compilers, no VMs, no NFA to DFA converters, no nothing, right? I understand how DFA can be turned into a straightforward native implementation. But how would you do it for NFA?

if I said I could hand you a regex engine that was an order of magnitude smaller (say, 20KB) but was also an order of magnitude slower, would you be happy about that?

For my particular case — yes. But I don't know if this answer is representative. How much work do you think it would require?

Also it just occurred to me that for WASM code executed in the browser there might be another solution: just delegate regex to JS! There will be some overhead for passing the strings back and forth, but matching itself could potentially be even faster, because it will be implemented in native code, while WASM is executed in a VM.

@BurntSushi
Copy link
Member

To clarify something: there is definitely no way that Regex::new is going to magically use compile time regexes under any circumstances. The best possible thing that could happen there is that Regex::new could become const at some point in the future. But doing that requires most or all of the Rust language (including dynamic memory allocation and interior mutability) to be const. And that's just compiling an existing regex at compile time. It is not actually building out a full DFA.

For the compile time regexes you're after, it would have to be a separate API.

Oh, I didn't realize compile-time regexes were not compiled to native code. Do you think this would make sense? I guess besides smaller binary size it could also be even faster than the once existing (semi)compiled implementation.

It was compiled to native code. The native code was a specialization of the PikeVM.

It's not obvious to me that compile time regexes will be faster. The problem here is that a regex search is far more complicated than a simple "let's do a DFA traversal." There are literal optimizations, and capturing groups and reverse searches and all sorts of other things. When I think of a compile time regex, I think of a very low level "let's build this DFA offline and embed it into our program so we can use it at runtime without needing to build it or bring in the machinery required to build it." It presupposes that you don't need the perf tricks of the full regex engine search.

Sorry, I'm not following. We are talking about compiling a regex fully to native code: no regex compilers, no VMs, no NFA to DFA converters, no nothing, right? I understand how DFA can be turned into a straightforward native implementation. But how would you do it for NFA?

OK, so when you say native code, it sounds like you're thinking about something that builds a DFA and then converts that DFA into a state machine with Rust enums and a match statement, right? If so, I don't personally have any plans to work on that any time soon, and if you want something like that, I think you'd be far better off using something like re2c or Ragel.

The compile time regexes that exist in regex-automata today build table based DFAs. The search runtime involves reading and traversing the table based DFA. The nice thing is that the same search runtime can be used for many DFAs, so the overall binary size of this approach is likely far superior than the "native code" approach I described above.

From there, it's perhaps easier to see how a compile-time NFA would work. You just need a slightly chunkier search runtime that knows how to interpret the NFA and run the PikeVM. It is chunkier, but only slightly so. You don't need the regex parser (which is, currently, a big chunk of the binary size), you don't need a compiler. You just need an NFA, the code to interpret it and the PikeVM. The PikeVM is itself quite small.

Here's the problem: "native code regex" and "compile time regex" and "const fn regex" are all kind of referring to the same idea, but there are actually many many different ways those ideas can be executed. The design space here is huuuuuge.

For my particular case — yes. But I don't know if this answer is representative. How much work do you think it would require?

#656 needs to get done first. Once that's done, the next step would probably be to write a parser that converts concrete syntax to a regex_syntax::hir::Hir directly. The parser would want to be written in a way that minimizes binary size, so you'd forget about error handling and maybe some other things. Once that exists, it'd be possible to build a smaller version of regex-syntax, then build regex-automata with just the PikeVM and supporting infrastructure and you'd be off to the traces. Now, it's not quite the most minimal thing you could possibly do, so I don't actually know if it would be enough. But it's the path I would try. The alternative is just go out and build your own regex engine while making trade offs that minimize binary size at every turn.

The key to understand is that the current regex crate was built without much attention to binary size, because failure modes and perf end up mattering more of the time to most people. I did add crate features to disable a lot of stuff, but they only take you so far.

Also it just occurred to me that for WASM code executed in the browser there might be another solution: just delegate regex to JS! There will be some overhead for passing the strings back and forth, but matching itself could potentially be even faster, because it will be implemented in native code, while WASM is executed in a VM.

Yes. To be honest, I assumed that you had already considered this and dismissed it for one reason or another. :P

@amatveiakin
Copy link
Author

For the compile time regexes you're after, it would have to be a separate API.

Of course. Making Regex::new do totally different things for literal and non-literal strings would definitely go against the principle of least surprise :) But this API could be a drop-in replacement where you change Regex::new to, say, regex! and all (or most) things just work, right?

OK, so when you say native code, it sounds like you're thinking about something that builds a DFA and then converts that DFA into a state machine with Rust enums and a match statement, right?

Yes, that's what I meant. Sorry if this sounds naive. My understanding of regex engines is based on a university course on formal grammars, so I have limited intuition of the real-word trade-offs involved :)

I think you'd be far better off using something like re2c or Ragel.

Interesting, I didn't know about these. Thanks for the links!
Out of curiosity, have you tried benchmarking re2c against the existing compile-time regex implementation?

You just need an NFA, the code to interpret it and the PikeVM. The PikeVM is itself quite small.

This sounds promising.

Here's the problem: "native code regex" and "compile time regex" and "const fn regex" are all kind of referring to the same idea, but there are actually many many different ways those ideas can be executed. The design space here is huuuuuge.

Indeed!

Also it just occurred to me that for WASM code executed in the browser there might be another solution: just delegate regex to JS! <...>

Yes. To be honest, I assumed that you had already considered this and dismissed it for one reason or another. :P

Now that I thought about it more, I see two problems with this solution.
First, this requires JS Snippets feature, which is supported only for some build targets. Namely, bundler and web should work, but nodejs and no-modules will not.
Second, there is the feature-parity problem. I'm writing a library that is used both server-side and client-side. The server side is pure Rust compiled to native code, so it uses the regex crate implementation. Suppose we delegate client-side regexes to JS. The problem is, JS and Rust support different sets of regex features. For example, JS regexes allow backreferences and I don't think there is a way to disable this. The resulting setup would be error-prone, because it opens the door to cases where a piece of code tested on the client starts mysteriously failing on the server or vice versa.
In my opinion, these two problems are nasty enough to avoid this path.

@BurntSushi
Copy link
Member

BurntSushi commented Oct 20, 2022

But this API could be a drop-in replacement where you change Regex::new to, say, regex! and all (or most) things just work, right?

If that were the case, it would imply that compile time regexes support things like capture matches. If you fix that as a requirement, that has quite a large implication on what exactly is being built at compile time. And if you're copying the existing API instead of providing something much more low level, my guess is that users of said API are going to assume that the compile-time API would be at least as fast at searching in all cases. That in turn implies that literal optimizations and basically the full regex engine and all its nooks and crannies are available at compile time. This would in turn negate a lot of the binary size advantage (although not all), and is also a very very very large task.

The thing is, building a data structure that is only used at runtime and building a data structure that can also be loaded cheaply at compile time are two very very different things. I did exactly that for DFAs in regex-automata (of which you are encouraged to peruse the source code), and it adds probably an entire order of magnitude of complexity to the code that wouldn't have otherwise needed to be there without the "compile time" use case. More to the point, a lot of that code requires unsafe.

This sort of design (which AFAIK is required to support "compile time API that is a drop-in replacement for the current API") just infects everything from how you read your data types to how you design your data types themselves. I do think it might be nice to do, but it is an enormous under-taking. We're not even just talking about regex internals either, both the memchr and aho-corasick crates would need to be overhauled to support it as well.

The comparatively simpler thing is to provide lower level compile time APIs. regex-automata does that for fully compiled DFAs. But what you can use them for is limited when compared to a full blown regex. And since their execution model is "just use a DFA," they don't have the literal optimizations that often speed up searches by an order of magnitude (or more).

Yes, that's what I meant. Sorry if this sounds naive. My understanding of regex engines is based on a university course on formal grammars, so I have limited intuition of the real-word trade-offs involved :)

Yeah so this and "drop in replacement for current API" are for the most part incompatible ideas. The former requires handling capture groups and start-of-match reporting. But a "state machine in code using enums" is just as powerful as the table-based DFAs that regex-automata builds today. Which is to say, the only thing it gives you is the end location of a match. (To get the start of a match, you need to build a second separate reverse DFA and use that to do a second scan.) Capturing groups require something more expressive than a DFA.

Out of curiosity, have you tried benchmarking re2c against the existing compile-time regex implementation?

No. It would be nice to do it, but it's nowhere near my priority. I would expect them to be competitive.

Second, there is the feature-parity problem. I'm writing a library that is used both server-side and client-side. The server side is pure Rust compiled to native code, so it uses the regex crate implementation. Suppose we delegate client-side regexes to JS. The problem is, JS and Rust support different sets of regex features. For example, JS regexes allow backreferences and I don't think there is a way to disable this. The resulting setup would be error-prone, because it opens the door to cases where a piece of code tested on the client starts mysteriously failing on the server or vice versa.

Indeed, if you require both regex engines to behave the same, then you either need to use a Javascript-regex compatible regex engine in the backend, or you need to bite the bullet on binary size with the regex crate for the foreseeable future.

And remember, building DFAs from regexes takes worst case exponential time. The regex crate goes to considerable effort to avoid building DFAs for exactly that reason.

Unfortunately, the theory you learned in school is almost useless in this context. The regex crate is 98% engineering and 2% theory. The theory parts are really important, but there's not much to it to be honest.

@amatveiakin
Copy link
Author

Thanks for the detailed answers! This was insightful.

my guess is that users of said API are going to assume that the compile-time API would be at least as fast at searching in all cases.

Not if it's positioned as a binary-size optimization (like "-Os"). But I understand that this use-case is rather niche. The two cases that come to mind are web and embedded.

The comparatively simpler thing is to provide lower level compile time APIs. regex-automata does that for fully compiled DFAs. But what you can use them for is limited when compared to a full blown regex. And since their execution model is "just use a DFA," they don't have the literal optimizations that often speed up searches by an order of magnitude (or more).

I've checked what happens if you do DFA::from_bytes plus re.find_leftmost_fwd(...). WASM binary is indeed much smaller than the regex crate even in the minimal configuration: 46 KiB vs 214 KiB uncompressed size. But of course there is a long way from find_leftmost_fwd to capture support, as I understand now.

Out of curiosity, have you tried benchmarking re2c against the existing compile-time regex implementation?

No. It would be nice to do it, but it's nowhere near my priority. I would expect them to be competitive.

I've benchmarked regex crate against re2c on regex-dna benchmark, and the regex crate came out much faster even without compilation. Probably this is due to literal and other optimizations that you've talked about.

To sum up. There doesn't seem to be any obviously superior solution for reducing the binary size of an application that needs to parse regexes. I've tried to list all the options that we discussed.

Solutions on regex crate side:

R1. Add a feature flag to make regex crate even smaller at the expense of execution speed (blocked on #656).
R2. Compile NFA to byte code. Requires an NFA, the code to interpret it and the PikeVM, which is quite small.
R3. Compile DFA to native code. Minimal overhead for the execution machinery, but: requires huge amounts of work to implement the full regex interface with capture support; code size may grow exponentially; is likely much slower than the run-time version (or requires to embed many optimizations, which reduces the binary size improvement).

Solutions on the user side:

U1. Produce the compiled regex via re2c, Ragel or flex. Requires a lot of manual work: not all tools support Rust output; they complicate the build pipeline; some tools rely on a custom syntax which is not exactly regex; things like capture groups may require additional efforts.
U2. For WASM: delegate regexes to JS. Works only for a fraction of build targets and cannot be made 100% compatible with Rust regex syntax, but could be a viable solution for a client-only library.
U3. Replace regexes with manual parsers. This is feasible if the regexes are few and simple.

(Frankly speaking, I'm inclined to go with U3 for now, because everything else seems surprisingly complicated)

Is this an accurate summary of this discussion? Do you think it makes sense to prioritize something like this for the regex crate in the near future?

@BurntSushi
Copy link
Member

BurntSushi commented Oct 26, 2022

Add a feature flag to make regex crate even smaller at the expense of execution speed

This already exists and it sounds like you've already taken advantage of it. And note that it isn't just about execution speed. You can disable Unicode too. That isn't about speed, but about functionality. If you keep Unicode around, then there's not much that can be done about binary size with just crate features. It might be possible to get a little smaller, but there's not really much left to pursue in this avenue.

Compile NFA to byte code. Requires an NFA, the code to interpret it and the PikeVM, which is quite small.

I think the better way of framing this is to "provide constant time deserialization APIs from &[u8] to NFA, and its dual serialization API that converts an NFA to a Vec<u8> that can be deserialized." If you have that, then everything else falls into place.

Compile DFA to native code. Minimal overhead for the execution machinery, but: requires huge amounts of work to implement the full regex interface with capture support; code size may grow exponentially; is likely much slower than the run-time version (or requires to embed many optimizations, which reduces the binary size improvement).

The "compile DFA to native code" bit is probably not something I plan on pursuing personally. regex-automata, today, already provides the deserialization APIs I mentioned above for NFAs, but for DFAs. So you already can build a DFA offline and then execute a search with it. You just have to use a table based DFA instead of Rust-code-as-a-DFA (re2c or Ragel style).

The latter half of your paragraph doesn't really follow. The latter half of your paragraph is basically: "provide constant time deserialization APIs from &[u8] to Regex, and its dual serialization API that converts a Regex to a Vec<u8> that can be deserialized." That requires the NFA deserialization routines, and doing that for all internal data structures and regex engines.

Do you think it makes sense to prioritize something like this for the regex crate in the near future?

I don't get paid to work on this. I don't schedule my time this way. And depending on what "something like this" is referring to, you're talking about man-years of effort. There is no "near term" here I think.

@BurntSushi
Copy link
Member

I don't get paid to work on this. I don't schedule my time this way. And depending on what "something like this" is referring to, you're talking about man-years of effort. There is no "near term" here I think.

Also, the other part of this is that I don't know whether I want to do some of these things at all. And I don't mean "not work on them" (I am actually personally interested in the work), but rather, whether they're a good fit. The biggest concern I have is that providing those constant time deserialization APIs (which is ultimately what you need to be able to execute regex searches without pulling in the full regex compilation machinery) creates a huge amount of code complexity and requires quite a bit of unsafe to do if you have any hope of not sacrificing search speed. Those are non-trivial downsides, and it may mean that the regex crate is just never something you can use if minimizing binary size is the only thing you care about.

I'm not saying anything for sure at the moment, but I just wanted to make it clear that the ideas themselves are not necessarily unambiguous good directions to head in.

@BurntSushi
Copy link
Member

I'm going to close this because I don't really think there is anything meaningfully actionable here. Basically what this issue boils down to is that the regex crate binary size is "big" for some sense of "big," and particularly in contexts where size might matter for. For example, in WASM. The regex crate already exposes a number of features to trim its binary size down considerably, but the end result is still kind of big. I don't think there's really much more that can be done here, other than building a regex-lite, as described in #961.

In theory, as discussed in this issue, another way out is to make it possible to build regexes into a serialized form, and then provide a way to use the regex's search runtime only along with the deserialization APIs. This could lead to considerably smaller overall binary size, but the complexity (as discussed above) in doing this is enormous. It isn't going to happen any time soon, and I'm not even sure it should happen. In particular, it opens up a huge can of worms with respect to maintaining a stable binary format for regexes, which in turn makes evolution of the implementation itself far more precarious.

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2023
@amatveiakin
Copy link
Author

it opens up a huge can of worms with respect to maintaining a stable binary format for regexes

Is a stable binary format really necessary? If the serialized form is produced by a proc macro (regex!), then it should be possible to guarantee that the parser and the executor are always updated together.

Still, I see that building a compile-time regex parser (however one defines compile-time) is a huge effort which increases maintenance burden and has serious downsides. So closing the issue seems fair.

A regex-lite crate sounds great! Thanks for considering this.

@BurntSushi
Copy link
Member

BurntSushi commented Mar 6, 2023

Is a stable binary format really necessary? If the serialized form is produced by a proc macro (regex!), then it should be possible to guarantee that the parser and the executor are always updated together.

You're right that this path is technically part of the design space, and would address your specific problem. But IMO, if we got all the way to this point, then it will very easily beget the most obvious next request: expose the binary format so that it can be used more flexibly. And if I resolutely reject doing so, it's almost certain that people will use it anyway. Because AIUI in order for something like regex! to work, that format will need to be exposed in some way that others will be able to depend on, even if it isn't part of what I consider to be the public API, it will still technically be public.

@amatveiakin
Copy link
Author

it's almost certain that people will use it anyway

Hmm. Yeah, that sounds plausible.

AIUI in order for something like regex! to work, that format will need to be exposed in some way

I think so too. IIUC you'll have to put the macro into a separate public crate due to Rust/Cargo limitations. And then you can't stop people from using that crate directly.

@BurntSushi
Copy link
Member

I think so too. IIUC you'll have to put the macro into a separate public crate due to Rust/Cargo limitations. And then you can't stop people from using that crate directly.

Right, although to be clear here, what I'd expect would need to happen is that regex would need to expose some kind of API that can be called by that macro crate. And since the macro crate is just another crate, the API exposed by regex would be callable by anyone. I would of course slap a #[doc(hidden)] on it, but it's very likely people will find it and use it. Because having a binary format you can serialize and store of a regex and then load later is an incredibly useful and flexible thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants