Compile-time regex for smaller WASM binary size #913
Hmmm, I'm not sure that's quite accurate. I deemed them mostly unnecessary for the use case of wanting them to be statically verified, but I don't think I meant to say that the primary use case of compile time regexes themselves was "just" verification. In particular, that wasn't my thinking when I originally added the `regex!` macro.
I agree that this would substantially reduce binary sizes in a large number of cases. Do note though that it isn't that hard to write a regex that, when compiled, would itself be bigger than the entire regex crate. It doesn't even require writing a pathological regex, because Unicode really bloats things up. And that's just a single regex. As soon as you start writing a lot of regexes, it's not hard for those to combine into something quite large. But the good news is that this is possible to do today via regex-automata's DFA serialization APIs, although the flow for doing it has not been streamlined yet.
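To illustrate the flow that exists today, here is a minimal sketch assuming the regex-automata 0.4 dense DFA API with default crate features; the pattern is just an example. The DFA is built and serialized once, then later loaded and searched without needing the parser or compiler:

```rust
use regex_automata::{
    dfa::{dense::DFA, Automaton},
    Input,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "Build time": this half needs the full parser/compiler machinery.
    // In a real setup it would run offline, e.g. in a build script or
    // via the regex-cli tool from the regex repo.
    let original = DFA::new(r"[0-9]{4}-[0-9]{2}")?;
    let (bytes, pad) = original.to_bytes_little_endian();

    // "Run time": only cheap validation plus a zero-copy load.
    // `pad` is leading padding inserted to keep the DFA data aligned.
    let dfa: DFA<&[u32]> = DFA::from_bytes(&bytes[pad..])?.0;
    assert!(dfa.try_search_fwd(&Input::new("on 2023-06-01"))?.is_some());
    Ok(())
}
```

Only the deserialize-and-search half would ship in the final binary.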
All of this stuff is a WIP, which is why it's a bit of a mess right now. This use case will be fleshed out a bit more, with better tooling and less confusion, Real Soon Now. It's also worth pointing out that the compile time regexes that once existed didn't really reduce binary size. In particular, those compile time regexes were "just" a specialization of the PikeVM. It avoided some runtime costs in the "dynamic" PikeVM, but it still relied on the NFA internals and all that. Another approach would be just compiling the NFA, which would be much, much smaller than a DFA in many cases. But that requires more work.
It really really depends. One can build a much smaller regex library by giving up some things, notably, perf and failure modes. For the sake of argument, if I said I could hand you a regex engine that was an order of magnitude smaller (say, 20KB) but was also an order of magnitude slower, would you be happy about that?
I see, thanks for the clarification! The benchmark results look impressive.
Interesting. In theory one could always measure, of course. In practice it means we cannot claim that always using compile-time regexes is a safe default :/ Maybe this could be mitigated by a compiler warning “this regex compiled to more than n KiB”?
I guess I'll wait then :)
Oh, I didn't realize compile-time regexes were not compiled to native code. Do you think this would make sense? I guess besides smaller binary size, it could also be even faster than the (semi-)compiled implementation that once existed.
Sorry, I'm not following. We are talking about compiling a regex fully to native code: no regex compilers, no VMs, no NFA to DFA converters, no nothing, right? I understand how DFA can be turned into a straightforward native implementation. But how would you do it for NFA?
For my particular case: yes. But I don't know if this answer is representative. How much work do you think it would require? Also it just occurred to me that for WASM code executed in the browser there might be another solution: just delegate regex to JS! There will be some overhead for passing the strings back and forth, but matching itself could potentially be even faster, because it will be implemented in native code, while WASM is executed in a VM.
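For what it's worth, here's a rough sketch of that delegation idea, assuming a wasm-bindgen project. The exported function and the pattern are hypothetical; `js_sys::RegExp::new` and `RegExp::test` are real bindings to the browser's engine:

```rust
use js_sys::RegExp;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn is_date_like(haystack: &str) -> bool {
    // The pattern is compiled and executed by the JS engine's native
    // regex implementation, so no Rust regex machinery ends up in the
    // WASM binary; the cost is crossing the JS/WASM boundary per call.
    let re = RegExp::new(r"^\d{4}-\d{2}-\d{2}$", "u");
    re.test(haystack)
}
```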
To clarify something: there is definitely no way that this could be bolted onto the existing API transparently. For the compile time regexes you're after, it would have to be a separate API.
It was compiled to native code. The native code was a specialization of the PikeVM. It's not obvious to me that compile time regexes will be faster. The problem here is that a regex search is far more complicated than a simple "let's do a DFA traversal." There are literal optimizations, and capturing groups and reverse searches and all sorts of other things. When I think of a compile time regex, I think of a very low level "let's build this DFA offline and embed it into our program so we can use it at runtime without needing to build it or bring in the machinery required to build it." It presupposes that you don't need the perf tricks of the full regex engine search.
OK, so when you say native code, it sounds like you're thinking about something that builds a DFA and then converts that DFA into a state machine with Rust enums and a match expression in a loop. The compile time regexes that existed in the old `regex-macros` crate didn't do that; as mentioned, they were a specialization of the PikeVM.

From there, it's perhaps easier to see how a compile-time NFA would work. You just need a slightly chunkier search runtime that knows how to interpret the NFA and run the PikeVM. It is chunkier, but only slightly so. You don't need the regex parser (which is, currently, a big chunk of the binary size), and you don't need a compiler. You just need an NFA, the code to interpret it and the PikeVM. The PikeVM is itself quite small.

Here's the problem: "native code regex" and "compile time regex" and "const fn regex" are all kind of referring to the same idea, but there are actually many many different ways those ideas can be executed. The design space here is huuuuuge.
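For intuition, here is a toy version of that NFA idea. Everything below (`Inst`, `run`, `add_state`, the program layout) is invented for illustration; a real regex-automata NFA is far richer, with captures, Unicode classes, and so on. It shows an embedded NFA program for the anchored regex `ab*c` plus the small breadth-first interpreter that a PikeVM boils down to:

```rust
#[derive(Clone, Copy)]
enum Inst {
    Byte(u8, usize),     // consume this byte, then go to state
    Split(usize, usize), // epsilon-branch to two states
    Match,               // accepting state
}

// Follow epsilon (Split) edges eagerly when adding a state to a set.
fn add_state(prog: &[Inst], set: &mut [bool], i: usize) {
    if set[i] {
        return;
    }
    set[i] = true;
    if let Inst::Split(a, b) = prog[i] {
        add_state(prog, set, a);
        add_state(prog, set, b);
    }
}

// Thompson-style simulation: track the set of live NFA states.
fn run(prog: &[Inst], haystack: &[u8]) -> bool {
    let mut current = vec![false; prog.len()];
    add_state(prog, &mut current, 0);
    for &byte in haystack {
        let mut next = vec![false; prog.len()];
        for i in 0..prog.len() {
            if !current[i] {
                continue;
            }
            if let Inst::Byte(want, to) = prog[i] {
                if want == byte {
                    add_state(prog, &mut next, to);
                }
            }
        }
        current = next;
    }
    (0..prog.len()).any(|i| current[i] && matches!(prog[i], Inst::Match))
}

fn main() {
    // NFA program for `ab*c`:
    //   0: 'a' -> 1
    //   1: split(2, 3)   (enter the `b*` loop, or move on to `c`)
    //   2: 'b' -> 1
    //   3: 'c' -> 4
    //   4: match
    let prog = [
        Inst::Byte(b'a', 1),
        Inst::Split(2, 3),
        Inst::Byte(b'b', 1),
        Inst::Byte(b'c', 4),
        Inst::Match,
    ];
    assert!(run(&prog, b"abbbc"));
    assert!(run(&prog, b"ac"));
    assert!(!run(&prog, b"abx"));
}
```

The point is that the interpreter is a fixed, small amount of code, while the NFA itself is just data that could in principle be produced at compile time.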
#656 needs to get done first. Once that's done, the next step would probably be to write a parser that converts concrete syntax into a form usable at compile time. The key thing to understand is that the current regex crate was built without much attention to binary size, because failure modes and perf end up mattering more of the time to most people. I did add crate features to disable a lot of stuff, but they only take you so far.
Yes. To be honest, I assumed that you had already considered this and dismissed it for one reason or another. :P
Of course.
Yes, that's what I meant. Sorry if this sounds naive. My understanding of regex engines is based on a university course on formal grammars, so I have limited intuition about the real-world trade-offs involved :)
Interesting, I didn't know about these. Thanks for the links!
This sounds promising.
Indeed!
Now that I've thought about it more, I see two problems with this solution.
If that were the case, it would imply that compile time regexes support things like capture matches. If you fix that as a requirement, that has quite a large implication for what exactly is being built at compile time. And if you're copying the existing API instead of providing something much more low level, my guess is that users of said API are going to assume that the compile-time API would be at least as fast at searching in all cases. That in turn implies that literal optimizations and basically the full regex engine and all its nooks and crannies are available at compile time. This would in turn negate a lot of the binary size advantage (although not all), and is also a very very very large task.

The thing is, building a data structure that is only used at runtime and building a data structure that can also be loaded cheaply at compile time are two very very different things. I did exactly that for DFAs in regex-automata (of which you are encouraged to peruse the source code), and it adds probably an entire order of magnitude of complexity to the code that wouldn't have otherwise needed to be there without the "compile time" use case. More to the point, a lot of that code requires `unsafe`.

This sort of design (which AFAIK is required to support "compile time API that is a drop-in replacement for the current API") just infects everything, from how you read your data types to how you design your data types themselves. I do think it might be nice to do, but it is an enormous undertaking. And we're not even just talking about regex internals either; it implicates the surrounding crates too.

The comparatively simpler thing is to provide lower level compile time APIs.
Yeah so this and "drop in replacement for current API" are for the most part incompatible ideas. The former requires handling capture groups and start-of-match reporting. But a "state machine in code using enums" is just as powerful as the table-based DFAs that regex-automata builds, and a DFA on its own cannot report capture group spans at all.
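To make that equivalence concrete, here is a hand-written toy (not something any crate generates): a whole-string matcher for `ab*c`, with the DFA's transition table baked into an enum and a match expression. Note there is nowhere for capture offsets to come from:

```rust
#[derive(Clone, Copy)]
enum State {
    Start,
    SeenA, // saw `a`, looping on `b*`
    Done,  // saw the final `c`
    Dead,  // no match is possible anymore
}

/// Whole-string match for the toy regex `ab*c`.
fn matches_ab_star_c(input: &str) -> bool {
    let mut state = State::Start;
    for b in input.bytes() {
        state = match (state, b) {
            (State::Start, b'a') => State::SeenA,
            (State::SeenA, b'b') => State::SeenA,
            (State::SeenA, b'c') => State::Done,
            _ => State::Dead,
        };
    }
    matches!(state, State::Done)
}

fn main() {
    assert!(matches_ab_star_c("ac"));
    assert!(matches_ab_star_c("abbbc"));
    assert!(!matches_ab_star_c("abcx"));
}
```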
No. It would be nice to do, but it's nowhere near the top of my priority list. I would expect them to be competitive.
Indeed, if you require both regex engines to behave the same, then you either need to use a JavaScript-compatible regex engine in the backend, or you need to bite the bullet on binary size with the `regex` crate.

And remember, building DFAs from regexes takes worst case exponential time; the classic example is `[01]*1[01]{N}`, whose minimal DFA needs on the order of 2^N states.

Unfortunately, the theory you learned in school is almost useless in this context.
Thanks for the detailed answers! This was insightful.
Not if it's positioned as a binary-size optimization (like "-Os"). But I understand that this use case is rather niche. The two cases that come to mind are web and embedded.
I've checked what happens if you disable the optional crate features.
I've benchmarked the alternatives I could get working.

To sum up: there doesn't seem to be any obviously superior solution for reducing the binary size of an application that needs to parse regexes. I've tried to list all the options that we discussed.

Solutions on the library side:

R1. Add a feature flag that makes the regex engine smaller at the cost of functionality or speed.

Solutions on the user side:

U1. Produce the compiled regex ahead of time in serialized form and embed the bytes into the binary.
U2. Compile the regex (e.g. a DFA) to native code at build time.
U3. For WASM in the browser: delegate regex matching to JavaScript.

(Frankly speaking, I'm inclined to go with U3 for now, because everything else seems surprisingly complicated.)

Is this an accurate summary of this discussion? Do you think it makes sense to prioritize something like this in the near term?
This already exists and it sounds like you've already taken advantage of it. And note that it isn't just about execution speed. You can disable Unicode too. That isn't about speed, but about functionality. If you keep Unicode around, then there's not much that can be done about binary size with just crate features. It might be possible to get a little smaller, but there's not really much left to pursue in this avenue.
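For reference, a sketch of what that trimming looks like in a manifest; the feature names below are the documented regex 1.x umbrella features, and the exact set to re-enable depends on what your patterns need:

```toml
[dependencies.regex]
version = "1"
default-features = false
# Dropping "unicode" saves the most space, at the cost of Unicode-aware
# case folding and character classes; "std" and "perf" keep the usual
# runtime behavior and the faster (but bigger) engines.
features = ["std", "perf"]
```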
I think the better way of framing this is to "provide constant time deserialization APIs from raw bytes."
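Concretely, that framing could look like the sketch below. Hedged heavily: the file name is invented, and the alignment wrapper mirrors the pattern shown in regex-automata's own deserialization docs. Only pre-built DFA bytes and the search runtime end up in the binary:

```rust
use regex_automata::{
    dfa::{dense::DFA, Automaton},
    Input,
};

// include_bytes! only guarantees byte alignment, but DFA::from_bytes
// needs the buffer aligned like &[u32], hence this wrapper type.
#[repr(C)]
struct Aligned<B: ?Sized> {
    _align: [u32; 0],
    bytes: B,
}

// Hypothetical file produced offline (e.g. by a build script or
// regex-cli) with to_bytes_little_endian(), leading padding stripped.
static DFA_BYTES: &Aligned<[u8]> = &Aligned {
    _align: [],
    bytes: *include_bytes!("my_regex.littleendian.dfa"),
};

fn is_match(haystack: &str) -> bool {
    // Cheap validation plus a zero-copy load; real code would do this
    // once and cache the result rather than per call.
    let (dfa, _) = DFA::from_bytes(&DFA_BYTES.bytes)
        .expect("embedded DFA bytes are valid");
    dfa.try_search_fwd(&Input::new(haystack))
        .expect("search failed")
        .is_some()
}
```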
The "compile DFA to native code" bit is probably not something I plan on pursuing personally. The latter half of your paragraph doesn't really follow. The latter half of your paragraph is basically: "provide constant time deserialization APIs from
I don't get paid to work on this. I don't schedule my time this way. And depending on what "something like this" is referring to, you're talking about man-years of effort. There is no "near term" here, I think.
Also, the other part of this is that I don't know whether I want to do some of these things at all. And I don't mean "not work on them" (I am actually personally interested in the work), but rather, whether they're a good fit. The biggest concern I have is that providing those constant time deserialization APIs (which is ultimately what you need to be able to execute regex searches without pulling in the full regex compilation machinery) creates a huge amount of code complexity and requires quite a bit of `unsafe`.

I'm not saying anything for sure at the moment, but I just wanted to make it clear that the ideas themselves are not necessarily unambiguous good directions to head in.
I'm going to close this because I don't really think there is anything meaningfully actionable here. Basically what this issue boils down to is that the regex crate binary size is "big" for some sense of "big," particularly in contexts where size matters, for example in WASM. The regex crate already exposes a number of features to trim its binary size down considerably, but the end result is still kind of big. I don't think there's really much more that can be done here, other than building a separate, smaller regex engine that gives things up to get there.

In theory, as discussed in this issue, another way out is to make it possible to build regexes into a serialized form, and then provide a way to use only the regex's search runtime, along with the deserialization APIs. This could lead to considerably smaller overall binary size, but the complexity (as discussed above) in doing this is enormous. It isn't going to happen any time soon, and I'm not even sure it should happen. In particular, it opens up a huge can of worms with respect to maintaining a stable binary format for regexes, which in turn makes evolution of the implementation itself far more precarious.
Is a stable binary format really necessary? If the serialized form is produced by a proc macro at build time, with the same library version that will later deserialize it, then the format never needs to be stable across releases.

Still, I see that building a compile-time regex parser (however one defines compile-time) is a huge effort which increases the maintenance burden and has serious downsides. So closing the issue seems fair.
You're right that this path is technically part of the design space, and would address your specific problem. But IMO, if we got all the way to this point, then it will very easily beget the most obvious next request: expose the binary format so that it can be used more flexibly. And if I resolutely reject doing so, it's almost certain that people will use it anyway. Because AIUI, in order for something like a proc macro to generate and embed that serialized form, the deserialization APIs it calls into would have to be public anyway.
Hmm. Yeah, that sounds plausible.
I think so too. IIUC you'll have to put the macro into a separate public crate due to Rust/Cargo limitations. And then you can't stop people from using that crate directly.
Right, although to be clear here, what I'd expect would need to happen is that the deserialization APIs would be exposed (probably from regex-automata), with any proc macro built on top of them.
I would like to once again raise the question of compile-time regular expressions. They were once provided by the `regex-macros` crate, but are no longer supported.

If I understand correctly, compile-time regular expressions were deemed unnecessary because their primary use case was compile-time pattern verification, which can now be achieved by other means, like the `invalid_regex` Clippy lint or the `lazy-regex` crate.

But there is another reason one might want to avoid including a regex compiler in the application, and that is binary size. This becomes especially important with WASM targets. The increase in WASM binary size from the `regex` crate is very noticeable. In my experiments it varies from ~200 KiB to ~600 KiB depending on the features. I made a small demo, check it out to see for yourself: https://github.com/amatveiakin/regex-wasm-demo

Even 200 KiB is a lot for a web app and is best avoided. And I don't think there is a better way to achieve this than building the regex at compile time. What do you think?