Binary Size #583
Comments
I'm aware of the large binary size. Could you help me understand the purpose of this ticket? Is it a request to reduce the binary size?
Without doing any actual work, my guess is that the large binary size is attributable to the large Unicode tables in
It would be helpful to put the binary size into context, e.g., by comparing it with other mature regex engines with similar performance goals such as PCRE2. (FWIW, one of my longer term goals is to provide a "lite" regex library which compiles faster and has a smaller binary size, but gives up performance and potentially some functionality, such as Unicode support.)
Moreover, I don't think regex can even work on resource constrained systems today, in practice. At least, not where 1MB would be an issue, since the standard library is a hard dependency at present. At some point, I'd like to add a no-std-with-alloc mode, and that might open some doors and put some pressure on reducing the binary size. |
In addition to what's already been said above, it's worth noting that if you're targeting small size, using the default release config doesn't make much sense for comparison, as it aims for a balance between compilation speed and execution speed. Instead, adding these few lines to Cargo.toml can already shave off ~320 KB (30% compared to the original 1 MB on my machine) of a sample binary:
[profile.release]
lto = true
codegen-units = 1
opt-level = "z" |
@BurntSushi thanks a lot for explanation! Using tricks from here allowed me to reduce size from 1.4M to 1M. I think having "lite" regex version would be amazing. And thanks for your work :) |
Right-- we already use
I certainly would like to reduce the binary size, but I'm not intending to request anything from you aside from a place to discuss possible improvements and track ongoing work. I'm planning to spend some time investigating possible improvements, and I figured an issue on this repo would be a good place to discuss and share progress with others who may have the same goal.
Yes, I think so as well (WRT unicode tables). One option I had considered was investigating the possibility of a regex-syntax-free
In Fuchsia, we have a large number of separate Rust binaries that range from 250-750KB, all of which use the standard library. Adding a |
It still seems like small potatoes to me. Could you say more about the decision procedure here? Could you also say the size at which I kind of feel like your decision procedure here is pretty opaque, so without a target, it's a bit tricky to know how to make progress. For example, if you said the crate had to reduce its size by an order of magnitude, then it would be easy for me to say that For example, we could consider adding a
That's interesting, but it's not obvious to me that it's a clear win. At that point, the compiled regex program will become part of your binary, and regex programs aren't themselves exactly small. There's probably also some implementation difficulties in getting this to work, but they are perhaps not insurmountable. |
Perhaps I'm not explaining clearly: for our programs, adding a use of regex increases their size by 25-50% (~200KB addition to a 250-500KB program). 50KB would probably be small enough to consider general usage. Unfortunately I can't be too precise about our specific project goals, but to give some sense of scale, single digit megabyte increases or decreases to the system matter quite a bit (enough to be worth spending multiple engineering days on them). With that in mind, the per-target effort reduced by including the
This sounds like a great thing to experiment with! I think the biggest wins are probably likely to come from dropping unnecessary copies of the unicode tables (e.g. in binaries whose regular expressions don't require them), but I think that's a much harder to achieve goal, and minimizing |
@cramertj All righty. I'll do some experimenting soonish w.r.t. inlining and Unicode and see what I can come up with. Longer term, the "lite" regex crate will probably be the better bet, although I was still planning on using Also, I feel obliged to mention |
Okay, so as far as I can tell, a good chunk of the binary size (~60KB) belongs
The next thing I tried was removing all of the Unicode tables from
Moreover, removing the Unicode tables shaves off about 2 seconds of compile
The next thing I tried was to remove all of the pertinent
use regex::Regex;
fn main() {
    let data = std::fs::read("wat").unwrap();
    assert!(regex::bytes::Regex::new(r"\w").unwrap().is_match(&data));
    let data = std::fs::read("wat").unwrap();
    let _ = regex::bytes::Regex::new(r"\w").unwrap().find(&data).unwrap();
    let data = std::fs::read("wat").unwrap();
    let _ = regex::bytes::Regex::new(r"\w").unwrap().captures(&data).unwrap();
    let data = std::fs::read("wat").unwrap();
    let sdata = String::from_utf8(data).unwrap();
    assert!(Regex::new(r"\w").unwrap().is_match(&sdata));
    let data = std::fs::read("wat").unwrap();
    let sdata = String::from_utf8(data).unwrap();
    let _ = Regex::new(r"\w").unwrap().find(&sdata).unwrap();
    let data = std::fs::read("wat").unwrap();
    let sdata = String::from_utf8(data).unwrap();
    let _ = Regex::new(r"\w").unwrap().captures(&sdata).unwrap();
}
This hits on the three major code paths in the regex engine (is_match, find
Overall, removing all of the
And then with the annotations removed:
So that's a reduction of about 36%. Which seems pretty good. This is further
And now without inlining:
You can see here that the various But a 100KB still looks to be an overall fairly small fraction of the total It sounds to me like 700KB is not good enough for your use case. Is that right?
N.B. As my numbers show above, I'm seeing a much greater increase in |
OK, I've begun to make some progress on this particular issue. In light of the recent focus on dependency weight, I decided to try my hand at making a "regex lite" a reality. But instead of achieving that by building a completely separate crate (which was my original intent), I'm now thinking that it can be done by use of crate features. This overlaps a great deal with this issue, so I decided to just tackle them both at the same time.
The high level idea is that it should be possible to strip regex down to its bare essentials: a non-Unicode aware parser, no literal optimizations, no aggressive inlining, no lazy DFA and no fast caching. Collectively, that should reduce compile times, reduce binary size and reduce the dependency count of
Thus far, I've done the work to make all Unicode data optional. Effectively, this surfaces itself via additional errors when building a regex. For example, if you turn off the Unicode case tables, but try to compile
An alternative strategy would be to permit all these constructs to continue to compile, but simply omit any additional Unicode processing. Aside from certain things being a little weird, the biggest problem with this is that someone might disable Unicode features, write
Thus far, my planned feature breakdown for
I plan to re-export these features from With just dropping the Unicode data, compile times improve by 30% in debug mode, and 15% in release mode:
I was kind of hoping for a better improvement here, but as my analysis above showed, it turns out that the regex parser and its supporting infrastructure is pretty beefy all on its own. |
Does this mean dropping extra matching engines and literal optimizations as well? I didn't see any features for turning off different parts of the back end, which is why I ask. On the face of it, it seems like it could be desirable to strip everything down to just the PikeVM. If you do end up deciding to provide feature flags for each of the extra matching engines, it means I should add one to the onepass patch right? |
Yeah, the knobs don't exist. I'm in the process of adding them. I don't necessarily want a knob for every matching engine. For instance, I don't see a reason to provide a knob to turn off the backtracker. The code is small. The lazy DFA, on the other hand, is substantially larger.
I would hold off on making changes to onepass. My very loose plan at the moment is to figure out how to bring the onepass engine into I'm sorry I haven't been able to merge it yet, but I just haven't been able to push myself to do it because I'm so concerned about bugs in the existing infrastructure, which is extremely brittle. But that's a separate conversation... |
Gotcha. Definitely not trying to rush you. Do you mean existing infrastructure in the rest of the regex crate or in the onepass patch? Is there anything I can help with? |
No, not onepass. I mean the existing infrastructure. I think I'm at a point where trying to split up the work would be counter-productive. There's too much shifting, and the full picture isn't quite clear to me yet. I also expect this to take months or maybe even a year to really finish. Basically, I'm looking to solve a lot of the fundamental issues that plague Once that base is there, I'm really hoping things will open up for many more optimization opportunities, including things like onepass. Thank you for your offer though. :-) |
OK, got it. Those sorts of architectural decisions do seem like something you have to do on your own. |
This commit refactors the way this library handles Unicode data by making it completely optional. Several features are introduced which permit callers to select only the Unicode data they need (up to a point of granularity). An important property of these changes is that the presence or absence of crate features will never change the match semantics of a regular expression. Instead, the presence or absence of a crate feature can only add or subtract from the set of all possible valid regular expressions. So for example, if the `unicode-case` feature is disabled, then attempting to produce `Hir` for the regex `(?i)a` will fail. Instead, callers must use `(?i-u)a` (or enable the `unicode-case` feature). This partially addresses #583 since it permits callers to decrease binary size.
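The property described in this commit message (a disabled feature rejects patterns at build time and never silently changes what they match) can be illustrated with a small standalone sketch. It does not use the regex crate or reflect its internals: `BuildError`, `leading_flags` and `check_pattern` are hypothetical names invented here, the `unicode_case_enabled` argument stands in for a compile-time crate feature, and the flag parsing is assumed to handle only a leading group containing `i`, `u` and `-`.

```rust
/// Hypothetical error for this sketch; not the regex crate's real error type.
#[derive(Debug, PartialEq)]
enum BuildError {
    /// The pattern needs Unicode case folding, but that data was compiled out.
    UnicodeCaseUnavailable,
}

/// Parse a leading flag group such as `(?i)` or `(?i-u)`.
/// Returns `(case_insensitive, unicode)`. Unicode mode defaults to on.
/// Assumption: only the `i` and `u` flags in a leading group are understood.
fn leading_flags(pattern: &str) -> (bool, bool) {
    let defaults = (false, true);
    let rest = match pattern.strip_prefix("(?") {
        Some(rest) => rest,
        None => return defaults,
    };
    let end = match rest.find(')') {
        Some(end) => end,
        None => return defaults,
    };
    let (mut case_insensitive, mut unicode) = defaults;
    let mut negate = false;
    for c in rest[..end].chars() {
        match c {
            // Flags after `-` are switched off.
            '-' => negate = true,
            'i' => case_insensitive = !negate,
            'u' => unicode = !negate,
            // Anything else (for example `(?:...)`) is not a flag group here.
            _ => return defaults,
        }
    }
    (case_insensitive, unicode)
}

/// Reject patterns that would need data the build does not include.
/// `unicode_case_enabled` is a runtime stand-in for a crate feature.
fn check_pattern(pattern: &str, unicode_case_enabled: bool) -> Result<(), BuildError> {
    let (case_insensitive, unicode) = leading_flags(pattern);
    if case_insensitive && unicode && !unicode_case_enabled {
        return Err(BuildError::UnicodeCaseUnavailable);
    }
    Ok(())
}
```

In this sketch, `check_pattern("(?i)a", false)` returns an error while `check_pattern("(?i-u)a", false)` succeeds, mirroring the `(?i)a` versus `(?i-u)a` example above. The point of the design is that a pattern either builds with the same meaning it always had or fails loudly, so turning off a feature cannot cause a regex to quietly match something different.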
All right, I've done what I can I think. I've managed to decrease the binary size overhead of I spent a couple additional hours trying to see if I could shrink the binary size even more. But the tools available to me to debug and fix that sort of issue appear to be fairly limited. The parser is definitely taking up a chunk of space, but there's really no one huge thing left that's being reported by I've posted a bit more of an analysis in #613. |
This isn't yet focused on any particular issue, but if the maintainers are okay with it, I'd like to use this as an issue to track improvements to binary size. Today, a simple hello world built with `--release` and stripped is ~240KB. Adding `regex::Regex::new("x").unwrap()` to that binary and recompiling + restripping gives ~1.25MB. That's an increase of over 1 megabyte, which is quite large when targeting resource-constrained systems.
`cargo bloat --release` output:
`cargo bloat --release --crates` output: