feature request: Potential removal of the Regex crate #108
Comments
Vanilla Rust is enough but I also propose the usage of the winnow crate to produce clearer code. |
This thought has crossed my mind. The C version of the library (libxlsxwriter) doesn't use regexes and instead uses reasonably simple hand rolled parsing code. So it is feasible. However, just to take a step back. What savings in compile time and binary size would a change like this give us? Is there a way to estimate that without mocking out the regex code? Also, what about a middle ground of using regex-lite from #106 and where I add workarounds for the Unicode limitations? And finally, |
There are some improvements just by changing to regex-lite, as you can read in their motivation. TL;DR: regex-lite compiles around 60% faster and is around 80% smaller (471KB less). I can't think of a way to measure the impact of this proposed change on this crate specifically without implementing it for real. Right now, we are using the default features of the regex crate: "std", "perf", "unicode", "regex-syntax/default". The above benchmark only used the "std" feature, so I would expect the compilation time and binary size reductions to be even greater if we use hand rolled code. Also, regex-lite is not as fast as regex, while hand rolled code would be. Because of those points I would not recommend the middle ground. The standard library takes a lot of care when dealing with Unicode, so I would not worry about those limitations if we were to go ahead with my proposal. About dev dependencies, we can do as we please haha. There's no impact on user facing code. I recommend keeping the regex crate in that case to avoid any edge cases. |
Check out these benchmarks also: https://github.com/BurntSushi/rebar |
That isn't a major concern since the regexes aren't used on the fast path. However overall I think you are right that if we are going to replace it with something then it would be best to remove it altogether.
Lol. Some work companions wrote/work on one of the better libraries in that benchmark.
Probably it is just worth making the changes and seeing what the resulting size/time difference is. Some of the regexes are simple and can be replaced with String/str methods. Do you want me to kick it off on a branch and start with some of the lower hanging regexes, or do you just want to jump in? |
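To illustrate the kind of simple regex that plain `str` methods can stand in for, here is a minimal sketch. The two patterns (`^[0-9]+$` and a leading `=`) and both helper names are hypothetical and chosen only for illustration; they are not taken from the crate's source.

```rust
/// Hypothetical helper: behaves like matching the regex `^[0-9]+$`.
fn is_ascii_digits(s: &str) -> bool {
    // An empty string must not match, just as `+` requires at least one digit.
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Hypothetical helper: behaves like replacing the regex `^=` with "".
fn strip_leading_equals(formula: &str) -> &str {
    // `strip_prefix` returns `None` when the prefix is absent, so fall back
    // to the unchanged input in that case.
    formula.strip_prefix('=').unwrap_or(formula)
}

fn main() {
    println!("{}", is_ascii_digits("2024"));
    println!("{}", strip_leading_equals("=SUM(A1:A3)"));
}
```

Neither helper allocates, and both borrow from the input, which is part of why hand rolled code can be cheaper than a compiled regex for cases this small.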
I'd prefer it if you started with the low hanging ones. I plan to start working on the others this Thursday because I have some deadlines before that. |
@adriandelgado I've created a branch called
If you want to review as I go then please do. |
I've converted the utility.rs file as well. It is starting to take shape: I had to comment out 2 tests for worksheet name quoting that contain emoji characters. These aren't very important since the default will be to quote the worksheet names that contain the emoji characters. While this isn't strictly correct it isn't an error in Excel. To work around this would require a match for all the emoji Unicode characters: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AEmoji%3DYes%3A%5D&esc=on&g=&i= This is 1,424 code points to match against. Some (many) could be condensed into ranges (the above tool does that with "Abbreviate"): @adriandelgado can you think of an efficient way of matching that number of characters or ranges? It doesn't need to be super efficient because it is used in a function that won't be called often (or at all). Maybe a match statement will be sufficient. A harder problem will be escaping the Excel "future" functions: https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/formula.rs#L995 These could be on a fast path when writing a lot of formulas so I need to take care to make it efficient. |
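For what it's worth, a match over character ranges could look something like the sketch below. The function names are hypothetical, and the ranges are only a small, deliberately incomplete subset of the Emoji property, so this is an illustration of the shape of the code and not a usable emoji test.

```rust
/// Hypothetical helper: returns true if `c` is in one of a few emoji ranges.
/// NOTE: this is an incomplete subset used for illustration only; the full
/// Emoji=Yes set has far more code points and ranges.
fn is_emoji_subset(c: char) -> bool {
    matches!(
        c,
        '\u{203C}'
            | '\u{2049}'
            | '\u{2600}'..='\u{2604}'
            | '\u{1F300}'..='\u{1F321}'
            | '\u{1F600}'..='\u{1F64F}'
    )
}

/// Hypothetical helper: true if any character of `name` is in the subset.
fn contains_emoji_subset(name: &str) -> bool {
    name.chars().any(is_emoji_subset)
}

fn main() {
    println!("{}", contains_emoji_subset("Sheet\u{1F600}"));
    println!("{}", contains_emoji_subset("Sheet1"));
}
```

Since `char` ranges are ordinary patterns, the abbreviated range list from the Unicode tool could be pasted into the `matches!` arms more or less directly.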
I have converted chart.rs as well: |
2 more components completed: I am reminded about how when I was young I wanted to build a lego helicopter but I only had horizontal rotation pieces for wheels and I didn't have a vertical rotation piece I could use for the rotors. So instead I came up with ways of using the horizontal piece vertically. That is what this exercise is starting to feel like. Anyway, I'm almost there. I'll finish off the last regex replacement in the formula.rs file and then we will see if this refactoring was worth it in terms of compilation time/size. |
Sorry I haven't answered, I'm going to check out the changes thoroughly in a couple of days.
In the meantime you can check out how the regex crate matches emojis and other similar stuff (they just use a table): https://github.com/rust-lang/regex/blob/ab88aa5c6824ebe7c4b4c72fe5191681783b3a68/regex-syntax/src/unicode_tables/property_bool.rs#L4419
Also, to match a lot of fixed strings very efficiently the regex crate uses Aho-Corasick. rust_xlsxwriter already has this crate as a dependency due to using regex, but it is lighter weight (it just matches fixed strings).
You can also try to use:
if matches!(haystack, "some_string_1" | "some_string_2" | "some_string_3" | ... | "some_last_string") {
    // ...
}
Edit: never mind, the |
I'm writing some code suggestions in some of the commits in the |
I just took a look at |
Sounds good. I hope to get my side of the changes done by Thursday/Friday. You can jump in then. |
Agreed. I had a look at it and it is more or less perfect. I'll add it as a test although I may just go ahead and write a formula parser anyway. There is another area where I would need that in the future so it should be worth the effort. |
I've pushed the last piece of the refactoring for the formula.rs module: Note, this needs some refactoring and some optimization which I will work on later, so watch out for a force push. Anyway, the
cargo clean
sleep 2
time cargo build
# v0.74.0:
real 0m8.531s
user 0m13.689s
sys 0m2.363s
# no_regex branch:
real 0m8.070s
user 0m13.685s
sys 0m2.330s
The hello_world exe size is halved (although this is really just a delta, which should be the same for an app of any size):
@adriandelgado or @dodomorandi could you maybe test as well to see if you get similar results? Also, @adriandelgado could you check if I got the OnceLock initialization right? It works, but I don't know if it should or could be global: https://github.com/jmcnamara/rust_xlsxwriter/blob/no_regex/src/formula.rs#L1048 |
Maybe the build time didn't change significantly because cargo was already compiling the regex crate in parallel with some other crate. The executable size reduction is significant though. Also the usage of pure string manipulation opens the door for more optimizations in the future. About static variables: They are always "global" but not always accessible. About OnceLock usage: you can use In formula.rs line 973: Careful with Also, It seems like you didn't need to use |
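To make the point about statics concrete, here is a minimal sketch of a function-local `OnceLock`. The static lives for the whole program (so it is "global" in lifetime) but can only be named inside the function. The helper names and the three-entry list are assumptions for illustration and are not the code in formula.rs; `_xlfn.` is the prefix Excel uses for its "future" functions.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

/// Hypothetical accessor: builds the set on first use and reuses it afterwards.
fn future_functions() -> &'static HashSet<&'static str> {
    // Program-wide lifetime, but only visible inside this function.
    static FUNCTIONS: OnceLock<HashSet<&'static str>> = OnceLock::new();
    // Illustrative subset only; the real list is much longer.
    FUNCTIONS.get_or_init(|| HashSet::from(["CONCAT", "IFS", "XLOOKUP"]))
}

/// Hypothetical helper: adds the `_xlfn.` prefix to a known future function.
fn escape_function_name(name: &str) -> String {
    if future_functions().contains(name) {
        format!("_xlfn.{name}")
    } else {
        name.to_string()
    }
}

fn main() {
    println!("{}", escape_function_name("XLOOKUP"));
    println!("{}", escape_function_name("SUM"));
}
```

`get_or_init` runs the closure at most once even when called from several threads, so later calls are just a lookup in the already built set.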
Got it, thanks.
Good catch. That was a bug. I suppose that I could use
Yes, there were a few edge cases that meant that parsing was better than raw match/replace. I had avoided doing this previously (in the other language versions too) but overall it is a better solution (if I got it right). I've made those changes and some others and force pushed to main: I also need to make some doc changes since some of the formula APIs are no longer necessary now. |
I fixed the emoji match issue like this: https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/utility.rs#L551 The big match may be inefficient (I don't know how the compiler handles cases like this) but it is in a function that is rarely called so it doesn't need optimization. There may be emoji edge cases that I am missing but if there are then the comparison will fail in safe mode. So, in short it is good enough. That is the last of the work on this so I will merge it back to main. @adriandelgado I am missing one of your suggested optimizations and the other one I probably won't use. If you want to submit a PR for that please do. I will leave this on main for about a week while I work on another feature and then I will publish it. Thanks for the input to date. |
Wow, this looks like very nice work indeed! Thank you all for improving the crate! @jmcnamara I confirm that I see comparable improvements in build size (which is impressive to be honest). I was briefly looking at the changes, and maybe I have a suggestion, but keep in mind that it is just a theoretical thing -- it probably does not matter at all unless it shows up as relevant when profiling and benchmarking. That said: the On the other hand, if you are able to see some parts of the code that occupy a considerable amount of space in a flamegraph (maybe using the examples, I don't know), it could be worth focusing on those parts. In any case, nice work indeed! ❤️ |
@dodomorandi Thanks for the feedback.
@adriandelgado pointed that out too with a suggested fix: 6aee84e#comments I've merged that upstream with Adrian as the author. Also, I've merged everything to main. |
I've released this in |
Feature Request
While it is true that Rust has one of the fastest Regex libraries available, nothing beats pure string manipulation. In other languages like JavaScript and Python this is not feasible because manipulating strings directly would be too slow.
While reading #106 it occurred to me that this crate would be even faster, quicker to compile, and would support all of the features that the current regex feature flag supports without bloating the resulting binaries if we translate every regex usage to pure Rust code.
In Rust, there are only two disadvantages of doing this:
Number one is easy to solve because I would volunteer to do the work. But number two depends on your judgement.
I think it is worth it to reduce the dependency count and the compilation times, and to increase the runtime performance. I also think that the current regexes do not change very often because they've had time to mature after all these years. That means it is not much of a maintainability burden, in my opinion.
What do you think? Is this something you would be interested in pursuing?