Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request: Potential removal of the Regex crate #108

Closed
adriandelgado opened this issue Aug 25, 2024 · 23 comments
Closed

feature request: Potential removal of the Regex crate #108

adriandelgado opened this issue Aug 25, 2024 · 23 comments
Assignees
Labels
enhancement New feature or request in progress Currently under investigation or development

Comments

@adriandelgado
Copy link
Contributor

adriandelgado commented Aug 25, 2024

Feature Request

While it is true that Rust has one of the fastest Regex libraries available, nothing beats pure string manipulation. In other languages like JavaScript and Python this is not feasible because manipulating strings directly would be too slow.

While reading #106 it occurred to me that this crate would be even faster, quicker to compile, and would support all of the features that the current regex feature flag supports without bloating the resulting binaries if we translate every regex usage to pure Rust code.

In Rust, there's only two disadvantages of doing this:

  • The initial development time-cost
  • The difference in maintainability compared to the other libraries you maintain (they would still be using regexes)

Number one its easy to solve because I would volunteer the work to do it. But number two depends on your judgement.

I think it is worth it to reduce the dependency count, the compilation times, and to increase the runtime performance. I also think that the current regexes do not change very often because they've had time to mature after all these years. That means is not that much of a maintainability burden in my opinion.

What do you think? Is this something you would be interested in pursuing?

@adriandelgado
Copy link
Contributor Author

Vanilla Rust is enough but I also propose the usage of the winnow crate to produce clearer code.

@jmcnamara
Copy link
Owner

jmcnamara commented Aug 25, 2024

While reading #106 it occurred to me that this crate would be even faster, quicker to compile, and would support all of the features that the current regex feature flag supports without bloating the resulting binaries if we translate every regex usage to pure Rust code.

This thought has crossed my mind. The C version of the library (libxlsxwriter) doesn't use regexes and instead uses reasonably simple hand rolled parsing code. So it is feasible.

However, just to take a step back. What savings in compile time and binary size would a change like this give us? Is there a way to estimate that without mocking out the regex code?

Also, what about a middle ground of using regex-lite from #106 and where I add workarounds for the Unicode limitations?

And finally, regex or regex-lite could still be used as a dev-dependency for the test code, right?

@adriandelgado
Copy link
Contributor Author

adriandelgado commented Aug 25, 2024

There are some improvements just by changing to regex-lite as you can read in their motivation. TLDR: regex-lite compiles around 60% faster and its size is around 80% lighter (471KB less). I can't thing of a way to measure the impact of this proposed change to this crate specifically without implementing it for real.

Right now, we are using the default features of the regex crate: "std", "perf", "unicode", "regex-syntax/default". The above benchmark only used the "std" feature.

I would expect that the compilation times and binary size reductions would be even greater if we use hand rolled code. Also, regex-lite is not as fast as regex, while hand rolled code would be. Because of those points I would not recommend the middle ground. The standard library has a lot of care when dealing with Unicode stuff so I would not worry about those limitations if we were to go ahead with my proposal.

About dev dependencies, we can do as we please haha. There's no impact to user facing code. I recommend keeping the regex crate in that case to avoid any edge cases.

@adriandelgado
Copy link
Contributor Author

adriandelgado commented Aug 25, 2024

Check out these benchmarks also: https://github.com/BurntSushi/rebar
regex-lite is the second fastest to compile but has the slowest runtime performance, way slower than regex.

@jmcnamara
Copy link
Owner

jmcnamara commented Aug 25, 2024

Also, regex-lite is not as fast as regex

That isn't a major concern since the regexes aren't used on the fast path. However overall I think you are right that if we are going to replace it with something then it would be best to remove it altogether.

Check out these benchmarks also: https://github.com/BurntSushi/rebar

Lol. Some work companions wrote/work on one of the better libraries in that benchmark.

I can't think of a way to measure the impact of this proposed change to this crate specifically without implementing it for real.

Probably it is just worth making the changes and seeing what the resulting size/time difference is.

Some of the regexes are simple and can be replaced with String/str starts_with(), contains() or matches() but some will be trickier.

Do you want me to kick it off on a branch and start with some of the lower hanging regexes or do you just want to jump in?

@adriandelgado
Copy link
Contributor Author

I prefer If you start with the low hanging ones. I plan to start working on the others this thursday because I have some deadlines before that.

Repository owner deleted a comment Aug 26, 2024
Repository owner deleted a comment Aug 26, 2024
Repository owner deleted a comment Aug 26, 2024
Repository owner deleted a comment Aug 26, 2024
Repository owner deleted a comment Aug 26, 2024
Repository owner deleted a comment Aug 26, 2024
@jmcnamara jmcnamara self-assigned this Aug 26, 2024
@jmcnamara jmcnamara added enhancement New feature or request in progress Currently under investigation or development labels Aug 26, 2024
jmcnamara added a commit that referenced this issue Aug 26, 2024
@jmcnamara
Copy link
Owner

jmcnamara commented Aug 26, 2024

@adriandelgado I've created a branch called no_regex and I've started to work through the files/modules that use regexes one by one. I'll add update as I go:

If you want to review as I go then please do.

jmcnamara added a commit that referenced this issue Aug 26, 2024
jmcnamara added a commit that referenced this issue Aug 26, 2024
@jmcnamara
Copy link
Owner

jmcnamara commented Aug 26, 2024

Here is the replacement for the xmlwriter:

I'm starting to regret this now. :-)

Update: also for the lib tests:

jmcnamara added a commit that referenced this issue Aug 26, 2024
jmcnamara added a commit that referenced this issue Aug 27, 2024
@jmcnamara
Copy link
Owner

jmcnamara commented Aug 27, 2024

I've converted the utility.rs file as well. It is starting to take shape:

I had to comment out 2 tests for worksheet name quoting that contain emoji characters. These aren't very important since the default will be to quote the worksheet names that contain the emoji characters. While this isn't strictly correct it isn't an error in Excel.

To work around this would require a match for all the emoji Unicode characters:

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AEmoji%3DYes%3A%5D&esc=on&g=&i=

This is 1,424 code points to match against. Some (many) could be condensed into ranges (the above tool does that with "Abbreviate"):

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AEmoji%3DYes%3A%5D&abb=on&esc=on&g=&i=

@adriandelgado can you think of an efficient way of matching that number of characters or ranges? It doesn't need to be super efficient because it is used in a function that won't be called often (or at all). Maybe a match statement will be sufficient.

A harder problem will be escaping the Excel "future" functions:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/formula.rs#L995

These could be on a fast path when writing a lot of formulas so I need to take care to make it efficient.

@jmcnamara
Copy link
Owner

I have converted chart.rs as well:

jmcnamara added a commit that referenced this issue Aug 27, 2024
jmcnamara added a commit that referenced this issue Aug 27, 2024
@jmcnamara
Copy link
Owner

2 more components completed:

I am reminded about how when I was young I wanted to build a lego helicopter but I only had horizontal rotation pieces for wheels and I didn't have a vertical rotation piece I could use for the rotors. So instead I came up with ways of using the horizontal piece vertically. That is what this exercise is starting to feel like.

Anyway, I'm almost there. I'll finish off the last regex replacement in the formula.rs file and then we will see if this refactoring was worth it in terms of compilation time/size.

@adriandelgado
Copy link
Contributor Author

adriandelgado commented Aug 27, 2024

Sorry I haven't answered, I'm going to check out the changes thoroughly in a couple of days. In the mean time you can check out how the regex crate matches emojis and other similar stuff (they just use a table) https://github.com/rust-lang/regex/blob/ab88aa5c6824ebe7c4b4c72fe5191681783b3a68/regex-syntax/src/unicode_tables/property_bool.rs#L4419

Also, to match a lot of fixed strings very efficiently the regex crate uses Aho-Corasick. rust_xlsxwriter already has this crate as a dependency due to using regex but it is lighter weight (it just matches fixed strings). You can also try to use:

if matches!(haystack, "some_string_1" | "some_string_2" | "some_string_3" | ... | "some_last_string") {
    // ...
}

Edit: never mind, the matches! suggestion would be somewhat inefficient.

@adriandelgado
Copy link
Contributor Author

I'm writing some code suggestions in some of the commits in the no_regex branch. In the case you accept the changes suggested, I would like to add them myself in a PR.

@adriandelgado
Copy link
Contributor Author

adriandelgado commented Aug 27, 2024

I just took a look at formula.rs and the aho-corasick crate is definitively the way to go. It is very performant and it is also what regex uses underneath each time you build a regex like r"str1|str2|...|strn".

@jmcnamara
Copy link
Owner

I'm writing some code suggestions in some of the commits in the no_regex branch. In the case you accept the changes suggested, I would like to add them myself in a PR.

Sounds good. I hope to get my side of the changes done by Thursday/Friday. You can jump in then.

@jmcnamara
Copy link
Owner

I just took a look at formula.rs and the aho-corasick crate is definitively the way to go. It is very performant and it is also what regex uses underneath each time you build a regex like r"str1|str2|...|strn".

Agreed. I had a look at it and it is more or less perfect. I'll add it as a test although I may just go ahead and write a formula parser anyway. There is another area where I would need that in the future so it should be worth the effort.

jmcnamara added a commit that referenced this issue Aug 29, 2024
@jmcnamara
Copy link
Owner

jmcnamara commented Aug 29, 2024

I've pushed the last piece of the refactoring for the formula.rs module:

Note, this needs some refactoring and some optimization which I will work on later, so watch out for a force push.

Anyway, the no_regex branch is now regex clean. The build times look more or less the same:

cargo clean

sleep 2

time cargo build

# v0.74.0:
real    0m8.531s
user    0m13.689s
sys     0m2.363s

# no_regex branch:
real    0m8.070s
user    0m13.685s
sys     0m2.330s

The hello_world exe size is halved (although this is really just a delta which should be the same any sized app):

v0.74.0 no_regex
Debug 9.2M 4.2M
Release 3.4M 1.6M

@adriandelgado or @dodomorandi could you maybe test as well to see if you get similar results.

Also, @adriandelgado could you check if I got the OnceLock initialization right. It works but if don't know if it should or could be global: https://github.com/jmcnamara/rust_xlsxwriter/blob/no_regex/src/formula.rs#L1048

@adriandelgado
Copy link
Contributor Author

Maybe the build time didn't change significantly because cargo was already compiling the regex crate parallel to some other crate. The executable size reduction is significant though. Also the usage of pure string manipulation opens the door for more optimizations in the future.

About static variables: They are always "global" but not always accesible. FUTURE_FUNCTIONS is only possible to be referenced from the future_functions fn but it is stored in the binary as any other global. Just the ability to be referenced/named is scoped. The way you are using it is fine.

About OnceLock usage: you can use .get(function) right after calling get_or_init. There's no need to unwrap later because get_or_init always returns a valid reference. The closure inside get_or_init is guarantied to be called only once.

In formula.rs line 973: Careful with formula.chars().enumerate(). You most likely want to use char_indices. An unicode char could span several bytes in a utf-8 string.

Also, It seems like you didn't need to use aho-corasick. Good stuff! 👍

jmcnamara added a commit that referenced this issue Aug 29, 2024
@jmcnamara
Copy link
Owner

jmcnamara commented Aug 29, 2024

There's no need to unwrap later because get_or_init always returns a valid reference.

Got it, thanks.

In formula.rs line 973: Careful with formula.chars().enumerate(). You most likely want to use char_indices. An unicode char could span several bytes in a utf-8 string.

Good catch. That was a bug. I suppose that I could use formula.as_bytes() and iterate over it as well. I don't know if you can still use the nice Char::is_alphanumic() functions then though.

Also, It seems like you didn't need to use aho-corasick. Good stuff!

Yes, there were a few edge case that meant that parsing was better than raw match/replace. I had avoided doing this previously (in the other language versions too) but overall it is a better solution (if I got it right).

I'm made those changes and some others and forces pushed to main:

I also need to make some doc changes since some of the formula APIs are no longer necessary now.

jmcnamara added a commit that referenced this issue Aug 30, 2024
jmcnamara added a commit that referenced this issue Aug 30, 2024
jmcnamara added a commit that referenced this issue Aug 30, 2024
@jmcnamara
Copy link
Owner

jmcnamara commented Aug 30, 2024

I fixed the emoji match issue like this:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/utility.rs#L551

The big match may be inefficient (I don't know how the compiler handles cases like this) but it is in a function that is rarely called so it doesn't need optimization.

There may be emoji edge cases that I am missing but if there are then the comparison will fail in safe mode.

So, in short it is good enough.

That is the last of the work on this so I will merge it back to main. @adriandelgado I am missing one of your suggested optimizations and the other one I probably won't use. If you want to submit a PR for that please do.

I will leave this on main for about a week while I work on another feature and then I will publish it.

Thanks for the input to date.

@dodomorandi
Copy link
Contributor

Wow, looks like a very nice work indeed! Thank you all for improving the crate!

@jmcnamara I confirm that I see comparable improvements in build size (which is impressive to be honest).

I was briefly looking at the changes, and maybe I have a suggestion, but keep in mind that it is just a theoretical thing -- it probably does not matter at all if it is not anything relevant when profiling and benchmarking. Said that: the FUTURE_FUNCTIONS probably could gain some performance using a perfect hash function, and maybe it could also help is_emoji (but I am less sure in that case, and as you already said it is probably used rarely). And another little thing that caught my eye is is_escaped, that could iterate through url one single time instead of multiple times. As I said, these cases could be irrelevant to a real benchmark, so maybe it makes more sense to check if it makes sense to try alternative approaches.

On the other hand, if you are able to see some parts of the code that occupy a considerable amount of space in a flamegraph (maybe using the examples, don't know), it could be worth to focus on these parts. In any case, nice work indeed! ❤️

jmcnamara pushed a commit that referenced this issue Aug 30, 2024
@jmcnamara
Copy link
Owner

jmcnamara commented Aug 30, 2024

@dodomorandi Thanks for the feedback.

And another little thing that caught my eye is is_escaped, that could iterate through url one single time instead of multiple times.

@adriandelgado pointed that out too with a suggested fix: 6aee84e#comments

I've merged that upstream with Adrian as the author.

Also, I've merged everything to main.

jmcnamara added a commit that referenced this issue Aug 31, 2024
jmcnamara added a commit that referenced this issue Aug 31, 2024
@jmcnamara
Copy link
Owner

I've released this in rust_xlsxwriter v0.75.0. Thanks for the input.

@jmcnamara jmcnamara mentioned this issue Sep 2, 2024
42 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request in progress Currently under investigation or development
Projects
None yet
Development

No branches or pull requests

4 participants
@jmcnamara @dodomorandi @adriandelgado and others