Increased runtime memory usage in 1.9 #1027

hamiltop · 2023-07-07T15:54:20Z

What version of regex are you using?

1.9

Describe the bug at a high level.

The upgrade from 1.8.4 to 1.9 significantly increased runtime memory usage. (375MB vs 175MB)

Heap analysis suggests that this is due to regex_automata::nfa::thompson::backtrack::Cache::new::hc3a7922de221986c. (900 allocations at 256kb each, which a little more than the 200MB difference). Here's the full stack trace for those allocations:

alloc::alloc::Global::alloc_impl::hc0d2fbcc9e259c41	
_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::hff80cfdf399332ed	
alloc::raw_vec::finish_grow::h1d4f4e7469250062	
alloc::raw_vec::RawVec$LT$T$C$A$GT$::grow_amortized::hb0b597f5657dcfe2	
alloc::raw_vec::RawVec$LT$T$C$A$GT$::reserve::do_reserve_and_handle::hafd8f04576418cba	
alloc::vec::Vec$LT$T$C$A$GT$::reserve::h9ddd934bc4485398	
alloc::vec::Vec$LT$T$C$A$GT$::extend_with::h866264405059bb08	
alloc::vec::Vec$LT$T$C$A$GT$::resize::h22b6e20adc554f2b	
regex_automata::nfa::thompson::backtrack::Visited::reset::hcd3ecf9f7bc5e143	
regex_automata::nfa::thompson::backtrack::Visited::new::h28c46b03d50c9004	
regex_automata::nfa::thompson::backtrack::Cache::new::hc3a7922de221986c	
regex_automata::nfa::thompson::backtrack::BoundedBacktracker::create_cache::heb005f34fc005666	
regex_automata::meta::wrappers::BoundedBacktrackerCache::new::_$u7b$$u7b$closure$u7d$$u7d$::h66c74de55ee73b0e	
core::option::Option$LT$T$GT$::map::hbfd3f56f2a53617b	
regex_automata::meta::wrappers::BoundedBacktrackerCache::new::h6625b6730d32ff87	
regex_automata::meta::wrappers::BoundedBacktracker::create_cache::h2c635eefad95c3ac	
_$LT$regex_automata..meta..strategy..Core$u20$as$u20$regex_automata..meta..strategy..Strategy$GT$::create_cache::haf98298fa360833d	
regex_automata::meta::regex::Builder::build_many_from_hir::_$u7b$$u7b$closure$u7d$$u7d$::ha3b32a0a124f1e6c	
_$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..Fn$LT$Args$GT$$GT$::call::h2983e1b0fb6fda94	
regex_automata::util::pool::inner::Pool$LT$T$C$F$GT$::get_slow::hd0a3e0958d9d9787	
regex_automata::util::pool::inner::Pool$LT$T$C$F$GT$::get::hc70ab56136e5ccdb	
regex_automata::util::pool::Pool$LT$T$C$F$GT$::get::h37bc82ca85a795cf	
regex_automata::meta::regex::Regex::search_half::haad5eaf00688f9e3	
regex::regex::string::Regex::is_match_at::hefc78212cd6a6e5e	
regex::regex::string::Regex::is_match::h978417c482a3b499	
_$LT$uaparser..parser..os..Matcher$u20$as$u20$uaparser..SubParser$GT$::try_parse::h3fac1011bb0e5792	
_$LT$uaparser..parser..UserAgentParser$u20$as$u20$uaparser..Parser$GT$::parse_os::_$u7b$$u7b$closure$u7d$$u7d$::h034f173e256d694c	
_$LT$core..slice..iter..Iter$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::find_map::h9db4b2cdc08ef062	
_$LT$uaparser..parser..UserAgentParser$u20$as$u20$uaparser..Parser$GT$::parse_os::h102f7f7facbe3882

What are the steps to reproduce the behavior?

use std::{thread, vec};

use once_cell::sync::Lazy;
use rand::distributions::Alphanumeric;
use rand::{thread_rng, Rng};
use uaparser::{Parser, UserAgentParser};

pub static USER_AGENT_PARSER: Lazy<UserAgentParser> =
    Lazy::new(|| UserAgentParser::from_bytes(include_bytes!("../regexes.yml")).expect("error constructing parser"));

fn main() {
    let uap = &USER_AGENT_PARSER;
    let mut handles = vec![];
    for _ in 0..10 {
        let handle = thread::spawn(move || {
            let mut android_count = 0;
            for _ in 0..100_000 {
                let rand_string: String = thread_rng()
                    .sample_iter(&Alphanumeric)
                    .take(30)
                    .map(char::from)
                    .collect();
                let os = uap.parse_os(&rand_string);
                if os.family == "Android" {
                    android_count += 1;
                }
            }
            println!("Android count: {}", android_count);
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

with dependencies:

[dependencies]
once_cell= "1.17.1"
uaparser = { version = "0.6.0", default-features = false }
rand = "0.8.5"
regex = "1.9"

and the regexes.yml file can be found here: https://gist.github.com/hamiltop/b010fb9e0189210796c526144b27bd99

What is the actual behavior?

On regex 1.9 that uses 375MB of memory on my machine.

What is the expected behavior?

On regex 1.8.4 that uses 175MB of memory on my machine.

The text was updated successfully, but these errors were encountered:

This fixes a memory usage regression where the backtracker would eagerly allocate its entire capacity up-front. In this case, it meant a minimum of 256KB for every regex. Prior to regex 1.9, 256KB was treated as a *maximum*, and we only allocated what we needed. We migrate that strategy to regex-automata now as well. This probably does come with a latency cost (I'll run rebar to be sure it isn't horrendous), but we definitely can't be eagerly allocating 256KB for every regex. If the latency ends up being an issue, we can investigate fixing that in other ways. Fixes #1027

BurntSushi · 2023-07-07T17:46:33Z

This is fixed in regex 1.9.1 on crates.io.

BurntSushi · 2023-07-07T17:48:05Z

Also, the uaparser crate doesn't seem to be compiling regexes in the most space efficient way possible. Just changing \d and \s to (?-u:\d) and (?-u:\s) got me a decent chunk of memory savings. Probably changing . to \p{ascii} would give you another bit of savings. (You could use (?-u:.), but that can match invalid UTF-8, so you'd have to switch to regex::bytes::Regex. Which is likely doable, but is probably also a bigger change. And if you do that, you might as well leave the regexes as they are just disable Unicode mode entirely via RegexBuilder.)

BurntSushi · 2023-07-07T17:49:42Z

48 hours to get the first regression report after releasing a complete rewrite of a regex engine. Not bad. Haha. I was expecting more sooner.

Thanks for the report. :-)

hamiltop · 2023-07-07T18:01:21Z

Thanks for the fast fix! I'm honored to be first.

Re: uaparser there's definitely some room for improvement there and I want to jump in and try to optimize it at some point. In addition to your recommendations, I want to give RegexSet a try. Per https://docs.rs/regex/latest/regex/struct.RegexSet.html#limitations it sounds like I would compile a RegexSet with all of them and then do the capture in a second pass with the one that matched. It will be an interesting experiment at least.

aengelas · 2023-07-07T18:01:25Z

Thank YOU for such a fast response (plus some bonus regex optimization tips)! I'm on the team with Peter that hit this issue, and it's pretty great to have a such a fast response. Appreciate all your work in the ecosystem.

BurntSushi · 2023-07-07T18:22:05Z

Thanks for the fast fix! I'm honored to be first.

Re: uaparser there's definitely some room for improvement there and I want to jump in and try to optimize it at some point. In addition to your recommendations, I want to give RegexSet a try. Per https://docs.rs/regex/latest/regex/struct.RegexSet.html#limitations it sounds like I would compile a RegexSet with all of them and then do the capture in a second pass with the one that matched. It will be an interesting experiment at least.

I'd recommend checking out meta::Regex in the new shiny regex-automata. It has native multi-pattern support everywhere, including with the captures. The one caveat is that RegexSet will do overlapping matches where as the meta regex won't. (At least, it won't by default.) So if you need to know if multiple regexes match instead of just the first, then yeah, probably best to stick with RegexSet I think for now.

Thanks for the kind words all. :-)

hamiltop · 2023-07-08T16:21:04Z

If you don't mind, I want to keep digging around on these optimizations. If anything, I wanted to capture the results from testing them all out for others to see.

Background context

ua-parser is a collection of user agent parsers across many platforms all built from the same yaml files. As the yaml file gets updated by the community you can easily pull in the updates. The benefit: we don't have to deal with the ever-changing world of user agents (it's a mess). The downside, we don't have as much control over these regexes.

Outside ua-parser, our specific use case for all of this is just OS detection on mobile devices. We send a link over sms and when they click the link we send them to the appropriate app store. We could do a very naive check for Android or iOS in the user agent, but I've dealt with user agents before and it's never that simple. Custom/skinned browsers on cheap android devices

Optimizations

And if you do that, you might as well leave the regexes as they are just disable Unicode mode entirely via RegexBuilder.

This seems not possible, as the docs say:

Note that if Unicode mode is disabled, then the regex will fail to compile if it could match invalid UTF-8. For example, when Unicode mode is disabled, then since . matches any byte (except for \n), then it can match invalid UTF-8 and thus building a regex from it will fail.

So it seems like we can't disable Unicode via RegexBuilder and leave the regexes as they are. (Indeed, it fails to compile the regex). We'd have to adjust the regexes in all cases (except if we used regex::bytes::Regex).

Data

I did some quick hacks on uaparser to test things out and got some early results on memory usage:

regex 1.9.1 - 110MB
regex::bytes::Regex - 25MB
regex::RegexSet - 2GB
regex::bytes::RegexSet - 600MB
using \p{ascii} instead of . - 45MB
using \p{ascii} and (?-u:\d) and (?-u:\s) - 31MB
regex::RegexSet with improved regexes - 730MB
regex_lite - 15MB but took more than 10x the time

Questions

Is there some other way we could disable unicode without changing the regexes? Or is regex::bytes::Regex the path here?
Is there an easy way to do the shift from . to \p{ascii} accurately and easily? Given the exposure available via meta::Regex that might be something we can do at a higher level than string search and replace.
Similar to the above: The regexes here are very capable and can capture specific OS versions. We don't actually care that much about those specifics. Is there a way to programmatically tweak the regex to remove unneeded capture groups and other complexity? I would love a single RegexSet for "iOS" and one for "Android", but memory usage when naively combining everything was just too high. Stripping out the extraneous parts might reduce that memory usage considerably.

BurntSushi · 2023-07-08T17:05:17Z

Is there some other way we could disable unicode without changing the regexes? Or is regex::bytes::Regex the path here?

Yes, regex::bytes::Regex is the right path here.

Technically an alternative path exists where you parse the regexes using regex-syntax and rewrite Unicode classes as ASCII classes. It would be a fair bit of work. Likely much more than using regex::bytes::Regex. I don't know exactly how you're using regexes, but it seems like using regex::bytes::Regex should be pretty simple? But I sense hesitancy from you.

Is there an easy way to do the shift from . to \p{ascii} accurately and easily? Given the exposure available via meta::Regex that might be something we can do at a higher level than string search and replace.

Yeah as I mentioned above you'd have to parse the regex with regex-syntax and then rewrite the Ast to use the thing you want. Then translate to Hir and build a meta::Regex from there. (Or convert the Hir back to a string and build a regex::bytes::Regex.)

Similar to the above: The regexes here are very capable and can capture specific OS versions. We don't actually care that much about those specifics. Is there a way to programmatically tweak the regex to remove unneeded capture groups and other complexity? I would love a single RegexSet for "iOS" and one for "Android", but memory usage when naively combining everything was just too high. Stripping out the extraneous parts might reduce that memory usage considerably.

Yeah again, the way to do this is on an Ast. Note that an Ast is complex.

With that said, I expect it won't help much for a RegexSet. The problem is that the techniques that regex-automata uses for multi-pattern regexes don't scale well. It's a problem with a rather large design space, but it's one I hope to tackle. But we're talking years here.

hamiltop · 2023-07-08T18:06:14Z

No major hesitancy on regex::bytes::Regex, just wanted to understand the full scope before heading down that path.

I'm also working through your recent blog post about the internals and tinkering with regex-cli, mostly just for my own edification. It's a great explanation and demystified a lot of the magic.

Much thanks for your time here.

BurntSushi · 2023-07-08T18:10:41Z

Awesome. Feel free to ask more questions! Even if they are tangential. Opening a new Discussion would be great. :)

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` | --- ### Release Notes <details> <summary>rust-lang/regex (regex)</summary> ### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07) [Compare Source](rust-lang/regex@1.9.0...1.9.1) \================== This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy. Bug fixes: - [BUG #1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive. ### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05) [Compare Source](rust-lang/regex@1.8.4...1.9.0) \================== This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well. In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions: - A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex. - A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`. - `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte. - The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library. - Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate. New features: - [FEATURE #244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`. - [FEATURE #259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`. - [FEATURE #476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`. - [FEATURE #644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave. - [FEATURE #675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`. - [FEATURE #824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access. - [FEATURE #961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times. - [FEATURE #1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type. Performance improvements: - [PERF #68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching. - [PERF #510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan for `@`. - [PERF #787](rust-lang/regex#787), [PERF #891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`. (There are many more performance improvements as well, but not all of them have specific issues devoted to them.) Bug fixes: - [BUG #429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines. - [BUG #517](rust-lang/regex#517): Fix matching bug with capture groups. - [BUG #579](rust-lang/regex#579): Fix matching bug with word boundaries. - [BUG #779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`. - [BUG #850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines. - [BUG #921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`. - [BUG #976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility. - [BUG #1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).  Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957 Reviewed-by: crapStone <crapstone01@gmail.com> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>

BurntSushi mentioned this issue Jul 7, 2023

automata/nfa/backtrack: fix memory usage #1028

Merged

BurntSushi closed this as completed in #1028 Jul 7, 2023

hamiltop mentioned this issue Jul 9, 2023

Add configurability for unicode and skipping parsers oceanlewis/uap-rs#14

Merged

teor2345 mentioned this issue Jul 10, 2023

build(deps): bump regex from 1.8.4 to 1.9.1 ZcashFoundation/zebra#7189

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increased runtime memory usage in 1.9 #1027

Increased runtime memory usage in 1.9 #1027

hamiltop commented Jul 7, 2023

BurntSushi commented Jul 7, 2023

BurntSushi commented Jul 7, 2023 •

edited

Loading

BurntSushi commented Jul 7, 2023

hamiltop commented Jul 7, 2023

aengelas commented Jul 7, 2023

BurntSushi commented Jul 7, 2023

hamiltop commented Jul 8, 2023

BurntSushi commented Jul 8, 2023

hamiltop commented Jul 8, 2023

BurntSushi commented Jul 8, 2023

Increased runtime memory usage in 1.9 #1027

Increased runtime memory usage in 1.9 #1027

Comments

hamiltop commented Jul 7, 2023

What version of regex are you using?

Describe the bug at a high level.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

BurntSushi commented Jul 7, 2023

BurntSushi commented Jul 7, 2023 • edited Loading

BurntSushi commented Jul 7, 2023

hamiltop commented Jul 7, 2023

aengelas commented Jul 7, 2023

BurntSushi commented Jul 7, 2023

hamiltop commented Jul 8, 2023

Background context

Optimizations

Data

Questions

BurntSushi commented Jul 8, 2023

hamiltop commented Jul 8, 2023

BurntSushi commented Jul 8, 2023

BurntSushi commented Jul 7, 2023 •

edited

Loading