Increased runtime memory usage in 1.9 #1027

Closed
hamiltop opened this issue Jul 7, 2023 · 10 comments · Fixed by #1028

Comments

@hamiltop commented Jul 7, 2023

What version of regex are you using?

1.9

Describe the bug at a high level.

The upgrade from 1.8.4 to 1.9 significantly increased runtime memory usage (375MB vs. 175MB).

Heap analysis suggests that this is due to regex_automata::nfa::thompson::backtrack::Cache::new (900 allocations at 256KB each, which is a little more than the 200MB difference). Here's the full stack trace for those allocations:

alloc::alloc::Global::alloc_impl
<alloc::alloc::Global as core::alloc::Allocator>::allocate
alloc::raw_vec::finish_grow
alloc::raw_vec::RawVec<T,A>::grow_amortized
alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
alloc::vec::Vec<T,A>::reserve
alloc::vec::Vec<T,A>::extend_with
alloc::vec::Vec<T,A>::resize
regex_automata::nfa::thompson::backtrack::Visited::reset
regex_automata::nfa::thompson::backtrack::Visited::new
regex_automata::nfa::thompson::backtrack::Cache::new
regex_automata::nfa::thompson::backtrack::BoundedBacktracker::create_cache
regex_automata::meta::wrappers::BoundedBacktrackerCache::new::{{closure}}
core::option::Option<T>::map
regex_automata::meta::wrappers::BoundedBacktrackerCache::new
regex_automata::meta::wrappers::BoundedBacktracker::create_cache
<regex_automata::meta::strategy::Core as regex_automata::meta::strategy::Strategy>::create_cache
regex_automata::meta::regex::Builder::build_many_from_hir::{{closure}}
<alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
regex_automata::util::pool::inner::Pool<T,F>::get_slow
regex_automata::util::pool::inner::Pool<T,F>::get
regex_automata::util::pool::Pool<T,F>::get
regex_automata::meta::regex::Regex::search_half
regex::regex::string::Regex::is_match_at
regex::regex::string::Regex::is_match
<uaparser::parser::os::Matcher as uaparser::SubParser>::try_parse
<uaparser::parser::UserAgentParser as uaparser::Parser>::parse_os::{{closure}}
<core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::find_map
<uaparser::parser::UserAgentParser as uaparser::Parser>::parse_os

What are the steps to reproduce the behavior?

use std::{thread, vec};

use once_cell::sync::Lazy;
use rand::distributions::Alphanumeric;
use rand::{thread_rng, Rng};
use uaparser::{Parser, UserAgentParser};

pub static USER_AGENT_PARSER: Lazy<UserAgentParser> =
    Lazy::new(|| UserAgentParser::from_bytes(include_bytes!("../regexes.yml")).expect("error constructing parser"));

fn main() {
    let uap = &USER_AGENT_PARSER;
    let mut handles = vec![];
    for _ in 0..10 {
        let handle = thread::spawn(move || {
            let mut android_count = 0;
            for _ in 0..100_000 {
                let rand_string: String = thread_rng()
                    .sample_iter(&Alphanumeric)
                    .take(30)
                    .map(char::from)
                    .collect();
                let os = uap.parse_os(&rand_string);
                if os.family == "Android" {
                    android_count += 1;
                }
            }
            println!("Android count: {}", android_count);
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

with dependencies:

[dependencies]
once_cell = "1.17.1"
uaparser = { version = "0.6.0", default-features = false }
rand = "0.8.5"
regex = "1.9"

and the regexes.yml file can be found here: https://gist.github.com/hamiltop/b010fb9e0189210796c526144b27bd99

What is the actual behavior?

On regex 1.9, the program above uses 375MB of memory on my machine.

What is the expected behavior?

On regex 1.8.4, the same program uses 175MB of memory on my machine.

BurntSushi added a commit that referenced this issue Jul 7, 2023
This fixes a memory usage regression where the backtracker would eagerly
allocate its entire capacity up-front. In this case, it meant a minimum
of 256KB for every regex.

Prior to regex 1.9, 256KB was treated as a *maximum*, and we only
allocated what we needed. We migrate that strategy to regex-automata now
as well. This probably does come with a latency cost (I'll run rebar to
be sure it isn't horrendous), but we definitely can't be eagerly
allocating 256KB for every regex. If the latency ends up being an issue,
we can investigate fixing that in other ways.

Fixes #1027
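
To illustrate the difference between the two strategies, here is a rough sketch (my own illustration, not the actual regex-automata code; the type names and the 256KB constant are stand-ins): the visited set grows on demand up to a cap instead of allocating the full cap up front.

```rust
// Illustrative sketch only; not the real regex-automata internals.
const MAX_VISITED_CAPACITY: usize = 256 * 1024; // hypothetical 256KB cap

struct Visited {
    bitset: Vec<u8>,
}

impl Visited {
    // Lazy strategy: nothing is allocated until a search actually runs.
    fn new() -> Visited {
        Visited { bitset: Vec::new() }
    }

    // Grow only to what the current search needs, never past the cap. The
    // eager strategy that regressed would effectively do
    // `self.bitset.resize(MAX_VISITED_CAPACITY, 0)` here, regardless of need.
    fn reset(&mut self, needed: usize) {
        let capped = needed.min(MAX_VISITED_CAPACITY);
        if capped > self.bitset.len() {
            self.bitset.resize(capped, 0);
        }
    }
}
```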
@BurntSushi (Member)

This is fixed in regex 1.9.1 on crates.io.

@BurntSushi (Member) commented Jul 7, 2023

Also, the uaparser crate doesn't seem to be compiling regexes in the most space-efficient way possible. Just changing \d and \s to (?-u:\d) and (?-u:\s) got me a decent chunk of memory savings. Probably changing . to \p{ascii} would give you another bit of savings. (You could use (?-u:.), but that can match invalid UTF-8, so you'd have to switch to regex::bytes::Regex. Which is likely doable, but is probably also a bigger change. And if you do that, you might as well leave the regexes as they are and just disable Unicode mode entirely via RegexBuilder.)
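
For concreteness, a small sketch of both suggestions (the patterns below are made-up stand-ins, not uaparser's real regexes): the first keeps a &str-based Regex and opts \d and \s out of Unicode per class; the second leaves the pattern alone and disables Unicode mode via the builder, which in turn requires regex::bytes::Regex.

```rust
fn main() {
    // Per-class opt-out: \d and \s use their ASCII definitions, everything
    // else stays Unicode-aware, so the &str API still works.
    let re = regex::Regex::new(r"Android (?-u:\d)+(?-u:\s)?").unwrap();
    assert!(re.is_match("Android 13"));

    // Whole-regex opt-out: disable Unicode mode via the builder. Since `.`
    // may then match invalid UTF-8, this needs regex::bytes::Regex.
    let re = regex::bytes::RegexBuilder::new(r"Android \d+; (.*)")
        .unicode(false)
        .build()
        .unwrap();
    assert!(re.is_match(b"Android 13; SM-A515F"));
}
```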

@BurntSushi (Member)

48 hours to get the first regression report after releasing a complete rewrite of a regex engine. Not bad. Haha. I was expecting more sooner.

Thanks for the report. :-)

@hamiltop (Author) commented Jul 7, 2023

Thanks for the fast fix! I'm honored to be first.

Re: uaparser there's definitely some room for improvement there and I want to jump in and try to optimize it at some point. In addition to your recommendations, I want to give RegexSet a try. Per https://docs.rs/regex/latest/regex/struct.RegexSet.html#limitations it sounds like I would compile a RegexSet with all of them and then do the capture in a second pass with the one that matched. It will be an interesting experiment at least.
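
A minimal sketch of that two-pass idea (the patterns are hypothetical stand-ins for the uaparser ones): the RegexSet answers "which pattern matched?", and only that one Regex is re-run to get capture groups.

```rust
use regex::{Regex, RegexSet};

fn main() {
    let patterns = [r"Android (\d+)", r"iPhone OS (\d+)"];
    let set = RegexSet::new(patterns).unwrap();
    let regexes: Vec<Regex> =
        patterns.iter().map(|p| Regex::new(p).unwrap()).collect();

    let ua = "Mozilla/5.0 (Linux; Android 13; Pixel 7)";
    // Pass 1: find the first pattern in the set that matches.
    if let Some(i) = set.matches(ua).iter().next() {
        // Pass 2: re-run only that regex to extract its capture groups.
        let caps = regexes[i].captures(ua).unwrap();
        println!("matched pattern {} with version {}", i, &caps[1]);
    }
}
```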

@aengelas commented Jul 7, 2023

Thank YOU for such a fast response (plus some bonus regex optimization tips)! I'm on the team with Peter that hit this issue, and it's pretty great to get a fix this quickly. Appreciate all your work in the ecosystem.

@BurntSushi (Member)

> Thanks for the fast fix! I'm honored to be first.
>
> Re: uaparser there's definitely some room for improvement there and I want to jump in and try to optimize it at some point. In addition to your recommendations, I want to give RegexSet a try. Per https://docs.rs/regex/latest/regex/struct.RegexSet.html#limitations it sounds like I would compile a RegexSet with all of them and then do the capture in a second pass with the one that matched. It will be an interesting experiment at least.

I'd recommend checking out meta::Regex in the new shiny regex-automata. It has native multi-pattern support everywhere, including with the captures. The one caveat is that RegexSet will do overlapping matches whereas the meta regex won't. (At least, it won't by default.) So if you need to know whether multiple regexes match instead of just the first, then yeah, it's probably best to stick with RegexSet for now.
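
For reference, a minimal sketch of that multi-pattern flow with regex_automata::meta::Regex (regex-automata 0.3); the patterns are hypothetical stand-ins:

```rust
use regex_automata::meta::Regex;

fn main() {
    let re = Regex::new_many(&[r"Android (\d+)", r"iPhone OS (\d+)"]).unwrap();
    let mut caps = re.create_captures();

    re.captures("Mozilla/5.0 (Linux; Android 13; Pixel 7)", &mut caps);
    if caps.is_match() {
        // Which pattern matched (the first match only, not overlapping)...
        let pid = caps.pattern().unwrap();
        // ...and the span of its first capture group.
        let version = caps.get_group(1).unwrap();
        println!("pattern {:?} matched; version span {:?}", pid, version);
    }
}
```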

Thanks for the kind words all. :-)

@hamiltop (Author) commented Jul 8, 2023

If you don't mind, I want to keep digging around on these optimizations. If nothing else, I wanted to capture the results from testing them all out for others to see.

Background context

ua-parser is a collection of user agent parsers across many platforms, all built from the same YAML files. As the YAML files get updated by the community, you can easily pull in the updates. The benefit: we don't have to deal with the ever-changing world of user agents (it's a mess). The downside: we don't have as much control over these regexes.

Outside of ua-parser, our specific use case for all of this is just OS detection on mobile devices. We send a link over SMS, and when the recipient clicks the link we send them to the appropriate app store. We could do a very naive check for Android or iOS in the user agent, but I've dealt with user agents before and it's never that simple. Custom/skinned browsers on cheap Android devices are a good example of why not.

Optimizations

> And if you do that, you might as well leave the regexes as they are and just disable Unicode mode entirely via RegexBuilder.

This doesn't seem possible, as the docs say:

> Note that if Unicode mode is disabled, then the regex will fail to compile if it could match invalid UTF-8. For example, when Unicode mode is disabled, then since . matches any byte (except for \n), then it can match invalid UTF-8 and thus building a regex from it will fail.

So it seems like we can't disable Unicode via RegexBuilder and leave the regexes as they are. (Indeed, it fails to compile the regex). We'd have to adjust the regexes in all cases (except if we used regex::bytes::Regex).
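
A quick sketch of that distinction (the pattern is a made-up example containing a bare .): with Unicode disabled, the &str-based builder rejects it, while the bytes-based builder accepts it.

```rust
fn main() {
    // With Unicode mode off, `.` can match invalid UTF-8, so the &str API
    // refuses to compile the pattern...
    let res = regex::RegexBuilder::new(r"Mac OS X (.+)")
        .unicode(false)
        .build();
    assert!(res.is_err());

    // ...while regex::bytes::Regex, which searches &[u8], accepts it.
    let res = regex::bytes::RegexBuilder::new(r"Mac OS X (.+)")
        .unicode(false)
        .build();
    assert!(res.is_ok());
}
```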

Data

I did some quick hacks on uaparser to test things out and got some early results on memory usage:

  1. regex 1.9.1 - 110MB
  2. regex::bytes::Regex - 25MB
  3. regex::RegexSet - 2GB
  4. regex::bytes::RegexSet - 600MB
  5. using \p{ascii} instead of . - 45MB
  6. using \p{ascii} and (?-u:\d) and (?-u:\s) - 31MB
  7. regex::RegexSet with improved regexes - 730MB
  8. regex_lite - 15MB but took more than 10x the time

Questions

  1. Is there some other way we could disable Unicode without changing the regexes? Or is regex::bytes::Regex the path here?
  2. Is there a way to do the shift from . to \p{ascii} accurately and easily? Given the exposure available via meta::Regex, that might be something we can do at a higher level than string search and replace.
  3. Similar to the above: The regexes here are very capable and can capture specific OS versions. We don't actually care that much about those specifics. Is there a way to programmatically tweak the regex to remove unneeded capture groups and other complexity? I would love a single RegexSet for "iOS" and one for "Android", but memory usage when naively combining everything was just too high. Stripping out the extraneous parts might reduce that memory usage considerably.

@BurntSushi (Member)

> Is there some other way we could disable Unicode without changing the regexes? Or is regex::bytes::Regex the path here?

Yes, regex::bytes::Regex is the right path here.

Technically an alternative path exists where you parse the regexes using regex-syntax and rewrite Unicode classes as ASCII classes. It would be a fair bit of work, likely much more than using regex::bytes::Regex. I don't know exactly how you're using regexes, but it seems like using regex::bytes::Regex should be pretty simple? But I sense hesitancy from you.

> Is there a way to do the shift from . to \p{ascii} accurately and easily? Given the exposure available via meta::Regex, that might be something we can do at a higher level than string search and replace.

Yeah, as I mentioned above, you'd have to parse the regex with regex-syntax and then rewrite the Ast to use the thing you want. Then translate to Hir and build a meta::Regex from there. (Or convert the Hir back to a string and build a regex::bytes::Regex.)
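
A minimal sketch of that pipeline under simplifying assumptions (the pattern is a stand-in, and instead of rewriting the Ast node-by-node it just configures the regex-syntax parser with Unicode off and UTF-8 validity not required): parse to an Hir, then build a meta::Regex from it. A real Ast rewrite (e.g. turning . into \p{ascii}) would sit between parsing and translation.

```rust
use regex_automata::meta;
use regex_syntax::ParserBuilder;

fn main() {
    let pattern = r"Android (\d+)"; // hypothetical uaparser-style pattern

    // Parse straight to an Hir with Unicode mode disabled; utf8(false)
    // permits constructs that could match invalid UTF-8.
    let hir = ParserBuilder::new()
        .unicode(false)
        .utf8(false)
        .build()
        .parse(pattern)
        .unwrap();

    // Build a meta::Regex directly from the Hir.
    let re = meta::Builder::new().build_from_hir(&hir).unwrap();
    assert!(re.is_match("Android 13"));
}
```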

> Similar to the above: The regexes here are very capable and can capture specific OS versions. We don't actually care that much about those specifics. Is there a way to programmatically tweak the regex to remove unneeded capture groups and other complexity? I would love a single RegexSet for "iOS" and one for "Android", but memory usage when naively combining everything was just too high. Stripping out the extraneous parts might reduce that memory usage considerably.

Yeah again, the way to do this is on an Ast. Note that an Ast is complex.

With that said, I expect it won't help much for a RegexSet. The problem is that the techniques that regex-automata uses for multi-pattern regexes don't scale well. It's a problem with a rather large design space, but it's one I hope to tackle. But we're talking years here.

@hamiltop (Author) commented Jul 8, 2023

No major hesitancy on regex::bytes::Regex, just wanted to understand the full scope before heading down that path.

I'm also working through your recent blog post about the internals and tinkering with regex-cli, mostly just for my own edification. It's a great explanation and demystified a lot of the magic.

Many thanks for your time here.

@BurntSushi (Member)

Awesome. Feel free to ask more questions! Even if they are tangential. Opening a new Discussion would be great. :)

crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jul 18, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

This is a patch release which fixes a memory usage regression. In the regex
1.9 release, one of the internal engines used a more aggressive allocation
strategy than what was done previously. This patch release reverts to the
prior on-demand strategy.

Bug fixes:

-   [BUG #1027](rust-lang/regex#1027):
    Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

This release marks the end of a [years long rewrite of the regex crate
internals](rust-lang/regex#656). Since this is
such a big release, please report any issues or regressions you find. We would
also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in
"my regex searches are faster," there have also been a few API additions:

-   A new `Captures::extract` method for quickly accessing the substrings
    that match each capture group in a regex.
-   A new inline flag, `R`, which enables CRLF mode. This makes `.` match any
    Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and
    `(?m:$)` match after and before both `\r` and `\n`, respectively, but never
    between a `\r` and `\n`.
-   `RegexBuilder::line_terminator` was added to further customize the line
    terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
-   The `std` Cargo feature is now actually optional. That is, the `regex` crate
    can be used without the standard library.
-   Because `regex 1.9` may make binary size and compile times even worse, a
    new experimental crate called `regex-lite` has been published. It prioritizes
    binary size and compile times over functionality (like Unicode) and
    performance. It shares no code with the `regex` crate.

New features:

-   [FEATURE #244](rust-lang/regex#244):
    One can opt into CRLF mode via the `R` flag.
    e.g., `(?mR:$)` matches just before `\r\n`.
-   [FEATURE #259](rust-lang/regex#259):
    Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
-   [FEATURE #476](rust-lang/regex#476):
    `std` is now an optional feature. `regex` may be used with only `alloc`.
-   [FEATURE #644](rust-lang/regex#644):
    `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
-   [FEATURE #675](rust-lang/regex#675):
    Anchored search APIs are now available in `regex-automata 0.3`.
-   [FEATURE #824](rust-lang/regex#824):
    Add new `Captures::extract` method for easier capture group access.
-   [FEATURE #961](rust-lang/regex#961):
    Add `regex-lite` crate with smaller binary sizes and faster compile times.
-   [FEATURE #1022](rust-lang/regex#1022):
    Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

-   [PERF #68](rust-lang/regex#68):
    Added a one-pass DFA engine for faster capture group matching.
-   [PERF #510](rust-lang/regex#510):
    Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan
    for `@`.
-   [PERF #787](rust-lang/regex#787),
    [PERF #891](rust-lang/regex#891):
    Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have
specific issues devoted to them.)

Bug fixes:

-   [BUG #429](rust-lang/regex#429):
    Fix matching bugs related to `\B` and inconsistencies across internal engines.
-   [BUG #517](rust-lang/regex#517):
    Fix matching bug with capture groups.
-   [BUG #579](rust-lang/regex#579):
    Fix matching bug with word boundaries.
-   [BUG #779](rust-lang/regex#779):
    Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
-   [BUG #850](rust-lang/regex#850):
    Fix matching bug inconsistency between NFA and DFA engines.
-   [BUG #921](rust-lang/regex#921):
    Fix matching bug where literal extraction got confused by `$`.
-   [BUG #976](rust-lang/regex#976):
    Add documentation to replacement routines about dealing with fallibility.
-   [BUG #1002](rust-lang/regex#1002):
    Use corpus rejection in fuzz testing.

</details>

---


This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>