
Address #10134 OOM/timeout #10173

Closed
wants to merge 6 commits into from

Conversation

smoelius
Contributor

@smoelius smoelius commented Jan 7, 2023

This is an attempt to address the OOM/timeout exhibited by @mwkmwkmwk's example in #10134.

As I see it, the reasons for the OOM/timeout are the following:

  • Her example involved a large number of basic blocks.
  • A program's number of locals tends to be proportional to its number of basic blocks (as was the case in her example).
  • The work done by possible_borrower at each join in iterate_to_fixpoint (essentially, at each basic block) was proportional to the number of locals.

One way out of this situation is to make the work done at each join not proportional to the number of locals. The current PR proposes a data structure that achieves this for certain usage patterns, e.g., certain MIR dataflow problems. In particular, it avoids an OOM/timeout on @mwkmwkmwk's example and does not seem to degrade performance on Clippy's existing tests.

The data structure supports an in-place union operation similar to BitSet's union. However, unlike BitSet's union, which returns true only when new bits were added, this data structure's union will sometimes delay computation and return true immediately even when no new elements were added. Delayed computations are represented in a tree-like data structure, and when certain conditions are met, trees are flattened into actual sets.

Additional details are provided in lazy_set's module-level documentation.
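The delayed-union idea can be sketched as follows (illustrative names only; this is not the PR's actual lazy_set code, and the flattening policy here is a stand-in):

```rust
use std::collections::BTreeSet;

// A set that can defer unions: instead of merging element-by-element at
// each join, it records the union as a tree node and flattens later.
enum LazySet {
    Flat(BTreeSet<u32>),
    // A delayed union of two operands.
    Union(Box<LazySet>, Box<LazySet>),
}

impl LazySet {
    // O(1) union: record the operation and report "changed" immediately,
    // even though no new element may actually have been added.
    fn union(self, other: LazySet) -> (LazySet, bool) {
        (LazySet::Union(Box::new(self), Box::new(other)), true)
    }

    fn contains(&self, x: u32) -> bool {
        match self {
            LazySet::Flat(s) => s.contains(&x),
            LazySet::Union(a, b) => a.contains(x) || b.contains(x),
        }
    }

    // Collapse the tree into a real set (in the PR, this happens once
    // certain conditions are met).
    fn flatten(self) -> BTreeSet<u32> {
        match self {
            LazySet::Flat(s) => s,
            LazySet::Union(a, b) => {
                let mut s = a.flatten();
                s.extend(b.flatten());
                s
            }
        }
    }
}
```

Note the trade-off this makes: `union` is constant time regardless of the number of locals, at the cost of `union` occasionally over-reporting "changed" and of queries paying for the deferred work.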

r? @Jarcho

(@xFrednet I copied your edits to #10144's changelog, but changed the PR number.)


changelog: Enhancement: [redundant_clone]: Improved performance and memory usage
#10173

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties label Jan 7, 2023
@xFrednet
Member

xFrednet commented Jan 7, 2023

Thank you, the changelog looks good to me :D

@mwkmwkmwk

This helps with my synthetic minimized example indeed, but doesn't help at all with my actual project

I'll see if I can create a more representative minimized example that OOMs with this branch; or, if you'd prefer to look yourself, just run cargo clippy in https://github.com/prjunnamed/prjcombine

@Jarcho
Contributor

Jarcho commented Jan 8, 2023

The problem is caused by the domain of the analysis being stored per basic block. With the current implementation this results in the memory usage being nm^2 where n is the number of locals and m is the number of basic blocks. This will explode for large functions. 1000 locals with 1000 basic blocks is not particularly large and results in memory usage of 128MB just for the bitsets.

A fix would be to use a copy-on-write bitset. A hybrid between a small sparse bitset and the full COW bitset would probably be best.
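The quoted figure can be reproduced with back-of-the-envelope arithmetic, assuming one n-bit borrower set per local, per basic block (n = locals, m = blocks gives m·n² bits total; 1000 × 1000² bits is 125 MB, in the ballpark of the ~128MB above):

```rust
// Rough memory estimate for the per-block bitsets: each basic block
// stores one n-bit set per local, i.e. m * n * n bits in total.
fn bitset_bytes(locals: usize, blocks: usize) -> usize {
    blocks * locals * locals / 8
}
```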

@smoelius
Contributor Author

smoelius commented Jan 8, 2023

With the current implementation this results in the memory usage being nm^2 where n is the number of locals and m is the number of basic blocks.

Not mn^2? If nm^2 is what you meant, could you explain how you arrived at that?

A fix would be to use a copy-on-write bitset. A hybrid between a small sparse bitset and the full COW bitset would probably be best.

So, to be more precise, you mean instead of FxIndexMap<Local, BitSet<Local>> (which is used now), something like FxIndexMap<Local, Cow<BitSet<Local>>>?

@Jarcho
Contributor

Jarcho commented Jan 8, 2023

Not mn^2? If nm^2 is what you meant, could you explain how you arrived at that?

Yes. That should have been mn^2. Doesn't really matter either way as both would be unreasonably large.

So, to be more precise, you mean instead of FxIndexMap<Local, BitSet<Local>> (which is used now), something like FxIndexMap<Local, Cow<BitSet<Local>>>?

Something like that, yes. Each basic block is unlikely to produce many borrows, so this will significantly cut down the number of cloned bitsets.
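The copy-on-write idea can be sketched with Rc (a Vec<u64> of words stands in for rustc's BitSet here; this is illustrative, not Clippy's actual code):

```rust
use std::rc::Rc;

// A bitset whose clones are O(1); the underlying words are only
// duplicated when a *shared* set is mutated.
#[derive(Clone, PartialEq, Debug)]
struct CowBitSet {
    words: Rc<Vec<u64>>,
}

impl CowBitSet {
    fn new(nbits: usize) -> Self {
        CowBitSet { words: Rc::new(vec![0; (nbits + 63) / 64]) }
    }

    fn insert(&mut self, bit: usize) {
        // Rc::make_mut clones the Vec only if other handles exist.
        Rc::make_mut(&mut self.words)[bit / 64] |= 1 << (bit % 64);
    }

    fn contains(&self, bit: usize) -> bool {
        (self.words[bit / 64] >> (bit % 64)) & 1 == 1
    }
}
```

Since most blocks produce no new borrows, most clones would never trigger the copy, which is where the memory savings come from.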

@smoelius
Contributor Author

smoelius commented Jan 8, 2023

Something like that, yes. Each basic block is unlikely to produce many borrows, so this will significantly cut down the number of cloned bitsets.

I'll give it a shot.

@smoelius
Contributor Author

smoelius commented Jan 8, 2023

In the branch from #10144, I pushed what I think you were describing @Jarcho.

This approach helps with the memory, but seemingly not with the time required.

On my laptop, @mwkmwkmwk's first example takes about 10 seconds. For her second example, I kill it after a couple of minutes.

Part of the problem with her first example is that there are many tiny borrower sets. So the loop that is currently in join (i.e., for (&borrowed, borrowers) in other.map.iter()) hurts us.

That is why I think we need an approach where join's work does not depend on the number of locals, at least not for the majority of joins.

There could be something obvious that I am not seeing, though.

@Jarcho
Contributor

Jarcho commented Jan 10, 2023

You could try optimizing a join on the bottom value into a clone instead of looping over every value. bottom.join(pred) is called at least once per block.

Might be worth reference counting the map itself. Going by the MIR output of the second example less than half the blocks result in a borrow.

Using IndexVec instead of FxHashMap might help, but it might also make things worse. If you keep the vec sized to the highest numbered local borrowed it should improve things.

Reference counting only the dense bit set might also provide some performance benefit, but it could also make things worse.
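The first suggestion, a fast path for joining the bottom value, could look roughly like this (hypothetical types; not Clippy's actual join):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical dataflow state: borrowed local -> set of possible borrowers.
#[derive(Clone, Default, PartialEq, Debug)]
struct BorrowerState {
    map: HashMap<u32, HashSet<u32>>,
}

impl BorrowerState {
    fn is_bottom(&self) -> bool {
        self.map.is_empty()
    }

    fn join(&mut self, other: &BorrowerState) -> bool {
        // Fast path: joining into bottom is just a clone, skipping the
        // per-entry loop entirely. This runs at least once per block.
        if self.is_bottom() {
            *self = other.clone();
            return !self.is_bottom();
        }
        let mut changed = false;
        for (&borrowed, borrowers) in other.map.iter() {
            let entry = self.map.entry(borrowed).or_default();
            for &b in borrowers {
                changed |= entry.insert(b);
            }
        }
        changed
    }
}
```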


The second example is bordering on unreasonable. Clippy should still run quickly on it, but the single function takes over two seconds for me to compile. Adding one more quad! macro outside the loop raises that to forty seconds. Two more puts it up to 21 minutes and several GB of memory.

@smoelius
Contributor Author

smoelius commented Jan 11, 2023

These sound like good suggestions, but I am afraid they wouldn't address the time required by @mwkmwkmwk's first example.

What if PossibleBorrower included some sort of a "work limit" and returned "inconclusive" when it was/would be exceeded?

PossibleBorrower could then either:

  • Estimate whether the work limit would be exceeded before analyzing a function body.
  • Maintain a running tally and bail when the work limit is reached (something similar to what I did here).

What do you think of either of these ideas?

(EDIT: I'm considering the memory issue essentially resolved by your suggestions here.)

@smoelius
Contributor Author

To make the idea concrete, I implemented the first bullet and pushed it to the branch from #10144.

@bors
Collaborator

bors commented Jan 12, 2023

☔ The latest upstream changes (presumably #10192) made this pull request unmergeable. Please resolve the merge conflicts.

@smoelius
Contributor Author

Close this?

(FWIW, I think reverting #9701 was the right decision.)

@Jarcho
Contributor

Jarcho commented Jan 12, 2023

If it can be made to work it's worth doing.

@smoelius
Contributor Author

One other idea I had was this:

/// 2. In all current uses of `at_most_borrowers`, `borrowers` is a slice of at most two
/// elements. Thus, `borrowers.contains(...)` is effectively a constant-time operation. If

We could require the caller to provide an upper bound on the number of borrowers (e.g., 2), and incorporate this into the analysis. This would effectively take the complexity from nm^2 to nm.

The downside would be it could restrict possible_borrower's applicability to future lints, e.g., ones that really need to know the full set of borrowers.

What do you think?

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

We could require the caller to provide an upper bound on the number of borrowers (e.g., 2), and incorporate this into the analysis. This would effectively take the complexity from nm^2 to nm.

I don't see how that could work. Just because a type has only two borrowers at a single point in time doesn't mean it didn't have hundreds of borrowers earlier. We still need to keep track of all known borrowers to know when something would drop down to only a specific number.

You could limit the locals tracked to only the one we care about. This would mean rerunning the analysis for each local we need to check, but it might end up being faster.

To make the idea concrete, I implemented the first bullet and pushed it to the branch from #10144.

I'd rather not take this approach, given that the current approach does handle these cases.

@smoelius
Contributor Author

We could require the caller to provide an upper bound on the number of borrowers (e.g., 2), and incorporate this into the analysis. This would effectively take the complexity from nm^2 to nm.

I don't see how that could work. Just because a type has only two borrowers at a single point in time doesn't mean it didn't have hundreds of borrowers earlier. We still need to keep track of all known borrowers to know when something would drop down to only a specific number.

Sorry for not being clear. What I meant was: the state for each local would hold up to say two borrowers, but there would be an additional representation to mean "three or more." Something like:

struct PossibleBorrowerState {
    map: FxIndexMap<Local, Option<FxHashSet<Local>>>, // each set's size is at most `max_borrowers`,
                                                      // `None` means `max_borrowers` was exceeded
    max_borrowers: usize,
}

Joining "three or more" with anything else results in another "three or more."

At any given program point, when we ask, "Are the borrowers of X at most Y and Z?" if X's state is "three or more," the answer is immediately "no."

(Come to think of it, if a future lint did need the full set of borrowers, it could just set the limit exorbitantly high.)
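The saturating join for one local's entry could be sketched like this (hypothetical helper, matching the PossibleBorrowerState shape above):

```rust
use std::collections::HashSet;

// Join two per-local borrower entries. `None` means "more than
// `max_borrowers` borrowers" and absorbs everything it is joined with.
fn join_borrowers(
    a: Option<HashSet<u32>>,
    b: &Option<HashSet<u32>>,
    max_borrowers: usize,
) -> Option<HashSet<u32>> {
    match (a, b) {
        (Some(mut x), Some(y)) => {
            x.extend(y);
            // Saturate once the bound is exceeded.
            if x.len() <= max_borrowers { Some(x) } else { None }
        }
        // "three or more" joined with anything is "three or more".
        _ => None,
    }
}
```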

You could limit the locals tracked to only the one we care about. This would mean rerunning the analysis for each local we need to check, but it might end up being faster.

I'm hoping we don't have to go this route.

To make the idea concrete, I implemented the first bullet and pushed it to the branch from #10144.

I'd rather not take this approach, given that the current approach does handle these cases.

👍

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

That's how I interpreted your suggestion. It misses any of the more interesting cases where something was at one point borrowed, and now no longer is. Since borrowers are currently not removed once added, this would break quite a few cases.

@smoelius
Contributor Author

smoelius commented Jan 13, 2023

It misses any of the more interesting cases where something was at one point borrowed, and now no longer is. Since borrowers are currently not removed once added, this would break quite a few cases.

I'm not following. I don't think the code did this before or after #9701. In both cases, the borrower sets only grew.

What am I missing?

EDIT: I will try to implement it. Either it will work, or I will find my mistake.

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

The borrower set is trimmed down afterwards with MaybeStorageLive so only live borrows are taken into account.

@smoelius
Contributor Author

The borrower set is trimmed down afterwards with MaybeStorageLive so only live borrows are taken into account.

I see what you are saying now.

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

It might be better to look into getting the borrow checker usable from clippy rather than rewriting a limited version of it.

@smoelius
Contributor Author

Do you have any thoughts on how one would go about that, or what the result would look like?

@Jarcho
Contributor

Jarcho commented Jan 14, 2023

The borrow checker starts at rustc_borrowck::do_mir_borrowck. Looks like the rustc_borrowck::dataflow::Borrows pass has the results that we want. Basically none of the borrow checker has a public interface, so one would need to be built.

@smoelius
Contributor Author

@Jarcho I could use your input.

I have been looking into how we might incorporate Borrows into Clippy, and I am running into trouble with lifetimes.

In particular, a Borrows holds a reference to a Body and a BorrowSet. Thus, it's not clear how one would store all three structures in a LintPass structure.

One way out would be to change Borrows so that it held an Rc to a Body and a BorrowSet. But that feels a little invasive toward rustc.

Another way out I think would be to use pinning. (I haven't played with pinning before, so I'm speculating a bit.) But I think that would also require unsafe code.

I can't find any unsafe code in Clippy currently. So a question I have is: is adding unsafe code a "red line" not to be crossed in Clippy?

Alternatively, do you think Rcs are the right way to go?

Or is there another option I haven't thought of?
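The Rc option can be illustrated generically (stand-in types, not rustc's actual Body/BorrowSet/Borrows definitions): an analysis that borrows its inputs cannot live in the same struct as those inputs, but shared ownership sidesteps the self-reference.

```rust
use std::rc::Rc;

struct Body;
struct BorrowSet;

// The shape that does NOT work as a lint-pass field:
//   struct Borrows<'a> { body: &'a Body, set: &'a BorrowSet }
// because the pass would have to own Body/BorrowSet and borrow from
// itself. With Rc there is no borrow, only shared ownership:
struct Borrows {
    body: Rc<Body>,
    set: Rc<BorrowSet>,
}

struct LintPassState {
    body: Rc<Body>,
    set: Rc<BorrowSet>,
    borrows: Borrows,
}

fn build() -> LintPassState {
    let body = Rc::new(Body);
    let set = Rc::new(BorrowSet);
    let borrows = Borrows { body: Rc::clone(&body), set: Rc::clone(&set) };
    LintPassState { body, set, borrows }
}
```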

@Jarcho
Contributor

Jarcho commented Jan 21, 2023

Body is already bound by the TyCtxt anyways, so that one isn't a problem. The BorrowSet is already held behind a Rc in borrowck in the first place, so it's probably fine if Borrows changes to hold it as one

Have you checked if Borrows contains what we need? I only looked at it briefly.

@smoelius
Contributor Author

smoelius commented Jan 21, 2023

Have you checked if Borrows contains what we need? I only looked at it briefly.

I'm still trying to figure that out. I was coding up what I thought might work when I ran into this problem.

I will reply again when I know more.

@smoelius
Contributor Author

smoelius commented Feb 6, 2023

@Jarcho The code I just pushed does not compile---it requires changes to rustc. Those changes can be found in this branch. Before I open a PR to the Rust repo, could I get a preliminary, "yes, this stands a chance of being merged into Clippy?" (There is no rush.)

This code reconstructs the MIR the borrow checker is run on, and from that reconstructs some artifacts the borrow checker would have produced, notably:

These artifacts are used most crucially here (e.g., if there were a bug that would cause this approach to fall apart, I would expect it to be that code).

As alluded to above, the borrow checker is run on an earlier version of the MIR than the one returned by optimized_mir. Several of the changes in that rustc branch are to allow Clippy to reconstruct the earlier MIR.

I tried running the borrow checker on the optimized_mir MIR, but it didn't seem to work. For example, this span_bug fires.

It would seem unreasonable to expect every possible_borrower client to use the earlier MIR. So I wrote code to try to translate locals and locations from optimized_mir MIR to the earlier MIR. That code is currently in clippy_utils/src/mir.rs.

I think that hits the high points.

@Jarcho
Contributor

Jarcho commented Feb 6, 2023

I'll try to look at that this week.

@Jarcho
Contributor

Jarcho commented Feb 9, 2023

Given that this requires rebuilding the MIR, I'd rather not merge this as is. I think the better approach here is to add proper MIR lint passes into rustc (or something resembling them). That is, however, a rather large project which would require coordinating with the compiler team.

I'm not totally against the idea, but I would rather investigate alternatives first. I'll bring this up at the next clippy meeting in two weeks. Anyways, thanks for looking into this.


Back to the initial problem, you could try merging the PossibleBorrowers and StorageLive passes. This should result in a significantly smaller number of tracked borrowers at any given point in time, especially as the analysis gets further into the function.

The basic idea here is that once the local's storage is dead there can't be any live borrows, so we can stop keeping track of them. Since the majority of borrowed locals will have a rather narrow live range, this should result in significantly less work on each join, and possibly also fewer joins in general (loops are pretty much always run over at least twice now).
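The trimming step of the merged pass could be sketched like this (hypothetical state shape; in the real analysis this would fire on MIR StorageDead statements):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical dataflow state: borrowed local -> set of possible borrowers.
struct BorrowerState {
    map: HashMap<u32, HashSet<u32>>,
}

impl BorrowerState {
    // When a local's storage dies, drop it both as a borrowed key and
    // from every borrower set, so later joins touch fewer entries.
    fn on_storage_dead(&mut self, local: u32) {
        self.map.remove(&local);
        for borrowers in self.map.values_mut() {
            borrowers.remove(&local);
        }
    }
}
```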

@smoelius
Contributor Author

smoelius commented Feb 9, 2023

Given that this requires rebuilding the MIR, I'd rather not merge this as is. I think the better approach here is to add proper MIR lint passes into rustc (or something resembling them).

You probably know this already, but there's already something like this: https://doc.rust-lang.org/beta/nightly-rustc/rustc_mir_transform/pass_manager/trait.MirLint.html

I worry that using an additional lint pass in Clippy could create churn though. For example, an idea I was considering for the future was to use the borrow check results to identify unnecessary lifetime annotations (e.g., &'tcx Body<'tcx> where the annotation need not be on the reference). On the face of it, this sounds like an enhancement of needless_lifetimes. So (rhetorically) would/should needless_lifetimes then be converted to a MIR lint pass? It's something to consider.

Another idea to avoid reconstruction could be to add a flag to the compiler configuration to say "preserve the borrower check results" and then expose those structures through a query.

I'm not totally against the idea, but I would rather investigate alternatives first. I'll bring this up at the next clippy meeting in two weeks. Anyways, thanks for looking into this.

👍

Back to the initial problem, you could try merging the PossibleBorrowers and StorageLive passes. ...

This idea makes total sense. But if it's okay, I'd like to wait and see how we will proceed using the borrow check results before continuing with the existing PossibleBorrowers implementation.

@bors
Collaborator

bors commented Feb 10, 2023

☔ The latest upstream changes (presumably #10313) made this pull request unmergeable. Please resolve the merge conflicts.

@Jarcho Jarcho added the I-nominated Issue: Nominated to be discussed at the next Clippy meeting label Feb 21, 2023
@Jarcho
Contributor

Jarcho commented Feb 21, 2023

Result of the meeting: Assuming the compiler team is ok with the rustc changes and there aren't any perf issues that come up, this approach is ok.

@Jarcho Jarcho removed the I-nominated Issue: Nominated to be discussed at the next Clippy meeting label Feb 21, 2023
@smoelius
Contributor Author

Thanks, @Jarcho. I will try to push the compiler changes through.

@Jarcho
Contributor

Jarcho commented Feb 23, 2023

Did some quick perf testing. Rebuilding and borrow checking every function in clippy_lints:

                   Full build   Single function changed
  cargo check      18.20s       6.79s
  baseline clippy  28.55s       14.28s
  rebuilding mir   31.54s       18.28s

Worst case of 25% slower should be fine, given that this will get better once redundant_clone and needless_borrow aren't running the old code, and it doesn't run on literally every function.

This doesn't include any impact from the rustc changes, so this could end up being worse.

@smoelius
Contributor Author

Thanks, @Jarcho! I hope it wasn't too onerous to run these tests.

@smoelius
Contributor Author

@Jarcho This is still a WIP. No action is needed at this time.

@smoelius
Contributor Author

smoelius commented Feb 7, 2024

I'm going to close this. I may revisit this at some point in the future.

@smoelius smoelius closed this Feb 7, 2024
@smoelius smoelius deleted the lazy_set branch August 18, 2024 16:58