
Address #10134 OOM/timeout #10173

Closed
wants to merge 6 commits into from

Conversation

smoelius
Contributor

@smoelius smoelius commented Jan 7, 2023

This is an attempt to address the OOM/timeout exhibited by @mwkmwkmwk's example in #10134.

As I see it, the reasons for the OOM/timeout are the following:

  • Her example involved a large number of basic blocks.
  • A program's number of locals tends to be proportional to its number of basic blocks (as was the case in her example).
  • The work done by possible_borrower at each join in iterate_to_fixpoint (essentially, at each basic block) was proportional to the number of locals.

One way out of this situation is to make the work done at each join not proportional to the number of locals. The current PR proposes a data structure that achieves this for certain usage patterns, e.g., certain MIR dataflow problems. In particular, it avoids an OOM/timeout on @mwkmwkmwk's example and does not seem to degrade performance on Clippy's existing tests.

The data structure supports an in-place union operation similar to BitSet's union. However, unlike BitSet's union, which returns true only when new bits were added, this data structure's union will sometimes delay computation and return true immediately even when no new elements were added. Delayed computations are represented in a tree-like data structure, and when certain conditions are met, trees are flattened into actual sets.

Additional details are provided in lazy_set's module-level documentation.
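The delayed-union idea can be sketched as follows (illustrative names only; this is not the PR's actual lazy_set code, and the flattening policy here is a stand-in):

```rust
use std::collections::BTreeSet;

// A set that can defer unions: instead of merging element-by-element at
// each join, it records the union as a tree node and flattens later.
enum LazySet {
    Flat(BTreeSet<u32>),
    // A delayed union of two operands.
    Union(Box<LazySet>, Box<LazySet>),
}

impl LazySet {
    // O(1) union: record the operation and report "changed" immediately,
    // even though no new element may actually have been added.
    fn union(self, other: LazySet) -> (LazySet, bool) {
        (LazySet::Union(Box::new(self), Box::new(other)), true)
    }

    fn contains(&self, x: u32) -> bool {
        match self {
            LazySet::Flat(s) => s.contains(&x),
            LazySet::Union(a, b) => a.contains(x) || b.contains(x),
        }
    }

    // Collapse the tree into a real set (in the PR, this happens once
    // certain conditions are met).
    fn flatten(self) -> BTreeSet<u32> {
        match self {
            LazySet::Flat(s) => s,
            LazySet::Union(a, b) => {
                let mut s = a.flatten();
                s.extend(b.flatten());
                s
            }
        }
    }
}
```

Note the trade-off this makes: `union` is constant time regardless of the number of locals, at the cost of `union` occasionally over-reporting "changed" and of queries paying for the deferred work.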

r? @Jarcho

(@xFrednet I copied your edits to #10144's changelog, but changed the PR number.)


changelog: Enhancement: [redundant_clone]: Improved performance and memory usage
#10173

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties label Jan 7, 2023
@xFrednet
Member

xFrednet commented Jan 7, 2023

Thank you, the changelog looks good to me :D

@mwkmwkmwk

This helps with my synthetic minimized example indeed, but doesn't help at all with my actual project

I'll see if I can create a more representative minimized example that OOMs with this branch; or, if you'd prefer to look yourself, just run cargo clippy in https://github.com/prjunnamed/prjcombine

@Jarcho
Contributor

Jarcho commented Jan 8, 2023

The problem is caused by the domain of the analysis being stored per basic block. With the current implementation this results in the memory usage being nm^2 where n is the number of locals and m is the number of basic blocks. This will explode for large functions. 1000 locals with 1000 basic blocks is not particularly large and results in memory usage of 128MB just for the bitsets.

A fix would be to use a copy-on-write bitset. A hybrid between a small sparse bitset and the full COW bitset would probably be best.
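The quoted figure can be reproduced with back-of-the-envelope arithmetic, assuming one n-bit borrower set per local, per basic block (n = locals, m = blocks gives m·n² bits total; 1000 × 1000² bits is 125 MB, in the ballpark of the ~128MB above):

```rust
// Rough memory estimate for the per-block bitsets: each basic block
// stores one n-bit set per local, i.e. m * n * n bits in total.
fn bitset_bytes(locals: usize, blocks: usize) -> usize {
    blocks * locals * locals / 8
}
```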

@smoelius
Contributor Author

smoelius commented Jan 8, 2023

With the current implementation this results in the memory usage being nm^2 where n is the number of locals and m is the number of basic blocks.

Not mn^2? If nm^2 is what you meant, could you explain how you arrived at that?

A fix would be to use a copy-on-write bitset. A hybrid between a small sparse bitset and the full COW bitset would probably be best.

So, to be more precise, you mean instead of FxIndexMap<Local, BitSet<Local>> (which is used now), something like FxIndexMap<Local, Cow<BitSet<Local>>>?

@Jarcho
Contributor

Jarcho commented Jan 8, 2023

Not mn^2? If nm^2 is what you meant, could you explain how you arrived at that?

Yes. That should have been mn^2. Doesn't really matter either way as both would be unreasonably large.

So, to be more precise, you mean instead of FxIndexMap<Local, BitSet<Local>> (which is used now), something like FxIndexMap<Local, Cow<BitSet<Local>>>?

Something like that, yes. Each basic block is unlikely to produce many borrows, so this will significantly cut down the number of cloned bitsets.
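The copy-on-write idea can be sketched with Rc (a Vec<u64> of words stands in for rustc's BitSet here; this is illustrative, not Clippy's actual code):

```rust
use std::rc::Rc;

// A bitset whose clones are O(1); the underlying words are only
// duplicated when a *shared* set is mutated.
#[derive(Clone, PartialEq, Debug)]
struct CowBitSet {
    words: Rc<Vec<u64>>,
}

impl CowBitSet {
    fn new(nbits: usize) -> Self {
        CowBitSet { words: Rc::new(vec![0; (nbits + 63) / 64]) }
    }

    fn insert(&mut self, bit: usize) {
        // Rc::make_mut clones the Vec only if other handles exist.
        Rc::make_mut(&mut self.words)[bit / 64] |= 1 << (bit % 64);
    }

    fn contains(&self, bit: usize) -> bool {
        (self.words[bit / 64] >> (bit % 64)) & 1 == 1
    }
}
```

Since most blocks produce no new borrows, most clones would never trigger the copy, which is where the memory savings come from.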

@smoelius
Contributor Author

smoelius commented Jan 8, 2023

Something like that, yes. Each basic block is unlikely to produce many borrows, so this will significantly cut down the number of cloned bitsets.

I'll give it a shot.

@smoelius
Contributor Author

smoelius commented Jan 8, 2023

In the branch from #10144, I pushed what I think you were describing @Jarcho.

This approach helps with the memory, but seemingly not with the time required.

On my laptop, @mwkmwkmwk's first example takes about 10 seconds. For her second example, I kill it after a couple of minutes.

Part of the problem with her first example is that there are many tiny borrower sets. So the loop that is currently in join (i.e., for (&borrowed, borrowers) in other.map.iter()) hurts us.

That is why I think we need an approach where join's work does not depend on the number of locals, at least not for the majority of joins.

There could be something obvious that I am not seeing, though.

@Jarcho
Contributor

Jarcho commented Jan 10, 2023

You could try optimizing a join on the bottom value into a clone instead of looping over every value. bottom.join(pred) is called at least once per block.

Might be worth reference counting the map itself. Going by the MIR output of the second example less than half the blocks result in a borrow.

Using IndexVec instead of FxHashMap might help, but it might also make things worse. If you keep the vec sized to the highest numbered local borrowed it should improve things.

Reference counting only the dense bit set might also provide some performance benefit, but it could also make things worse.
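The first suggestion, a fast path for joining the bottom value, could look roughly like this (hypothetical types; not Clippy's actual join):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical dataflow state: borrowed local -> set of possible borrowers.
#[derive(Clone, Default, PartialEq, Debug)]
struct BorrowerState {
    map: HashMap<u32, HashSet<u32>>,
}

impl BorrowerState {
    fn is_bottom(&self) -> bool {
        self.map.is_empty()
    }

    fn join(&mut self, other: &BorrowerState) -> bool {
        // Fast path: joining into bottom is just a clone, skipping the
        // per-entry loop entirely. This runs at least once per block.
        if self.is_bottom() {
            *self = other.clone();
            return !self.is_bottom();
        }
        let mut changed = false;
        for (&borrowed, borrowers) in other.map.iter() {
            let entry = self.map.entry(borrowed).or_default();
            for &b in borrowers {
                changed |= entry.insert(b);
            }
        }
        changed
    }
}
```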


The second example is bordering on unreasonable. Clippy should still run quickly on it, but the single function takes over two seconds for me to compile. Adding one more quad! macro outside the loop raises that to forty seconds. Two more puts it up to 21 minutes and several GB of memory.

@smoelius
Contributor Author

smoelius commented Jan 11, 2023

These sound like good suggestions, but I am afraid they wouldn't address the time required by @mwkmwkmwk's first example.

What if PossibleBorrower included some sort of a "work limit" and returned "inconclusive" when it was/would be exceeded?

PossibleBorrower could then either:

  • Estimate whether the work limit would be exceeded before analyzing a function body.
  • Maintain a running tally and bail when the work limit is reached (something similar to what I did here).

What do you think of either of these ideas?

(EDIT: I'm considering the memory issue essentially resolved by your suggestions here.)

@smoelius
Contributor Author

To make the idea concrete, I implemented the first bullet and pushed it to the branch from #10144.

@bors
Collaborator

bors commented Jan 12, 2023

☔ The latest upstream changes (presumably #10192) made this pull request unmergeable. Please resolve the merge conflicts.

@smoelius
Contributor Author

Close this?

(FWIW, I think reverting #9701 was the right decision.)

@Jarcho
Contributor

Jarcho commented Jan 12, 2023

If it can be made to work it's worth doing.

@smoelius
Contributor Author

One other idea I had was this:

/// 2. In all current uses of `at_most_borrowers`, `borrowers` is a slice of at most two
/// elements. Thus, `borrowers.contains(...)` is effectively a constant-time operation. If

We could require the caller to provide an upper bound on the number of borrowers (e.g., 2), and incorporate this into the analysis. This would effectively take the complexity from nm^2 to nm.

The downside would be it could restrict possible_borrower's applicability to future lints, e.g., ones that really need to know the full set of borrowers.

What do you think?

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

We could require the caller to provide an upper bound on the number of borrowers (e.g., 2), and incorporate this into the analysis. This would effectively take the complexity from nm^2 to nm.

I don't see how that could work. Just because a type has only two borrowers at a single point in time doesn't mean it didn't have hundreds of borrowers earlier. We still need to keep track of all known borrowers to know when something would drop down to only a specific number.

You could limit the locals tracked to only the one we care about. This would mean rerunning the analysis for each local we need to check, but it might end up being faster.

To make the idea concrete, I implemented the first bullet and pushed it to the branch from #10144.

I'd rather not take this approach, given that the current approach does handle these cases.

@smoelius
Contributor Author

We could require the caller to provide an upper bound on the number of borrowers (e.g., 2), and incorporate this into the analysis. This would effectively take the complexity from nm^2 to nm.

I don't see how that could work. Just because a type has only two borrowers at a single point in time doesn't mean it didn't have hundreds of borrowers earlier. We still need to keep track of all known borrowers to know when something would drop down to only a specific number.

Sorry for not being clear. What I meant was: the state for each local would hold up to say two borrowers, but there would be an additional representation to mean "three or more." Something like:

struct PossibleBorrowerState {
    map: FxIndexMap<Local, Option<FxHashSet<Local>>>, // each set's size is at most `max_borrowers`,
                                                      // `None` means `max_borrowers` was exceeded
    max_borrowers: usize,
}

Joining "three or more" with anything else results in another "three or more."

At any given program point, when we ask, "Are the borrowers of X at most Y and Z?" if X's state is "three or more," the answer is immediately "no."

(Come to think of it, if a future lint did need the full set of borrowers, it could just set the limit exorbitantly high.)
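The saturating join for one local's entry could be sketched like this (hypothetical helper, matching the PossibleBorrowerState shape above):

```rust
use std::collections::HashSet;

// Join two per-local borrower entries. `None` means "more than
// `max_borrowers` borrowers" and absorbs everything it is joined with.
fn join_borrowers(
    a: Option<HashSet<u32>>,
    b: &Option<HashSet<u32>>,
    max_borrowers: usize,
) -> Option<HashSet<u32>> {
    match (a, b) {
        (Some(mut x), Some(y)) => {
            x.extend(y);
            // Saturate once the bound is exceeded.
            if x.len() <= max_borrowers { Some(x) } else { None }
        }
        // "three or more" joined with anything is "three or more".
        _ => None,
    }
}
```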

You could limit the locals tracked to only the one we care about. This would mean rerunning the analysis for each local we need to check, but it might end up being faster.

I'm hoping we don't have to go this route.

To make the idea concrete, I implemented the first bullet and pushed it to the branch from #10144.

I'd rather not take this approach, given that the current approach does handle these cases.

👍

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

That's how I interpreted your suggestion. It misses any of the more interesting cases where something was at one point borrowed, and now no longer is. Since borrowers are currently not removed once added, this would break quite a few cases.

@smoelius
Contributor Author

smoelius commented Jan 13, 2023

It misses any of the more interesting cases where something was at one point borrowed, and now no longer is. Since borrowers are currently not removed once added, this would break quite a few cases.

I'm not following. I don't think the code did this before or after #9701. In both cases, the borrower sets only grew.

What am I missing?

EDIT: I will try to implement it. Either it will work, or I will find my mistake.

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

The borrower set is trimmed down afterwards with MaybeStorageLive so only live borrows are taken into account.

@smoelius
Contributor Author

The borrower set is trimmed down afterwards with MaybeStorageLive so only live borrows are taken into account.

I see what you are saying now.

@Jarcho
Contributor

Jarcho commented Jan 13, 2023

It might be better to look into getting the borrow checker usable from clippy rather than rewriting a limited version of it.

@smoelius
Contributor Author

Do you have any thoughts on how one would go about that, or what the result would look like?

@Jarcho
Contributor

Jarcho commented Jan 14, 2023

The borrow checker starts at rustc_borrowck::do_mir_borrowck. Looks like the rustc_borrowck::dataflow::Borrows pass has the results that we want. Basically none of the borrow checker has a public interface, so one would need to be built.

@smoelius
Contributor Author

@Jarcho I could use your input.

I have been looking into how we might incorporate Borrows into Clippy, and I am running into trouble with lifetimes.

In particular, a Borrows holds a reference to a Body and a BorrowSet. Thus, it's not clear how one would store all three structures in a LintPass structure.

One way out would be to change Borrows so that it held an Rc to a Body and a BorrowSet. But that feels a little invasive toward rustc.

Another way out I think would be to use pinning. (I haven't played with pinning before, so I'm speculating a bit.) But I think that would also require unsafe code.

I can't find any unsafe code in Clippy currently. So a question I have is: is adding unsafe code a "red line" not to be crossed in Clippy?

Alternatively, do you think Rcs are the right way to go?

Or is there another option I haven't thought of?
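The Rc option can be illustrated generically (stand-in types, not rustc's actual Body/BorrowSet/Borrows definitions): an analysis that borrows its inputs cannot live in the same struct as those inputs, but shared ownership sidesteps the self-reference.

```rust
use std::rc::Rc;

struct Body;
struct BorrowSet;

// The shape that does NOT work as a lint-pass field:
//   struct Borrows<'a> { body: &'a Body, set: &'a BorrowSet }
// because the pass would have to own Body/BorrowSet and borrow from
// itself. With Rc there is no borrow, only shared ownership:
struct Borrows {
    body: Rc<Body>,
    set: Rc<BorrowSet>,
}

struct LintPassState {
    body: Rc<Body>,
    set: Rc<BorrowSet>,
    borrows: Borrows,
}

fn build() -> LintPassState {
    let body = Rc::new(Body);
    let set = Rc::new(BorrowSet);
    let borrows = Borrows { body: Rc::clone(&body), set: Rc::clone(&set) };
    LintPassState { body, set, borrows }
}
```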

@Jarcho
Contributor

Jarcho commented Jan 21, 2023

Body is already bound by the TyCtxt anyways, so that one isn't a problem. The BorrowSet is already held behind a Rc in borrowck in the first place, so it's probably fine if Borrows changes to hold it as one

Have you checked if Borrows contains what we need? I only looked at it briefly.

@smoelius
Contributor Author

smoelius commented Jan 21, 2023

Have you checked if Borrows contains what we need? I only looked at it briefly.

I'm still trying to figure that out. I was coding up what I thought might work when I ran into this problem.

I will reply again when I know more.

@smoelius
Contributor Author

smoelius commented Feb 6, 2023

@Jarcho The code I just pushed does not compile---it requires changes to rustc. Those changes can be found in this branch. Before I open a PR to the Rust repo, could I get a preliminary, "yes, this stands a chance of being merged into Clippy?" (There is no rush.)

This code reconstructs the MIR the borrow checker is run on, and from that reconstructs some artifacts the borrow checker would have produced, notably:

These artifacts are used most crucially here (e.g., if there were a bug that would cause this approach to fall apart, I would expect it to be that code).

As alluded to above, the borrow checker is run on an earlier version of the MIR than the one returned by optimized_mir. Several of the changes in that rustc branch are to allow Clippy to reconstruct the earlier MIR.

I tried running the borrow checker on the optimized_mir MIR, but it didn't seem to work. For example, this span_bug fires.

It would seem unreasonable to expect every possible_borrower client to use the earlier MIR. So I wrote code to try to translate locals and locations from optimized_mir MIR to the earlier MIR. That code is currently in clippy_utils/src/mir.rs.

I think that hits the high points.

@Jarcho
Contributor

Jarcho commented Feb 6, 2023

I'll try to look at that this week.

@Jarcho
Contributor

Jarcho commented Feb 9, 2023

Given that this requires rebuilding the MIR, I'd rather not merge this as is. I think the better approach here is to add proper MIR lint passes into rustc (or something resembling them). That is, however, a rather large project which would require coordinating with the compiler team.

I'm not totally against the idea, but I would rather investigate alternatives first. I'll bring this up at the next clippy meeting in two weeks. Anyways, thanks for looking into this.


Back to the initial problem, you could try merging the PossibleBorrowers and StorageLive passes. This should result in a significantly smaller number of tracked borrowers at any given point in time, especially as the analysis gets further into the function.

The basic idea here is that once the local's storage is dead there can't be any live borrows, so we can stop keeping track of them. Since the majority of borrowed locals will have a rather narrow live range, this should result in significantly less work on each join, and possibly also fewer joins in general (loops are pretty much always run over at least twice now).
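The trimming step of the merged pass could be sketched like this (hypothetical state shape; in the real analysis this would fire on MIR StorageDead statements):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical dataflow state: borrowed local -> set of possible borrowers.
struct BorrowerState {
    map: HashMap<u32, HashSet<u32>>,
}

impl BorrowerState {
    // When a local's storage dies, drop it both as a borrowed key and
    // from every borrower set, so later joins touch fewer entries.
    fn on_storage_dead(&mut self, local: u32) {
        self.map.remove(&local);
        for borrowers in self.map.values_mut() {
            borrowers.remove(&local);
        }
    }
}
```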

@smoelius
Contributor Author

smoelius commented Feb 9, 2023

Given that this requires rebuilding the MIR, I'd rather not merge this as is. I think the better approach here is to add proper MIR lint passes into rustc (or something resembling them).

You probably know this already, but there's already something like this: https://doc.rust-lang.org/beta/nightly-rustc/rustc_mir_transform/pass_manager/trait.MirLint.html

I worry that using an additional lint pass in Clippy could create churn though. For example, an idea I was considering for the future was to use the borrow check results to identify unnecessary lifetime annotations (e.g., &'tcx Body<'tcx> where the annotation need not be on the reference). On the face of it, this sounds like an enhancement of needless_lifetimes. So (rhetorically) would/should needless_lifetimes then be converted to a MIR lint pass? It's something to consider.

Another idea to avoid reconstruction could be to add a flag to the compiler configuration to say "preserve the borrower check results" and then expose those structures through a query.

I'm not totally against the idea, but I would rather investigate alternatives first. I'll bring this up at the next clippy meeting in two weeks. Anyways, thanks for looking into this.

👍

Back to the initial problem, you could try merging the PossibleBorrowers and StorageLive passes. ...

This idea makes total sense. But if it's okay, I'd like to wait and see how we will proceed using the borrow check results before continuing with the existing PossibleBorrowers implementation.

@bors
Collaborator

bors commented Feb 10, 2023

☔ The latest upstream changes (presumably #10313) made this pull request unmergeable. Please resolve the merge conflicts.

@Jarcho Jarcho added the I-nominated Issue: Nominated to be discussed at the next Clippy meeting label Feb 21, 2023
@Jarcho
Contributor

Jarcho commented Feb 21, 2023

Result of the meeting: Assuming the compiler team is ok with the rustc changes and there aren't any perf issues that come up, this approach is ok.

@Jarcho Jarcho removed the I-nominated Issue: Nominated to be discussed at the next Clippy meeting label Feb 21, 2023
@smoelius
Contributor Author

Thanks, @Jarcho. I will try to push the compiler changes through.

@Jarcho
Contributor

Jarcho commented Feb 23, 2023

Did some quick perf testing. Rebuilding and borrow checking every function in clippy_lints:

                   Full build   Single function changed
  cargo check      18.20s       6.79s
  baseline clippy  28.55s       14.28s
  rebuilding mir   31.54s       18.28s

Worst case of 25% slower should be fine, given that this will get better once redundant_clone and needless_borrow aren't running the old code, and it doesn't run on literally every function.

This doesn't include any impact from the rustc changes, so this could end up being worse.

@smoelius
Contributor Author

Thanks, @Jarcho! I hope it wasn't too onerous to run these tests.

@smoelius
Contributor Author

@Jarcho This is still a WIP. No action is needed at this time.

@smoelius
Contributor Author

smoelius commented Feb 7, 2024

I'm going to close this. I may revisit this at some point in the future.

@smoelius smoelius closed this Feb 7, 2024
@smoelius smoelius deleted the lazy_set branch August 18, 2024 16:58