Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid buffering large amounts of rustc output. #7838

Merged
merged 4 commits into from
Mar 12, 2020

Conversation

ehuss
Copy link
Contributor

@ehuss ehuss commented Jan 26, 2020

If rustc prints out a lot of information (such as with RUSTC_LOG, or a huge number of diagnostics), cargo would buffer up large amounts of that in memory. For normal builds, this would happen if the terminal does not print fast enough. For "fresh" replay, everything was being buffered.

There are two issues:

  1. There is no back-pressure on the mpsc queue. If messages come in faster than they can be processed, it grows without bounds.
  2. The cache-replay code runs in the "fresh" code path which does not spawn a thread. Thus the main thread was blocked and unable to process Messages while the replay is happening.

The solution here is to use a bounded queue, and to always spawn a thread for the "fresh" case.

The main concern here is performance. Previously the "fresh" jobs avoided spawning a thread to improve performance. I did a fair bit of profiling to understand the impact, using projects with anywhere from 100 to 500 units. On my macOS machine, I found spawning a thread to be slightly faster (1-5%). On Linux and Windows, it was generally about 0 to 5% slower. It might be helpful for others to profile it on their own system.

I'm on the fence for the cost/benefit here. It seems generally good to reduce memory usage, but the slight performance hit is disappointing. I tried several other approaches to fix this, all with worse trade offs (I can discuss them if interested).

Fixes #6197

@rust-highfive
Copy link

r? @Eh2406

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 26, 2020
@Eh2406
Copy link
Contributor

Eh2406 commented Jan 27, 2020

The code looks ok to r+ when we are done with the discussion.

@alexcrichton
Copy link
Member

Agreed that the concern here is perf, and my "benchmark" is to check out servo, do cargo build, and then see how long it takes to do that again. I did some quick testing locally though and also haven't been able to see much of a difference.

I think this is a case where the difference in perf, if any, is going to be pretty small (as measured by @ehuss). I also think it's best to be "more correct" in terms of not blocking the main thread and not hogging tons of memory.

Before I r+, though, I wanted to clarify something. When we buffer huge amounts of output here, is Cargo actually going to print all the output? Or is cargo buffering it in one location and later deciding to not print it? (if this is the case this seems like a better bug to fix, but if we're actually destined to print everything we read then we're just optimizing here what we're already printing).

@Mark-Simulacrum
Copy link
Member

I would like to see a comment on the sync_channel pointing at this PR, at least.

Did you choose 100 as the bound for the queue length for some reason, or just "some number"? I'm wondering if that will fail to work well with the jobserver-per-rustc flag, as in that scenario on a ~16 core machine we would expect 16*15 = 240 token requests to come in from all the rustc processes fairly quickly (and keep coming in, during the first few seconds of the build). For that case, we actually don't need to do anything with those requests (most will be ~immediately dropped on the floor, as by the time we get to them the process is already done I imagine), but if this length limit to 100 causes us to stall out and miss "Finished" events that could seriously slow Cargo down.

(To be clear, I don't think fixing the above is necessary, it's an unstable flag for a reason -- but wanted to dump my thoughts somewhere at least :)

@alexcrichton
Copy link
Member

As we continue to scrutinize the channels in Cargo I'm becoming a bit more wary of making sends blocking. I think we may want to do a quick audit as well where cargo sends messages to see if it's ok to block. The areas I can see are:

  • All methods on JobState may block now. This notably includes the stdout printing and such (the whole point of this PR). I think this is mostly ok but we may experience some timing issues depending on when timestamps get written and such. For example rustc may finish but it may take us quite some time to later touch a file to make a timestamp. I don't think this is a problem, but figured it'd be worth mentioning.
  • Sending tokens from the jobserver helper thread to the main thread. @ehuss this is like the interaction you were seeing although I would have expected deadlock rather than blocking for just awhile. In any case this is pretty sketchy because the jobserver helper thread needs to be shut down but it's blocking here where it didn't expect to be blocking. (note that the signals are intended to interrupt the blocking read call, if one is active)
  • Handling diagnostics which I think is basically the same bug as the jobserver helper thread all over again. We want to terminate the thread eventually but doing so may be problematic if it's blocking where we didn't expect it to block.

I suppose though the "maybe issues" are in practice not ever going to arise because we in theory should never start shut down until the whole message queue has been drained.

@alexcrichton
Copy link
Member

Reading more of @Mark-Simulacrum's comment as well, I think it's actually a pretty good point. I'm wondering now if it might be best to have a more surgical fix here where we rate-limit stdout information getting printed but not rate limit other more high-priority messages. For example everything about jobserver management is a pretty high-priority message (or anything related to scheduling) whereas printing things is informational and can happen whenever.

We could perhaps consider a fix where there's a fixed capacity, above which print messages block the sender, but that's it. All other messages (such as scheduling things) are unconditionally sent and never block

@Mark-Simulacrum
Copy link
Member

One similar option perhaps is to try and move all stderr/stdout printing to a separate thread. AFAICT, it doesn't interact with the scheduling pretty much at all. It also seems like 90% of the problem comes from the fact that currently all Fresh jobs (whose output is on disk) are loading it into memory and sending it over the channel. Can we instead make the Message event have two variants, one of which we'd thread down as deeply as possible and then stream from disk to stderr/out? Ideally that would avoid most buffering, whereas today I believe some buffering is sort of unavoidable (i.e. a single message could be 20 megabytes for larger crates).

@ehuss
Copy link
Contributor Author

ehuss commented Jan 29, 2020

When we buffer huge amounts of output here, is Cargo actually going to print all the output? Or is cargo buffering it in one location and later deciding to not print it?

It is going to be printed.

Did you choose 100 as the bound for the queue length for some reason, or just "some number"?

100 is pretty arbitrary. Messages can be large (multiple kB), so I figured keeping up to a few megabytes in memory seemed like a good limit.

Your concerns about large numbers of token messages sounds reasonable.

more surgical fix here where we rate-limit stdout information getting printed but not rate limit other more high-priority messages

This sounds good to me. I actually started with a different design where I had two separate queues, one for stdout/stderr, and one for everything else. But it ended up being quite a bit more complex. (I also had a branch where Shell is in a mutex and I removed Message::Stdout/Stderr, but it made the progress bar flicker too much.)

I'll try to take some time and digest your comments. I think you're both right, and this should probably have a different solution.

@bors
Copy link
Contributor

bors commented Jan 30, 2020

☔ The latest upstream changes (presumably #7844) made this pull request unmergeable. Please resolve the merge conflicts.

@ehuss ehuss force-pushed the fix-memory-rustc-output branch from e3f5032 to 9839826 Compare March 6, 2020 02:27
@ehuss
Copy link
Contributor Author

ehuss commented Mar 6, 2020

I pushed a different approach using two channels.

This change is somewhat risky, since there are some really subtle behaviors here. I've tried to think of all that could go wrong, and haven't come up with anything, yet. All other solutions I've thought of tend to be more complicated and riskier.

Only one change that I can think of: The message "build failed, waiting for other jobs to finish..." can now be printed in-between messages, where previously it would be printed after the faulty job finished. I'm not sure how likely that is, or whether it really matters.

@alexcrichton
Copy link
Member

Once we get into the realm of multiple channels I agree it's pretty hairy. Could we stick with one channel though? We could implement our own simple channel which is just a wrapper around Arc<Mutex<Vec<T>>> and we have two methods, one which always pushes and one which waits to push until the list is under a certain amount?

I think that would ideally help keep the concurrency here pretty simple since it's still just one queue of channels going out.

@ehuss
Copy link
Contributor Author

ehuss commented Mar 6, 2020

You mean #7845? I'd be much happier with that.

@alexcrichton
Copy link
Member

Effectively, yeah, I don't think it's worth trying to bend over backwards to use crates.io for a simple channel here, especially if it's adding a lot of complexity in thinking about the concurrency here.

alexcrichton and others added 2 commits March 7, 2020 14:01
We don't need the complexity of most channels since this is not a
performance sensitive part of Cargo, nor is it likely to be so any time
soon. Coupled with recent bugs (rust-lang#7840) we believe in `std::sync::mpsc`,
let's just not use that and use a custom queue type locally which should
be amenable to a blocking push soon too.
@ehuss ehuss force-pushed the fix-memory-rustc-output branch from 9839826 to e2b28f7 Compare March 8, 2020 18:25
@ehuss
Copy link
Contributor Author

ehuss commented Mar 8, 2020

I have pushed a new approach that uses #7845 instead. I'm still not sure how I feel about it. I can't think of specific problems. I ran a variety of performance tests, and it was roughly the same.

}
scope.spawn(move |_| doit());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change may no longer be necessary, but did you want to include it anyway here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary, otherwise the cached message playback would deadlock if there were more than 100 messages. The playback shouldn't happen on the main thread, otherwise there is nothing to drain messages while they are added to the queue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right yeah, forgot about that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for message caching to check for deadlock.

/// Pushes an item onto the queue, blocking if the queue is full.
pub fn push_bounded(&self, item: T) {
let mut state = self.state.lock().unwrap();
loop {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be able to make use of the nifty wait_until method:

let state = self.bounded_cv.wait_until(state, |s| s.items.len() < self.bound).unwrap();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know that existed!

// Assumes threads cannot be canceled.
self.bounded_cv.notify_one();
}
Some(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might actually also get cleaned up a good amount with wait_timeout_until

let (mut state, result) = self.popper_cv.wait_timeout_until(
    self.state.lock().unwrap(),
    timeout,
    |s| s.items.len() > 0,
).unwrap();
if result.timed_out() {
    None 
} else {
    // conditionally notify `bounded_cv`
    state.items.pop_front()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, after thinking about it some more, this subtly changes the semantics. If there are multiple poppers, and both are awoken, then one will get a value and the other won't. We don't use multiple poppers, but for the push_bounded case, it could result in pushing too many elements on the queue. To guard against that, we would need to keep the loops, which ends up not simplifying at all.

In general, it probably doesn't matter, but I would prefer to keep the current semantics with the loop that "retries" after the thread is awakened.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I'm not sure I follow, because if the closure returns true then that lock is persisted and returned, so we can't have two poppers simultaneously exit the wait timeout loop I believe? I think this is the same for the push case as well, where when we get a lock back after wait_until we're guaranteed that the condition evaluates true for the lock state we were returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Somehow it didn't click that it was atomically locked.

Pushed a commit with the change. Since it is unstable until 1.42, it will need to wait until Thursday.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh oops sorry about that, I totally didn't realize it was still unsable... In any case at least Thursday isn't that far off!

@alexcrichton
Copy link
Member

@bors: r+

@bors
Copy link
Contributor

bors commented Mar 12, 2020

📌 Commit 05a1f43 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 12, 2020
@bors
Copy link
Contributor

bors commented Mar 12, 2020

⌛ Testing commit 05a1f43 with merge 2bdc879...

@bors
Copy link
Contributor

bors commented Mar 12, 2020

☀️ Test successful - checks-azure
Approved by: alexcrichton
Pushing 2bdc879 to master...

@bors bors merged commit 2bdc879 into rust-lang:master Mar 12, 2020
bors added a commit to rust-lang/rust that referenced this pull request Mar 18, 2020
Update cargo

Update cargo

21 commits in bda50510d1daf6e9c53ad6ccf603da6e0fa8103f..7019b3ed3d539db7429d10a343b69be8c426b576
2020-03-02 18:05:34 +0000 to 2020-03-17 21:02:00 +0000
- Run through clippy (rust-lang/cargo#8015)
- Fix config profiles using "dev" in `cargo test`. (rust-lang/cargo#8012)
- Run CI on all PRs. (rust-lang/cargo#8011)
- Add unit-graph JSON output. (rust-lang/cargo#7977)
- Split workspace/validate() into multiple functions (rust-lang/cargo#8008)
- Use Option::as_deref (rust-lang/cargo#8005)
- De-duplicate edges (rust-lang/cargo#7993)
- Revert "Disable preserving mtimes on archives" (rust-lang/cargo#7935)
- Close the front door for clippy but open the back (rust-lang/cargo#7533)
- Fix CHANGELOG.md typos (rust-lang/cargo#7999)
- Update changelog note about crate-versions flag. (rust-lang/cargo#7998)
- Bump to 0.45.0, update changelog (rust-lang/cargo#7997)
- Bump libgit2 dependencies (rust-lang/cargo#7996)
- Avoid buffering large amounts of rustc output. (rust-lang/cargo#7838)
- Add "Updating" status for git submodules. (rust-lang/cargo#7989)
- WorkspaceResolve: Use descriptive lifetime label. (rust-lang/cargo#7990)
- Support old html anchors in manifest chapter. (rust-lang/cargo#7983)
- Don't create hardlink for library test and integrations tests, fixing rust-lang/cargo#7960 (rust-lang/cargo#7965)
- Partially revert change to filter debug_assertions. (rust-lang/cargo#7970)
- Try to better handle restricted crate names. (rust-lang/cargo#7959)
- Fix bug with new feature resolver and required-features. (rust-lang/cargo#7962)
@ehuss ehuss added this to the 1.44.0 milestone Feb 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cargo keeps entire rustc/rustdoc stdout in memory
6 participants