
Add timeouts to cache lookups and handle server eofs #79

Merged
alexcrichton merged 3 commits into mozilla:master from alexcrichton:handle-eof on Mar 25, 2017

Conversation

alexcrichton
Contributor

We've unfortunately been having a lot of trouble over at rust-lang/rust#40240 with sccache causing spurious failures. A number of logs are linked from that issue, and to try to diagnose them I've enabled sccache logging, which we cat to the Travis logs on failure.

Today we got a new failure whose failing build pointed to:

error: failed to execute compile
caused by: error reading compile response from server
caused by: IoError(Error { repr: Custom(Custom { kind: Other, error: StringError("unexpected EOF") }) })
caused by: unexpected EOF

Upon inspecting the sccache server debug logs (at the end of the build log) I noticed that the failing object was ELFDumper.cpp.o, that we attempted to fetch this object from the cache (I think), and that no other messages about ELFDumper.cpp.o (success or otherwise) appeared on the server. My theory of what happened is:

  • We sent off an S3 request
  • This S3 request then takes forever, presumably an "infinite" amount of time
  • The sccache server decided to shut down due to being idle
  • sccache waited 10 seconds for the S3 connection to finish
  • It didn't, so we shut down the whole server, closing active connections
  • The client then receives EOF unexpectedly, causing the error

To handle this failure mode, this PR makes a few modifications:

  • The client now handles UnexpectedEof the same way it handles "unhandled compilation": it just compiles the object locally. This should protect against this failure mode in the future, as well as other spurious races like it, which are generally benign.
  • I've added logging to server shutdown, just so we know when it happens.
  • I've added a timeout to cache lookups (currently 60s), which should help avoid waiting on S3 for too long (a rough sketch of the idea follows this list).
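
As a rough illustration of that last point, here is a minimal sketch using the modern tokio timeout helper rather than whatever async machinery the PR actually built on; the `Lookup` enum and `lookup_with_timeout` function are illustrative names only, while `MissType::TimedOut` is how the real change reports the case to the server (see the diff further down).

// Sketch only: race the cache lookup against a 60s timer and treat expiry
// as a miss so the caller can forge ahead with a local compile.
use std::future::Future;
use std::time::Duration;

use tokio::time::timeout;

enum Lookup {
    Hit(Vec<u8>),
    Miss,
    TimedOut,
}

async fn lookup_with_timeout<F>(lookup: F) -> Lookup
where
    F: Future<Output = Option<Vec<u8>>>,
{
    // Wait at most a minute for the cache (e.g. S3) to respond; if it
    // doesn't, report a timed-out miss instead of waiting indefinitely.
    match timeout(Duration::from_secs(60), lookup).await {
        Ok(Some(bytes)) => Lookup::Hit(bytes),
        Ok(None) => Lookup::Miss,
        Err(_elapsed) => Lookup::TimedOut,
    }
}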


arielb1 commented Mar 13, 2017

Could you write something to stderr/somewhere else visible on buildbot when we get unexpected EOF errors, so we can have some monitoring over them?

@alexcrichton
Contributor Author

@arielb1 yeah, unfortunately sccache is architecturally a bit difficult to debug since it primarily runs as a daemon (without stderr/stdout). Note, though, that the daemon supports redirecting stderr, which we're already leveraging and printing to the Travis logs.

(those logs are how I discovered this issue)

alexcrichton added a commit to alexcrichton/rust that referenced this pull request Mar 16, 2017
I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's
see if that helps rust-lang#40240!
bors added a commit to rust-lang/rust that referenced this pull request Mar 16, 2017
travis: Update sccache binary

I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's
see if that helps #40240!
bors added a commit to rust-lang/rust that referenced this pull request Mar 17, 2017
travis: Update sccache binary

I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's
see if that helps #40240!
@alexcrichton
Contributor Author

@luser FWIW I've deployed this to rust-lang/rust and it seems to have at least erased that particular failure mode!

Contributor

luser commented Mar 22, 2017

@alexcrichton That's great to hear! I have been trying to get my Rust PR landable; I'll finish that off so we can get everything else rebased and landed.

src/commands.rs Outdated
// get a response then we just forge ahead locally and run the
// compilation ourselves.
Err(ProtobufError::IoError(ref e))
if e.kind() == io::ErrorKind::UnexpectedEof => {}

Can you put a log message here so it's easy to figure out when this happens? Actually, I think even straight printing something to stderr would be OK in this case; it feels like a bug if we hit this in practice.
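
For concreteness, a sketch of what this suggestion might look like (illustrative only, not necessarily the change that landed; assumes `std::io::Write` is in scope so `writeln!` can target stderr):

// Sketch: make the fallback visible rather than silently compiling locally.
Err(ProtobufError::IoError(ref e))
    if e.kind() == io::ErrorKind::UnexpectedEof => {
    // The server hung up on us (e.g. it shut down while idle); note it on
    // stderr, then fall through and run the compilation ourselves.
    writeln!(io::stderr(),
             "warning: unexpected EOF from sccache server; compiling locally")
        .unwrap();
}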

@luser left a comment

This looks fine; just a few little things that would be good. Feel free to rebase and land this.

@@ -355,6 +360,20 @@ impl Compiler {
storage.get(&key)
};

// Wait at most a minute for the cache to respond before we forge
// ahead ourselves with a compilation.
let timeout = Duration::new(60, 0);

As a followup, it might be nice to have a way to configure this so we can tweak it in practice.
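
One way that follow-up could look, as a hypothetical sketch (the environment variable name below is illustrative, not an actual sccache setting introduced by this PR): read an override from the environment and fall back to the 60-second default.

use std::env;
use std::time::Duration;

// Hypothetical knob for the cache-lookup timeout; falls back to the PR's
// 60-second default when unset or unparseable.
fn cache_lookup_timeout() -> Duration {
    let secs = env::var("SCCACHE_CACHE_LOOKUP_TIMEOUT")
        .ok()
        .and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(60);
    Duration::from_secs(secs)
}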

@@ -575,6 +576,9 @@ impl<C> SccacheService<C>
stats.cache_misses += 1;
stats.forced_recaches += 1;
}
MissType::TimedOut => {
stats.cache_misses += 1;

Might be good to add a stat for this as well so we can track it?
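
A sketch of that suggestion (the `cache_timeouts` field is hypothetical; whether a dedicated counter was actually added is not shown in this diff):

MissType::TimedOut => {
    stats.cache_misses += 1;
    // Hypothetical dedicated counter so timed-out lookups show up in the
    // server's stats output, not just as generic misses.
    stats.cache_timeouts += 1;
}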

I believe this is possible to hit if the compilation takes quite a long time and
times out the server itself. The server may shut down due to being idle while
we're still connected and cause us to receive an `UnexpectedEof`.
This should help ensure that we don't wait *too* long for the cache to respond
(for example on an excessively slow network) and time out the server
unnecessarily.
@alexcrichton alexcrichton merged commit 9a3447f into mozilla:master Mar 25, 2017
@alexcrichton alexcrichton deleted the handle-eof branch March 25, 2017 01:36