Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Silently fails to write GCS cache #1384

Closed
GMNGeoffrey opened this issue Nov 8, 2022 · 6 comments · Fixed by #1519
Closed

Silently fails to write GCS cache #1384

GMNGeoffrey opened this issue Nov 8, 2022 · 6 comments · Fixed by #1519

Comments

@GMNGeoffrey
Copy link
Contributor

If sccache is unable to authenticate to GCS it just silently proceeds without authentication. If it gets a "unauthorized" response when writing, it proceeds. Instead, I think several of these things should be an error, with the option to configure it to bypass the errors. I found #1180, which I think covers some of these issues, but there are others that should

Here are some logs demonstrating the issue: https://gist.github.com/GMNGeoffrey/b0f7453349b8211b3af355bc9a9765f9

[2022-11-08T22:53:15Z DEBUG sccache::cache::cache] Trying GCS bucket(iree-github-actions-presubmit-artifacts, gcmn-test-sccache, Some("/usr/local/google/home/gcmn/.config/gcloud/application_default_credentials.json"), None, None, ReadWrite)
[2022-11-08T22:53:15Z WARN  sccache::cache::cache] Failed to parse service account credentials from file: missing field `private_key` at line 6 column 1. Continuing without authentication.

This should be an error. I think just changing

"Failed to parse service account credentials from file: {:?}. \
here.

[2022-11-08T22:53:16Z DEBUG reqwest::async_impl::client] response '404 Not Found' for https://www.googleapis.com/download/storage/v1/b/iree-github-actions-presubmit-artifacts/o/gcmn-test-sccache%2F03e462fba45b71c78c638ecafbbf2a1e233ab6b79214350310269d3593f26987?alt=media
[2022-11-08T22:53:16Z WARN  sccache::cache::gcs] Got GCS error: didn't get a successful HTTP status, got `404 Not Found`

This should be DEBUG. A 404 is just a cache miss. #1180 covers this.

[2022-11-08T22:56:53Z DEBUG sccache::cache::cache] Trying GCS bucket(iree-github-actions-presubmit-artifacts, gcmn-test-sccache, None, None, None, ReadWrite)
[2022-11-08T22:56:53Z WARN  sccache::cache::cache] No SCCACHE_GCS_KEY_PATH specified-- no authentication will be used.

Should this be an error? No authentication with ReadWrite, seems exceptionally unlikely to work... I wonder if any cloud bucket is configured for unauthenticated write access

[2022-11-08T22:56:53Z DEBUG reqwest::async_impl::client] response '401 Unauthorized' for https://www.googleapis.com/upload/storage/v1/b/iree-github-actions-presubmit-artifacts/o?name=gcmn-test-sccache/03e462fba45b71c78c638ecafbbf2a1e233ab6b79214350310269d3593f26987&uploadType=media
[2022-11-08T22:56:53Z DEBUG sccache::compiler::compiler] [Demangle.cpp.o]: Cache write error: failed to put cache entry in GCS

This should be an error.

I am happy to send patches for these issues.

@drahnr
Copy link
Collaborator

drahnr commented Nov 9, 2022

I think failing to write itself should not be an error, there might be intermittent sporadic issues that should not cause a build to abort. Failing to authorize is a different ballpark though, and if happening repeatedly should be an error. PRs welcome, but it should be somewhat generalizable. Not sure on the heuristic to use for aborting.

Note that this is a breaking change (changes user observable behavior).

@GMNGeoffrey
Copy link
Contributor Author

Yeah I didn't mean to say that every write failure should be an error, just that entirely failing to ever write because it can't authenticate should be. I think I'd always abort on an unauthorized response. It might also be good to keep track of other errors and somehow detect high error rates, but that's a much more nuanced and large undertaking. So, I'd propose:

  1. Always abort on 401 or 403
  2. Abort when failing to parse auth
  3. Abort when put in ReadWrite mode with no auth, maybe with some configuration to override this (similar to how the default is to abort if the client can't communicate with the local server).
  4. Maybe try to do a write when the server is first configured? This is a bit tricky, but failing fast on not having access seems useful.

WDYT?

@sylvestre
Copy link
Collaborator

sounds good. anyway, it doesn't have to be perfect at first!

@GMNGeoffrey
Copy link
Contributor Author

Sounds good. I'll figure out how to build Rust and take a look at contributing :-)

@GMNGeoffrey
Copy link
Contributor Author

It actually looks like this issue is more widespread. Failing to create an Azure cache is similarly a warning:

Err(e) => warn!("Failed to create Azure cache: {:?}", e),

The cache creation function doesn't appear to be configured to return an error. It can log an error, but that doesn't actually halt the server from starting. Should that return a Result instead? (Sorry, I've never used Rust before)

@Xuanwo
Copy link
Collaborator

Xuanwo commented Dec 13, 2022

Here are some notes from OpenDAL's side (since s3/azure/gcs have been migrated to opendal).

Service that has been migrated to opendal will return opendal::Error. To address this issue, we can check the kind returned by error like the following:

match err.kind() {}
    ErrorKind::ObjectNotFound => ...,
    _ => ...
}

Writing a health_check file into the cache also seems a good idea to check if our config works as expected. But we need to take read-only cases into consideration.

By the way, opendal provides a LoggingLayer which can print informative logs for every operation.

use anyhow::Result;
use opendal::layers::LoggingLayer;
use opendal::Operator;
use opendal::Scheme;

let _ = Operator::from_env(Scheme::Fs)
    .expect("must init")
    .layer(LoggingLayer::default());

Logs will be like the following:

[2022-12-13T06:29:52Z WARN  opendal::services] service=s3 operation=stat path=x/x/y -> errored: ObjectNotFound (permanent) at stat
    
    Context:
        response: Parts { status: 404, version: HTTP/1.1, headers: {"x-amz-request-id": "HRX2A7QGVQ7X6EXY", "x-amz-id-2": "y7DKQ9PhRq3jFyWlH6/oReVQXL0bO+HAyicycw1H6qqZHzW/TAb2yHqeTwKXAxA4B+azRJ2mgKE=", "content-type": "applicati***/xml", "date": "Tue, 13 Dec 2022 06:29:52 GMT", "server": "AmazonS3"} }
        service: s3
        path: x/x/y

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants