Remote Downloader: No retry after receiving a CacheNotFoundException #377

Closed

mmikitka opened this issue Nov 23, 2020 · 11 comments

@mmikitka
Contributor

Note: I originally submitted this ticket to bazelbuild/bazel and was told to redirect it here. See bazelbuild/bazel#12417 for details.

Description of the problem / feature request:

Using --experimental_remote_downloader with a https://github.com/buchgr/bazel-remote remote cache, the build failed with the following error upon encountering a CacheNotFoundException.

I expected Bazel to fall back to fetching the artifact itself in the event of a cache miss.

INFO: Repository build_bazel_rules_nodejs instantiated at:
  no stack (--record_rule_instantiation_callstack not enabled)
Repository rule http_archive defined at:
  /home/gitlab/.cache/bazel/_bazel_gitlab/856f582f3053cc7df99ab29b906d93f4/external/bazel_tools/tools/build_defs/repo/http.bzl:336:31: in <toplevel>
ERROR: An error occurred during the fetch of repository 'build_bazel_rules_nodejs':
   com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 0f2de53628e848c1691e5729b515022f5a77369c76a09fbe55611e12731c90e3/1000820
ERROR: no such package '@build_bazel_rules_nodejs//': com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 0f2de53628e848c1691e5729b515022f5a77369c76a09fbe55611e12731c90e3/1000820
INFO: Elapsed time: 35.390s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
ERROR: Build failed. Not running target
FAILED: Build did NOT complete successfully (0 packages loaded)
ERROR: Job failed: command terminated with exit code 1

Feature requests: what underlying problem are you trying to solve with this feature?

More reliable builds.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unsure.

What operating system are you running Bazel on?

Linux Ubuntu 18.04

What's the output of bazel info release?

release 3.4.1

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

N/A

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

N/A

Have you found anything relevant by searching the web?

No

Any other information, logs, or outputs that you want to share?

Here are all the remote-related configs:

common:ci-remote-cache --remote_cache=grpc://bazel-remote-cache.bazel-remote-cache.svc.cluster.local:9092
common:ci-remote-cache --experimental_remote_downloader=grpc://bazel-remote-cache.bazel-remote-cache.svc.cluster.local:9092
common:ci-remote-cache --remote_local_fallback
common:ci-remote-cache --remote_max_connections=50
common:ci-remote-cache --remote_retries=1
common:ci-remote-cache --remote_timeout=60
common:ci-remote-cache --experimental_guard_against_concurrent_changes
@mostynb
Collaborator

mostynb commented Nov 23, 2020

Thanks for the bug report - could you please share the function/rule (not sure of the terminology here) for this item in your WORKSPACE file? Is it an http_archive, or something else?

It's strange that bazel doesn't attempt to download this resource itself after getting a cache miss - I suspect that is something that should be fixed on the bazel side.

@dhalperi

FYI, I was having this problem a ton but then I upgraded to a version manually built after #318 and it has not yet recurred.

@dhalperi

(It would be nice if the docker containers were pushed)

@mostynb
Collaborator

mostynb commented Nov 23, 2020

The bazel log mentions 0f2de53628e848c1691e5729b515022f5a77369c76a09fbe55611e12731c90e3, so I dug around and found this WORKSPACE entry from the rules_nodejs 2.0.1 release:

http_archive(
    name = "build_bazel_rules_nodejs",
    sha256 = "0f2de53628e848c1691e5729b515022f5a77369c76a09fbe55611e12731c90e3",
    urls = ["https://github.com/bazelbuild/rules_nodejs/releases/download/2.0.1/rules_nodejs-2.0.1.tar.gz"],
)

If I add that to my WORKSPACE, then run bazel-remote (from the tip of the master branch) like so:
./bazel-remote --dir testcache --max_size=10 --experimental_remote_asset_api

And build a bazel project with --remote_cache=grpc://127.0.0.1:9092 --experimental_remote_downloader=grpc://127.0.0.1:9092

Then I can see the following in the bazel-remote logs:

2020/11/23 19:03:08 GRPC ASSET FETCH https://github.com/bazelbuild/rules_nodejs/releases/download/2.0.1/rules_nodejs-2.0.1.tar.gz 200 OK
2020/11/23 19:03:09 GRPC BYTESTREAM READ COMPLETED blobs/0f2de53628e848c1691e5729b515022f5a77369c76a09fbe55611e12731c90e3/1000820

Which seems to be working as expected.
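
For reference, a minimal .bazelrc sketch equivalent to the flags above (the 127.0.0.1:9092 endpoint is just the locally running bazel-remote from this repro; replace it with your own cache's address):

# endpoint matches the local repro above; adjust for your setup
common --remote_cache=grpc://127.0.0.1:9092
common --experimental_remote_downloader=grpc://127.0.0.1:9092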

@dhalperi

For me it was nondeterministic and seemed to depend on how many concurrent bazel workers there were. Worked most of the time, so I would expect a simple test to not have any issues.

@mostynb
Collaborator

mostynb commented Nov 23, 2020

For me it was nondeterministic and seemed to depend on how many concurrent bazel workers there were. Worked most of the time, so I would expect a simple test to not have any issues.

@dhalperi: that definitely sounds like something #323 would have fixed.

I forgot that I hadn't made a new release since that landed. Tagged the tip of master as v1.3.0, and I will try to get the image on dockerhub updated (it might take a few days). You can track the progress in #378.

@mostynb
Collaborator

mostynb commented Nov 23, 2020

@mmikitka: could you try the v1.3.0 release (you may need to build it locally) and report back if this is still a problem?
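
For building it locally, a rough sketch of the steps, assuming a checkout of the v1.3.0 tag and the repository's usual Bazel target (see the bazel-remote README for the exact instructions):

# target name is an assumption; check the bazel-remote README for exact build steps
git clone https://github.com/buchgr/bazel-remote.git
cd bazel-remote
git checkout v1.3.0
bazel build :bazel-remote   # the resulting binary is written under bazel-bin/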

@mmikitka
Contributor Author

Thank you for the feedback @mostynb and @dhalperi. I just upgraded to v1.3.0 (manually built) and will notify you of any recurrence.

Note that, like @dhalperi, I did not see the error consistently, so it could be load-related, as suggested.

@mostynb
Collaborator

mostynb commented Nov 24, 2020

The Docker Hub image has been updated.

I'm not sure how frequently you were seeing this problem, but let's wait a week or so before closing this issue.
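
For anyone pulling the updated image, a hypothetical invocation, assuming the buchgr/bazel-remote-cache image name on Docker Hub, a /data cache mount, and the same gRPC port (9092) and flags used earlier in this issue:

# image name, mount point, and port are assumptions; see the bazel-remote README
docker pull buchgr/bazel-remote-cache
docker run -v /path/to/cache:/data -p 9092:9092 \
    buchgr/bazel-remote-cache --max_size=10 --experimental_remote_asset_api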

@dhalperi

dhalperi commented Dec 2, 2020

I've had no issues along these lines for almost 3 weeks, after upgrading buchgr/bazel-remote past the version with the concurrency fixes.

@mostynb
Collaborator

mostynb commented Dec 2, 2020

Thanks for following up.

mostynb closed this as completed Dec 2, 2020