
Only deduplicate the currently uploading task, rather than all previously uploaded tasks, when using unified uploads #531

Open
Ruoye-W opened this issue Feb 4, 2024 · 3 comments


Ruoye-W commented Feb 4, 2024

The current unified upload implementation assumes that the default cache cluster always contains previously uploaded files. Even if a subsequent compilation task calls FindMissingBlobs and discovers that a file is missing, the uploader still returns the previous upload result recorded in the global casUploaders cache instead of actually performing the upload. However, when the cache service in a Kubernetes pod crashes and restarts, that pod's local cache is cleared and the assumption no longer holds. In such cases, we should deduplicate only the tasks that are currently uploading, rather than all previously uploaded tasks.

@Ruoye-W Ruoye-W changed the title Only deduplicate currently uploading task, not all uploaded task when upload unified Only deduplicate currently uploading task, rather than all uploaded task when upload unified Feb 4, 2024
mrahs (Collaborator) commented Feb 22, 2024

I don't quite understand the issue you're describing. Can you clarify the relationship between the SDK and the k8s pod in your setup? Is the SDK running inside the pod or is the pod running a CAS service that the SDK is connected to?

Ruoye-W (Author) commented Feb 26, 2024

> I don't quite understand the issue you're describing. Can you clarify the relationship between the SDK and the k8s pod in your setup? Is the SDK running inside the pod or is the pod running a CAS service that the SDK is connected to?

Yes! The SDK connects to a bazel-remote local CAS as its cache service, which is deployed via Kubernetes. We've noticed that this cache service sometimes restarts during a build, and the restart wipes its local disk cache. After bazel-remote restarts, an RBE remote compilation task calls FindMissingBlobs and gets the list of missing files, but before the unified upload it checks the global casUploaders cache, sees the file recorded as already uploaded, and skips the upload. The remote compilation cluster then fails the task with an error saying the files are missing from bazel-remote's CAS.

mrahs (Collaborator) commented Mar 4, 2024

Here is what I understood so far: You have a cache server that clears its state when it restarts. Using unified uploads, it's possible for FindMissingBlobs to hit the service right before it crashes and then the Read call to hit the service after it has reset itself and cleared its state, which causes the executor to get a cache miss when asking for inputs.

I'm not sure what the SDK can do in this case. This failure mode is inherent in the system design: two consecutive calls are not guaranteed to see the same state from the same service. I don't think REAPI can work around such a limitation, since it assumes the CAS is stable long enough for two clients (the build host and the worker host) to see the same state.

Perhaps configuring the bazel-remote pod with persistent storage would be the best approach.
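As a rough illustration of that suggestion (an untested sketch, not a recommended configuration: the claim name, storage size, and image tag are placeholders), the bazel-remote pod could mount a PersistentVolumeClaim at its cache directory so the disk cache survives restarts:

```yaml
# Hypothetical example: persist bazel-remote's disk cache across pod restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bazel-remote-cache    # placeholder name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi          # placeholder size
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bazel-remote
spec:
  replicas: 1
  selector:
    matchLabels: {app: bazel-remote}
  template:
    metadata:
      labels: {app: bazel-remote}
    spec:
      containers:
        - name: bazel-remote
          image: buchgr/bazel-remote-cache      # upstream image; pin a tag
          # --dir and --max_size (GiB) are bazel-remote's cache flags.
          args: ["--dir=/data", "--max_size=90"]
          volumeMounts:
            - name: cache
              mountPath: /data
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: bazel-remote-cache
```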
