Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the dynamic spawn strategy is broken for remote execution #8646

Closed
buchgr opened this issue Jun 17, 2019 · 1 comment
Closed

the dynamic spawn strategy is broken for remote execution #8646

buchgr opened this issue Jun 17, 2019 · 1 comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged

Comments

@buchgr
Copy link
Contributor

buchgr commented Jun 17, 2019

We have received several bug reports about the dynamic spawn strategy not working properly with remote execution / caching. I took some time to look through the code to better understand the issue. I believe to now understand the heart of the issue.

The dynamic spawn strategy in a single line:

SpawnResult result =
ExecutorService.invokeAny(localExecutionCallable, remoteExecutionCallable)`

This line takes the result of the first callable to finish and calls Future.cancel(mayInterruptIfRunning=true) on the slower one. The bug is that the Callable's have side effects in that both write their outputs to the file system and incidentally also to the same files.

This has two issues:

  1. The effective output files of a single action can end up being a mix of both local and remote execution. This would mostly go unnoticed (esp. if the actions are reproducible) except when it doesn't then it would lead to hard to debug correctness bugs.
  2. The output files Bazel sees can be corrupted. This can be the case if both the remote strategy and local execution strategy write to the same file concurrently.

I believe 2.) is what users are most likely to see.

@buchgr
Copy link
Contributor Author

buchgr commented Jun 17, 2019

The fix is to call SpawnExecutionContext.lockOutputfiles() before starting downloading of outputs.

buchgr added a commit to buchgr/bazel that referenced this issue Jun 17, 2019
This change fixes the correctness issue of dynamic spawn scheduler when
being used with remote execution. See bazelbuild#8646 for more details.
buchgr added a commit to buchgr/bazel that referenced this issue Jun 17, 2019
This change fixes the correctness issue of dynamic spawn scheduler when
being used with remote execution. See bazelbuild#8646 for more details.

There's a performance issue remaining: bazelbuild#8467
@jin jin added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Jun 17, 2019
buchgr added a commit to buchgr/bazel that referenced this issue Jun 18, 2019
This change fixes the correctness issue of dynamic spawn scheduler when
being used with remote execution. See bazelbuild#8646 for more details.

There's a performance issue remaining: bazelbuild#8467
siberex pushed a commit to siberex/bazel that referenced this issue Jul 4, 2019
This change fixes the correctness issue of dynamic spawn scheduler when being used with remote execution. See bazelbuild#8646 for more details.

There's a performance issue remaining: bazelbuild#8647

Closes bazelbuild#8648.

PiperOrigin-RevId: 253998300
irengrig pushed a commit to irengrig/bazel that referenced this issue Jul 15, 2019
This change fixes the correctness issue of dynamic spawn scheduler when being used with remote execution. See bazelbuild#8646 for more details.

There's a performance issue remaining: bazelbuild#8647

Closes bazelbuild#8648.

PiperOrigin-RevId: 253998300
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged
Projects
None yet
Development

No branches or pull requests

2 participants