
Add test and fix for disappearing dependency issue #3627 #3669

Merged
merged 7 commits into master from test-disappear on Jul 6, 2021

Conversation

@epicfaace (Member) commented Jul 4, 2021

Adds a test that replicates the disappearing dependency issue #3627, along with a fix for it.

The issue appears to be that when a bundle with a dependency finishes, the worker deletes that dependency out from under another still-running bundle that depends on the same dependency.

Here's how I found the issue:

I tested things out on this worksheet on dev: https://worksheets-dev.codalab.org/worksheets/0x5dd52af376db46f4bb57d57efe5a78d9

I noticed a pattern: the time at which 0xac0230 fails is right after 0x72c5c1 completes.

[screenshots]

In fact, the logs say that 0x72c5c1 changed to "finished" at 20:46:33.

2021-07-04 20:46:33,841 Finished run with UUID 0x72c5c18299804510b276262acd95cc0a, exitcode 0, failure_message None /opt/codalab/worker/worker_run_state.py 562

Just 3 seconds later is the failure time of bundle 0xac0230 (20:46:36 = started 20:37:40 + ran 8:56), which is suspiciously close to 20:46:33.

A similar pattern occurs for the last 3 bundles. Note how the first 2 bundles (which just check for the dependency directory's existence, from 1 to 10000 seconds) work fine until they fail right when the third bundle finishes after 100 seconds:

[screenshot]

Looking at the logic that runs when a bundle transitions from cleaning up -> finished, note that dependencies are deleted in this section of code:

# Note: self.paths_to_remove is an attribute of the worker's state machine,
# shared across every run on the worker -- not per-bundle state.
for path in self.paths_to_remove:
    remove_path_no_fail(path)
self.paths_to_remove = []

This is indeed the problematic code, and I've fixed the issue in this PR. The code was introduced by @teetone in #2295. @nelson-liu did identify that PR as the cause of the issue in #2440 (comment), but I think we missed that and never investigated that issue further.
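
To make the failure mode concrete, here is a minimal sketch of the bug (hypothetical class, method names, and paths — not the actual worker code): a single cleanup list on the worker's state machine is shared by every run, so one bundle's cleanup deletes another bundle's dependency directory.

```python
import os
import tempfile

class RunStateMachine:
    def __init__(self):
        # One list shared by every run on this worker -- this is the bug.
        self.paths_to_remove = []

    def mount_dependency(self, path):
        os.makedirs(path, exist_ok=True)
        self.paths_to_remove.append(path)

    def cleanup_run(self):
        # Removes every queued path, including paths queued by OTHER bundles.
        for path in self.paths_to_remove:
            if os.path.exists(path):
                os.rmdir(path)
        self.paths_to_remove = []

machine = RunStateMachine()
root = tempfile.mkdtemp()
machine.mount_dependency(os.path.join(root, "bundle_a_dep"))
machine.mount_dependency(os.path.join(root, "bundle_b_dep"))

machine.cleanup_run()  # bundle A finishes and cleans up...
# ...and bundle B's dependency directory is gone too:
print(os.path.exists(os.path.join(root, "bundle_b_dep")))  # False
```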

@epicfaace changed the title from "Add test for disappearing dependency issue https://github.com/codalab/codalab-worksheets/issues/3627" to "Add test for disappearing dependency issue #3627" on Jul 4, 2021
@epicfaace changed the title from "Add test for disappearing dependency issue #3627" to "Add test and fix disappearing dependency issue #3627" on Jul 4, 2021
@epicfaace changed the title from "Add test and fix disappearing dependency issue #3627" to "Add test and temporary fix for disappearing dependency issue #3627" on Jul 4, 2021
@epicfaace (Member, Author) commented Jul 4, 2021

Tested the fix on dev and it's working (if the issue were still present, bundles 1 and 2 would move to "failed" as soon as bundle 3 moved to "ready"):

[screenshot]

I'm running the original train.py bundles now to ensure they're not failing on dev anymore.

@epicfaace changed the title from "Add test and temporary fix for disappearing dependency issue #3627" to "Add test and fix for disappearing dependency issue #3627" on Jul 5, 2021
@epicfaace (Member, Author) commented Jul 5, 2021

@teetone The fix is that paths_to_remove needs to be stored on a per-bundle basis; otherwise, when a worker finishes bundle A, it can end up deleting the dependency directories of bundle B running on the same worker. This PR moves paths_to_remove into the bundle's run state so that it is actually tracked per bundle.
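
Roughly, the fix looks like this (a simplified sketch: the RunState fields besides paths_to_remove, and the UUIDs and paths, are illustrative — the real RunState has many more fields). Each bundle's immutable run state carries its own paths_to_remove, and queuing a path produces a new state via _replace:

```python
from collections import namedtuple

# Simplified stand-in for the worker's RunState namedtuple.
RunState = namedtuple("RunState", ["bundle_uuid", "stage", "paths_to_remove"])

def queue_path_to_remove(run_state, child_path):
    # _replace returns a new RunState; the 'or []' tolerates run states
    # serialized before the paths_to_remove field existed.
    return run_state._replace(
        paths_to_remove=(run_state.paths_to_remove or []) + [child_path]
    )

state_a = RunState("0xaaaa", "RUNNING", None)
state_b = RunState("0xbbbb", "RUNNING", None)
state_a = queue_path_to_remove(state_a, "/runs/0xaaaa/dir3")
state_b = queue_path_to_remove(state_b, "/runs/0xbbbb/dir3")

# Cleaning up bundle A now only ever sees A's own paths:
print(state_a.paths_to_remove)  # ['/runs/0xaaaa/dir3']
print(state_b.paths_to_remove)  # ['/runs/0xbbbb/dir3']
```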

Debugging notes:

slow: 0x590f67a320d24318ab65cdeaa773a460
fast: 0x1527b99d7a474dcc88a1593e9e29d145

Note in the logs below that when the fast bundle finishes first, its cleanup drains the shared list and removes both bundles' dir3 paths; by the time the slow bundle cleans up, it sees PATHS TO REMOVE: [].

2021-07-05 20:21:53,159 Received run message:
2021-07-05 20:21:58,223 ADDING PATHS TO REMOVE 3: /home/ubuntu/environment/codalab/codalab-worksheets/var/codalab/worker/runs/0x590f67a320d24318ab65cdeaa773a460/dir3 /opt/codalab/worker/worker_run_state.py 377
2021-07-05 20:22:07,883 ADDING PATHS TO REMOVE 3: /home/ubuntu/environment/codalab/codalab-worksheets/var/codalab/worker/runs/0x1527b99d7a474dcc88a1593e9e29d145/dir3 /opt/codalab/worker/worker_run_state.py 377
2021-07-05 20:22:28,982 Finished run with UUID 0x1527b99d7a474dcc88a1593e9e29d145, exitcode 0, failure_message None /opt/codalab/worker/worker_run_state.py 565
2021-07-05 20:22:39,801 PATHS TO REMOVE: ['/home/ubuntu/environment/codalab/codalab-worksheets/var/codalab/worker/runs/0x590f67a320d24318ab65cdeaa773a460/dir3', '/home/ubuntu/environment/codalab/codalab-worksheets/var/codalab/worker/runs/0x1527b99d7a474dcc88a1593e9e29d145/dir3'] /opt/codalab/worker/worker_run_state.py 618
2021-07-05 20:22:39,801 Bundle 0x1527b99d7a474dcc88a1593e9e29d145 is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.UPLOADING_RESULTS /opt/codalab/worker/worker_run_state.py 141
2021-07-05 20:22:47,881 Finished run with UUID 0x590f67a320d24318ab65cdeaa773a460, exitcode 2, failure_message None /opt/codalab/worker/worker_run_state.py 565
2021-07-05 20:22:56,998 PATHS TO REMOVE: [] /opt/codalab/worker/worker_run_state.py 618
2021-07-05 20:22:56,998 Bundle 0x590f67a320d24318ab65cdeaa773a460 is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.UPLOADING_RESULTS /opt/codalab/worker/worker_run_state.py 141

@@ -354,7 +354,9 @@ def mount_dependency(dependency, shared_file_system):
                    parent_path=os.path.join(dependency_path, child),
                )
            )
-           self.paths_to_remove.append(child_path)
+           run_state = run_state._replace(
+               paths_to_remove=(run_state.paths_to_remove or []) + [child_path]
+           )
@epicfaace (Member, Author) commented on the diff:
Included the "or []" fallback so we don't break existing runs on prod (which may not have paths_to_remove defined yet).
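
The guard matters because adding a list to None raises. A quick illustration (with made-up values):

```python
paths_to_remove = None  # e.g. deserialized from a pre-upgrade run state

# paths_to_remove + ["/runs/0xaaaa/dir3"]  # would raise TypeError
paths_to_remove = (paths_to_remove or []) + ["/runs/0xaaaa/dir3"]
print(paths_to_remove)  # ['/runs/0xaaaa/dir3']
```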

@adiprerepa (Member) commented:
So if I'm understanding correctly, run_state stores the per-bundle state?

Generally LGTM.

@teetone (Collaborator) commented Jul 6, 2021

@epicfaace Awesome, thanks for figuring this out. Does this happen if a worker is running multiple bundles at once?

@epicfaace (Member, Author) commented:

> Does this happen if a worker is running multiple bundles at once?

Yes, exactly.

@mergify[bot] merged commit 3cc485e into master on Jul 6, 2021
@mergify[bot] deleted the test-disappear branch on Jul 6, 2021 at 12:37
@epicfaace mentioned this pull request on Jul 6, 2021
@epicfaace mentioned this pull request on Aug 25, 2021
Successfully merging this pull request may close these issues:

run bundles fail inconsistently on a file or directory not found error (#3627)