Add test and fix for disappearing dependency issue #3627 #3669
Tested the fix on dev and it's working (if the issue were still there, bundles 1 and 2 would have moved to "failed" as soon as bundle 3 moved to "ready"). I'm now running the original train.py bundles to make sure they no longer fail on dev.
@teetone The fix is that debugging notes
@@ -354,7 +354,9 @@ def mount_dependency(dependency, shared_file_system):
parent_path=os.path.join(dependency_path, child),
)
)
self.paths_to_remove.append(child_path)
run_state = run_state._replace(
paths_to_remove=(run_state.paths_to_remove or []) + [child_path]
Included or [] so we don't break existing runs on prod (which may not have paths_to_remove defined yet).
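A minimal sketch of why the or [] guard matters, assuming a run state that was created before the field existed ends up with paths_to_remove set to None. RunState and add_path_to_remove here are hypothetical, trimmed-down stand-ins, not the actual CodaLab definitions:

```python
from collections import namedtuple

# Hypothetical stand-in for the worker's run state (the real one has more fields).
RunState = namedtuple("RunState", ["status", "paths_to_remove"])


def add_path_to_remove(run_state, child_path):
    """Return a new state with child_path appended, tolerating a None field."""
    return run_state._replace(
        paths_to_remove=(run_state.paths_to_remove or []) + [child_path]
    )


# Assumed shape of an older run: the field is present but unset.
old_state = RunState(status="preparing", paths_to_remove=None)
new_state = add_path_to_remove(old_state, "/tmp/dep/child")
newer_state = add_path_to_remove(new_state, "/tmp/dep/other")
```

Without the guard, None + [child_path] would raise a TypeError for such a run. Since _replace returns a new tuple and the list is rebuilt with +, the previous state is left untouched.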
so if I'm understanding correctly, generally lgtm
@epicfaace Awesome, thanks for figuring this out. Does this happen if a worker is running multiple bundles at once?
Yes, exactly.
Adds a test that replicates the disappearing dependency issue #3627, along with a fix for it.
The issue appears to be that finishing a bundle with a dependency deletes that dependency for any other running bundle that depends on it.
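To illustrate the failure mode, here is a rough, self-contained simulation. It is not the CodaLab worker code: the helper names, the run ids "A" and "B", and the placeholder files are all made up for the example. It compares one cleanup list shared by the whole worker with one cleanup list per run:

```python
import os
import tempfile


def make_dependency_link(root, run_id):
    """Create a placeholder file standing in for a run's mounted dependency."""
    run_dir = os.path.join(root, run_id)
    os.makedirs(run_dir, exist_ok=True)
    path = os.path.join(run_dir, "dependency")
    with open(path, "w") as f:
        f.write("data")
    return path


def remove_all(paths):
    """Delete every path in the list that still exists."""
    for path in paths:
        if os.path.exists(path):
            os.remove(path)


def simulate(shared):
    """Return True if run B's dependency survives run A finishing."""
    with tempfile.TemporaryDirectory() as root:
        worker_paths = []                    # one list for the whole worker
        per_run_paths = {"A": [], "B": []}   # one list per run
        links = {}
        for run_id in ("A", "B"):
            links[run_id] = make_dependency_link(root, run_id)
            target = worker_paths if shared else per_run_paths[run_id]
            target.append(links[run_id])
        # Run A finishes while run B is still running.
        remove_all(worker_paths if shared else per_run_paths["A"])
        return os.path.exists(links["B"])


print(simulate(shared=True))   # False: B's dependency was removed with A's
print(simulate(shared=False))  # True: only A's paths were removed
```

With the shared list, cleaning up after run A also removes run B's dependency, which matches the behavior described above when a worker runs multiple bundles at once.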
Here's how I found the issue:
I tested things out on this worksheet on dev: https://worksheets-dev.codalab.org/worksheets/0x5dd52af376db46f4bb57d57efe5a78d9
I noticed a pattern: 0xac0230 fails right after 0x72c5c1 completes.
In fact, the logs say that 0x72c5c1 changed to "finished" at 20:46:33.
Just 3 seconds later is the failure time of the bundle 0xac0230 (20:46:36 = 20:37:40 + 8:56), which is pretty close to 20:46:33.
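The timing arithmetic can be double-checked with a few lines of Python. This assumes 20:37:40 is the time bundle 0xac0230 started and 8:56 is how long it ran before failing; the variable names are only for illustration:

```python
from datetime import datetime, timedelta

# Assumed meaning of the two numbers in the sum above.
started_at = datetime.strptime("20:37:40", "%H:%M:%S")
ran_for = timedelta(minutes=8, seconds=56)

failed_at = started_at + ran_for

# Time at which the logs say 0x72c5c1 changed to "finished".
finished_at = datetime.strptime("20:46:33", "%H:%M:%S")
gap = failed_at - finished_at

print(failed_at.strftime("%H:%M:%S"))  # 20:46:36
print(gap.total_seconds())             # 3.0
```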
A similar pattern occurs for the last 3 bundles. Note how the first 2 bundles (which just check for the directory's existence from 1 to 10000 seconds) are working fine, until they fail right when the third bundle finishes after 100 seconds:
When looking at the logic that happens when a bundle is transitioned from cleaning up -> finished, note that dependencies are deleted in this section of code:
codalab-worksheets/codalab/worker/worker_run_state.py
Lines 610 to 612 in 178fa23
This is indeed the problematic code, and I've fixed the issue in this PR. The code was introduced by @teetone in #2295. @nelson-liu did identify that PR as the cause of the issue in #2440 (comment), but I think we missed that and did not investigate it further.