[tune] Add test for multi-tenancy workaround and documentation to FAQ #32560
Conversation
Signed-off-by: Kai Fricke <kai@anyscale.com>
doc/source/tune/faq.rst
Outdated
from ray.train.torch import TorchTrainer

TorchTrainer.__name__ = "TorchTrainer_" + uuid.uuid4().hex[:8]
Python question: at this point, is TorchTrainer an isolated reference? I.e., will this name change leak if other places in the code import TorchTrainer?
I believe it will update the __name__ attribute globally.
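For anyone wondering why the answer is "globally": a minimal stand-in sketch (a plain class instead of TorchTrainer, so it runs without Ray installed) showing that __name__ lives on the single shared class object, so every import site sees the rename.

```python
class Trainable:  # stand-in for TorchTrainer; the mechanics are the same
    pass


# Another "import site" holding a reference to the very same class object.
alias = Trainable

Trainable.__name__ = "Trainable_abc123"

# __name__ is an attribute on the class object itself, and all importers
# share that one object, so the rename is visible everywhere -- it is not
# an isolated reference.
assert alias.__name__ == "Trainable_abc123"
print(alias.__name__)  # Trainable_abc123
```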
Why not grab job ID from the runtime context and use that for registration as a name prefix? It would be great to use that automatically.
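A minimal sketch of what that suggestion could look like on the user side, assuming ray.get_runtime_context().get_job_id() is available (recent Ray versions). Note that the follow-up fix in #33095 (referenced below) ended up applying this prefixing inside the registry itself rather than asking users to do it.

```python
import ray
from ray.train.torch import TorchTrainer  # requires ray[train] and torch

ray.init()

# The driver's job ID is unique per driver process, so it makes a stable
# prefix -- unlike a random UUID, it stays the same across restarts of the
# same driver and could in principle be applied automatically by Tune.
job_id = ray.get_runtime_context().get_job_id()
TorchTrainer.__name__ = f"TorchTrainer_{job_id}"
```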
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
…ne/test-multi-tenancy
…n multi tenancy (#33095) In #32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in #30091. This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials. To prove that it works, we modify the fix from #32560 to not require a workaround anymore. To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <coding@kaifricke.com>
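A conceptual sketch of the fix described above, not Ray's actual internals: registry keys are namespaced by the driver's job ID, and a job's entries are dropped once the run finishes. The names here (register_trainable, get_trainable, deregister, _prefixed, _global_kv) are illustrative stand-ins.

```python
import ray

_global_kv = {}  # stand-in for Ray's cluster-global key-value store


def _prefixed(name: str) -> str:
    # Namespace every key by the driver's job ID so parallel drivers can use
    # the same user-facing trainable name without clobbering each other.
    # Assumes Ray has been initialized (ray.init()).
    job_id = ray.get_runtime_context().get_job_id()
    return f"{job_id}:{name}"


def register_trainable(name, trainable):
    _global_kv[_prefixed(name)] = trainable


def get_trainable(name):
    return _global_kv[_prefixed(name)]


def deregister(names):
    # After a Tune run finishes, drop this job's entries so the shared
    # KV store is not cluttered with stale objects.
    for name in names:
        _global_kv.pop(_prefixed(name), None)
```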
…ray-project#32560) Ray Tune does not officially support multi-tenancy, but we see some users still using it. They then run into problems with the cluster-global trainable registry, which will overwrite trainables with the same name from different tuning jobs. The workaround here is to use a unique name for every trainable. This is currently undocumented. This PR adds a section to the Ray Tune FAQ explaining the workaround (with a big disclaimer on why multi-tenancy might still be a bad idea). It also adds a unit test that constructs a conflict situation and tests that the workaround mitigates the problem. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Why are these changes needed?
Ray Tune does not officially support multi-tenancy, but we see some users still using it. They then run into problems with the cluster-global trainable registry, which will overwrite trainables with the same name from different tuning jobs.
The workaround here is to use a unique name for every trainable. This is currently undocumented. This PR adds a section to the Ray Tune FAQ explaining the workaround (with a big disclaimer on why multi-tenancy might still be a bad idea). It also adds a unit test that constructs a conflict situation and tests that the workaround mitigates the problem.
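A minimal end-to-end sketch of the documented workaround, using a hypothetical function trainable (objective) rather than a Trainer; the key line is the unique __name__ assigned before the trainable is passed to the Tuner.

```python
import uuid

from ray import tune


# Hypothetical objective, for illustration only.
def objective(config):
    return {"score": (config["x"] - 3) ** 2}


# The documented workaround: make the trainable's name unique per tuning job,
# so a second driver on the same cluster does not overwrite this entry in the
# cluster-global trainable registry.
objective.__name__ = "objective_" + uuid.uuid4().hex[:8]

tuner = tune.Tuner(objective, param_space={"x": tune.uniform(0, 10)})
results = tuner.fit()
```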
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.