Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Add test for multi-tenancy workaround and documentation to FAQ #32560

Merged
merged 20 commits into from
Feb 16, 2023

Conversation

krfricke
Copy link
Contributor

Why are these changes needed?

Ray Tune does not officially support multi-tenancy, but we see some users still using it. They then run into problems with the cluster-global trainable registry, which will overwrite trainables with the same name from different tuning jobs.

The workaround here is to use a unique name for every trainable. This is currently undocumented. This PR adds a section to the Ray Tune FAQ explaining the workaround (with a big disclaimer on why multi-tenancy might still be a bad idea). It also adds a unit test that constructs a conflict situation and tests that the workaround mitigates the problem.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 4 commits February 14, 2023 17:30
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
from ray.train.torch import TorchTrainer

TorchTrainer.__name__ = "TorchTrainer_" + uuid.uuid4().hex[:8]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python question - at this point is TorchTrainer an isolated reference? I.e. will this name change leak if other places in the code import TorchTrainer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it will update the name attribute globally.

ericl
ericl previously requested changes Feb 15, 2023
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not grab job ID from the runtime context and use that for registration as a name prefix? It would be great to use that automatically.

doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
krfricke and others added 3 commits February 16, 2023 08:51
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
doc/source/tune/faq.rst Outdated Show resolved Hide resolved
krfricke and others added 3 commits February 16, 2023 09:47
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke krfricke dismissed ericl’s stale review February 16, 2023 21:12

Addressed comments

@krfricke krfricke merged commit 61d0590 into ray-project:master Feb 16, 2023
@krfricke krfricke deleted the tune/test-multi-tenancy branch February 16, 2023 21:12
krfricke added a commit that referenced this pull request Mar 7, 2023
…n multi tenancy (#33095)

In #32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in #30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from #32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request Mar 21, 2023
…n multi tenancy (ray-project#33095)

In ray-project#32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in ray-project#30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from ray-project#32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ray-project#32560)

Ray Tune does not officially support multi-tenancy, but we see some users still using it. They then run into problems with the cluster-global trainable registry, which will overwrite trainables with the same name from different tuning jobs.

The workaround here is to use a unique name for every trainable. This is currently undocumented. This PR adds a section to the Ray Tune FAQ explaining the workaround (with a big disclaimer on why multi-tenancy might still be a bad idea). It also adds a unit test that constructs a conflict situation and tests that the workaround mitigates the problem.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…n multi tenancy (ray-project#33095)

In ray-project#32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in ray-project#30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from ray-project#32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…ray-project#32560)

Ray Tune does not officially support multi-tenancy, but we see some users still using it. They then run into problems with the cluster-global trainable registry, which will overwrite trainables with the same name from different tuning jobs.

The workaround here is to use a unique name for every trainable. This is currently undocumented. This PR adds a section to the Ray Tune FAQ explaining the workaround (with a big disclaimer on why multi-tenancy might still be a bad idea). It also adds a unit test that constructs a conflict situation and tests that the workaround mitigates the problem.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…n multi tenancy (ray-project#33095)

In ray-project#32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in ray-project#30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from ray-project#32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…ray-project#32560)

Ray Tune does not officially support multi-tenancy, but we see some users still using it. They then run into problems with the cluster-global trainable registry, which will overwrite trainables with the same name from different tuning jobs.

The workaround here is to use a unique name for every trainable. This is currently undocumented. This PR adds a section to the Ray Tune FAQ explaining the workaround (with a big disclaimer on why multi-tenancy might still be a bad idea). It also adds a unit test that constructs a conflict situation and tests that the workaround mitigates the problem.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: elliottower <elliot@elliottower.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…n multi tenancy (ray-project#33095)

In ray-project#32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in ray-project#30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from ray-project#32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…n multi tenancy (ray-project#33095)

In ray-project#32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in ray-project#30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from ray-project#32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants