[Tune] The usage of tune.with_parameters leads to unwanted data sharing and possibly corrupted results #30091
Comments
Hi @jbedorf, thanks for the detailed description and reproduction script. The issue here seems to be how Tune is registering the trainable and its parameters. cc @krfricke: this seems related to adding better support for running multiple Tune experiments in parallel.
Could you elaborate on why you want to launch multiple Tune runs concurrently in one Ray cluster?
For infrastructure reasons we have a single large Ray cluster used for multiple different tasks and by different users. This works mostly fine, except for the situation described above.
Got it. I am not sure that Tune is designed to support running multiple jobs in the same Ray cluster.
Per Triage Sync: need to confirm in the docs that this is not supported.
That is interesting. My understanding was that it is more efficient to have a few large clusters (e.g. less overhead from having multiple head nodes). Especially in multi-user situations with restricted access permissions to, for example, Kubernetes clusters, you would expect multiple users to share a single Ray cluster. Similarly, why would you need an HA Ray cluster if you don't plan on having it shared/robust, since users could just spin up small, temporary clusters? According to the general documentation, the advised method to run jobs on such a cluster is via Ray AIR. Given that Tuner.fit() is a wrapper around Ray Tune, all of those would face similar issues. The same goes for features like Ray Jobs: you would expect long-running clusters with the ability to do different work in parallel. What would you suggest using in situations where the cluster is shared by multiple teams?
This is a limitation of Ray Tune. I am not knowledgeable enough to speak about the ideal setup for multi-tenancy in Ray. @richardliaw could you loop in the right people here? Thanks!
FWIW, I've recently run into the same issue even without using tune.with_parameters: I launched multiple processes using the same Ray cluster.
The main problem here again is that Ray Tune uses a global key-value store to register both the trainable and its parameters. The workaround here is to rename the trainable (or override the name it is registered under).
Thanks for the insight @krfricke. Is the multi-tenancy issue specific to Tune (and thus the Ray AIR Tuner), or is it Ray-wide? Given the addition of Ray Jobs, and services like KubeRay with shareable endpoints, I would assume Ray is set up for sharing. Otherwise there is not much need for something like Ray Jobs, which makes me wonder what the supported way is to share and use a cluster among multiple users.
We're working on resolving this issue and hope to include a fix in Ray 2.4. In the meantime, for reference, this is how you can work around the issue in practice. For generic Ray Tune trainables:
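For example (a sketch, not the exact snippet from the docs; the function name and the UUID suffix are illustrative assumptions), the parameterized trainable can be registered explicitly under a name that is unique per driver process:

```python
import uuid

import ray
from ray import tune


def train_fn(config, data=None):
    # `data` is injected via tune.with_parameters.
    tune.report(score=config["x"] + len(data))


ray.init()

data = list(range(100))

# Register the parameterized trainable under a name that is unique per
# driver process, so concurrent Tune runs on the same cluster do not
# overwrite each other's entries in the global registry.
unique_name = f"train_fn_{uuid.uuid4().hex[:8]}"
tune.register_trainable(unique_name, tune.with_parameters(train_fn, data=data))

tuner = tune.Tuner(unique_name, param_space={"x": tune.grid_search([1, 2])})
tuner.fit()
```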
For AIR trainers:
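One possible equivalent for AIR trainers, assuming the registered trainable name is derived from the trainer class name (the `TorchTrainer` and the training loop below are placeholders, not the snippet from the docs), is to give the trainer class a per-run unique name:

```python
import uuid

from ray.air import ScalingConfig, session
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Placeholder training loop.
    session.report({"score": 1.0})


# Create a subclass whose class name is unique per driver process, so the
# trainable derived from this trainer is registered under a unique key.
UniqueTrainer = type(f"TorchTrainer_{uuid.uuid4().hex[:8]}", (TorchTrainer,), {})

trainer = UniqueTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1),
)
result = trainer.fit()
```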
…n multi tenancy (#33095) In #32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in #30091. This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials. To prove that it works, we modify the fix from #32560 to not require a workaround anymore. To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Kai Fricke <coding@kaifricke.com>
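Conceptually, the fix keys registry entries by the driver's job ID. A rough illustration of the idea (not the actual Tune internals; the key layout here is made up):

```python
import ray

ray.init()

# Every driver process has its own job ID, so using it as a key prefix
# keeps registry entries from concurrent Tune runs separate.
job_id = ray.get_runtime_context().get_job_id()
registry_key = f"{job_id}:trainable:train_fn"
print(registry_key)
```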
This has been fixed in 2.4+: #33095
What happened + What you expected to happen
Using the tune.with_parameters functionality means that Trainables/runs risk receiving the wrong data at launch and/or on restart when using a single Ray cluster. This leads to unexpected and possibly completely wrong results.

Imagine the scenario where two runs, X and Y, are launched on the same cluster and both use tune.with_parameters with a trainable of the same name, with Y launched after X. What happens is that the parameters of run Y are stored under the same key as those of X, thereby overwriting the original values of run X. When run X then launches, it reads the settings of run Y. This leads to unexpected behaviour, as you never know for certain whether you are using the right data. Something similar happens if a run is restarted after a failure: there is a chance that the original configuration has been overwritten by that of a run launched at a later time.
This is caused by this construct, where the prefix will always be the same: it is based on the name of the trainable and does not contain a unique element. However, there must be another cause, as just adding a unique value to the prefix is not enough to solve the problem. I suspect that the _Inner class is not unique either, but I didn't investigate further as I used another workaround. I solved the issue by not using the with_parameters construct at all and instead manually adding objects to (and removing them from) the object store, along the lines of the sketch below.
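A minimal sketch of that alternative (the names here are illustrative, not the original code): put the data into the object store with `ray.put`, pass the resulting `ObjectRef` through the trial config, and fetch it with `ray.get` inside the trainable.

```python
import ray
from ray import tune


def train_fn(config):
    # Fetch the dataset from the object store inside the trainable,
    # instead of relying on tune.with_parameters to inject it.
    data = ray.get(config["data_ref"])
    tune.report(num_rows=len(data))


ray.init()

# Put the (potentially large) dataset into the object store once and
# pass only the reference through the search space.
data_ref = ray.put(list(range(1_000)))

tuner = tune.Tuner(
    train_fn,
    param_space={
        "x": tune.grid_search([1, 2]),
        "data_ref": data_ref,
    },
)
tuner.fit()
```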
Versions / Dependencies
Master. Python 3.8. Linux
Reproduction script
Usage of the script below (note: make resources_per_trial larger than 50% of your CPUs):
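A minimal sketch along the lines of the reproduction (the function name, resource numbers, and CLI argument handling are illustrative assumptions): run this driver twice against the same cluster, e.g. once with `1` and once with `2` as the argument.

```python
import sys

import ray
from ray import tune


def train_fn(config, data=None):
    # `data` is injected via tune.with_parameters; it should match the
    # value passed through the config. If another run has overwritten
    # the registry entry, the two disagree.
    if config["value"] != data:
        raise ValueError(f"Config: {config['value']} does not match data: {data}")
    tune.report(ok=True)


if __name__ == "__main__":
    value = int(sys.argv[1])  # e.g. 1 for the first driver, 2 for the second
    ray.init(address="auto")
    tune.run(
        tune.with_parameters(train_fn, data=value),
        config={"value": value},
        num_samples=4,
        # Large enough that trials from the two drivers queue behind
        # each other rather than all starting immediately.
        resources_per_trial={"cpu": 4},
    )
```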
What happens: one run reads the data registered by the other and fails with an error (ValueError: Config: 1 does not match data: 2).

Issue Severity
Medium: It is a significant difficulty but I can work around it.