[Tune] The usage of tune.with_parameters leads to unwanted data sharing and possibly corrupted results #30091

Closed
jbedorf opened this issue Nov 8, 2022 · 12 comments
Labels
bug Something that is supposed to be working; but isn't P2 Important issue, but not time-critical tune Tune-related issues

Comments

@jbedorf
Contributor

jbedorf commented Nov 8, 2022

What happened + What you expected to happen

Using the tune.with_parameters functionality means that Trainables/runs risk receiving the wrong data during launch and/or restart when using a single Ray cluster. This leads to unexpected and possibly completely wrong results.

Imagine the scenario:

  • Launch run X, but the cluster is currently full, so the run becomes pending.
  • Launch run Y on the same cluster.

The parameters of run Y are then stored under the same key as those of run X, overwriting run X's original values. When run X eventually launches, it reads the settings of run Y. This leads to unexpected behaviour, as you can never be certain you are using the right data. Something similar happens if a run is restarted after a failure: there is a chance the original configuration has been overwritten by that of a run launched later.

This is caused by this construct, where the prefix is always the same because it is based on the name of the trainable and does not contain a unique element. However, there must be another cause, as just adding a unique value to the prefix is not enough to solve the problem. I suspect the _Inner class is not unique either, but I didn't investigate further as I used another workaround.

I solved the issue by not using the with_parameters construct and instead manually adding/removing objects to/from the object store.
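
For reference, a rough sketch of the kind of manual workaround I mean (illustrative only, not my exact code; it assumes an ObjectRef can be passed through the trial config): put the data into the object store yourself and fetch it in setup():

    import ray
    from ray import tune

    ray.init(address="auto")


    class MyTrainable(tune.Trainable):
        def setup(self, config):
            # Fetch the data that this specific driver put into the object
            # store, instead of relying on the registry entry that
            # tune.with_parameters shares across runs.
            self.data = ray.get(config["data_ref"])

        def step(self):
            return {"loss": 0.0}


    data_ref = ray.put({"name": "run-specific data"})  # done once per driver

    tune.run(
        MyTrainable,
        config={"data_ref": data_ref},
        stop={"training_iteration": 1},
    )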

Versions / Dependencies

Master branch. Python 3.8. Linux.

Reproduction script

Usage of the script below (note: set resources_per_trial to more than 50% of your CPUs so the second run cannot start and stays pending):

  • In terminal 1, launch Ray (ray start --head)
  • In terminal 2, run: python tune_test.py 1
  • In terminal 3 (within 30 seconds of the first launch), run: python tune_test.py 2

What happens:

  • The first run launches and runs for 30 seconds.
  • The second run launches and stays pending.
  • The first run crashes after 30 seconds and is automatically restarted.
  • On restart, the first run reports that the wrong configuration is used (ValueError: Config: 1 does not match data: 2).
import sys
import time

import ray
from ray import tune

ray.init(address="auto")


class MyTrainable(tune.Trainable):
    def setup(self, config, data=None):
        # `data` is injected by tune.with_parameters and should always match
        # the config of the run that created this trial.
        self.count = 0
        if config["name"] != data["name"]:
            raise ValueError(
                f"Config: {config['name']} does not match data: {data['name']}"
            )

    def step(self):
        time.sleep(1)
        self.count += 1
        # Crash after ~30 seconds so Tune restarts the trial (max_failures=2).
        if self.count > 30:
            raise ValueError("Crashing the worker")
        return {"loss": self.count, "done": self.count > 100}


data = {"name": sys.argv[1]}
config = {"name": sys.argv[1]}

# with_parameters attaches `data` to the trainable via Tune's global registry,
# which is where the cross-run collision happens.
analysis = tune.run(
    tune.with_parameters(MyTrainable, data=data),
    resources_per_trial={"cpu": 5},
    config=config,
    max_failures=2,
)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@jbedorf jbedorf added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 8, 2022
@justinvyu
Contributor

Hi @jbedorf,

Thanks for the detailed description and reproduction script. The issue here seems to be how Tune registers MyTrainable. At the start of the first experiment, the Trainable gets registered in the global key-value store here, under its name "MyTrainable". The second experiment then registers again under the same name, this time with new data attached, and this entry is used when recovering the trial from the first experiment.
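
Conceptually, the collision looks roughly like this (the registration normally happens implicitly inside tune.run, and the two calls below would live in two separate driver scripts; MyTrainable, data_x and data_y stand in for the reproduction script above):

    from ray import tune


    class MyTrainable(tune.Trainable):  # simplified version of the repro script
        def setup(self, config, data=None):
            self.data = data

        def step(self):
            return {"loss": 0}


    data_x = {"name": "1"}  # driver 1's data
    data_y = {"name": "2"}  # driver 2's data

    # Driver 1 (run X): the wrapped trainable is registered under the key
    # "MyTrainable" in the cluster-wide KV store.
    tune.register_trainable("MyTrainable", tune.with_parameters(MyTrainable, data=data_x))

    # Driver 2 (run Y), same cluster, same key: the entry is overwritten.
    tune.register_trainable("MyTrainable", tune.with_parameters(MyTrainable, data=data_y))

    # A pending or restarted trial from run X now resolves "MyTrainable"
    # to the wrapper carrying data_y.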

cc: @krfricke This seems related to adding better support for running multiple Tune experiments in parallel.

@xwjiang2010
Contributor

Could you elaborate on why you want to launch multiple Tune runs concurrently in one Ray cluster?

@jbedorf
Contributor Author

jbedorf commented Nov 9, 2022

Due to infrastructure reasons we have a single large Ray cluster that is used for multiple different tasks and by different users. This works mostly fine, except for the situation described above.

@xwjiang2010 xwjiang2010 added tune Tune-related issues air labels Nov 14, 2022
@xwjiang2010
Contributor

Got it, I am not sure that Tune is designed to support running multiple jobs in the same ray cluster.
Usually people start multiple clusters with each cluster doing one tuning job.

@hora-anyscale
Contributor

Per Triage Sync: need to confirm in the docs that this is not supported.

@hora-anyscale hora-anyscale added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 14, 2022
@jbedorf
Contributor Author

jbedorf commented Nov 15, 2022

Got it, I am not sure that Tune is designed to support running multiple jobs in the same ray cluster. Usually people start multiple clusters with each cluster doing one tuning job.

That is interesting. My understanding was that it is more efficient to have a few large clusters (e.g. less overhead from having multiple head nodes). Especially in multi-user situations with restricted access permissions to, for example, Kubernetes clusters, you would expect multiple users to share a single Ray cluster. Similarly, why would you need an HA Ray cluster if it is not meant to be shared/robust and users should just spin up small, temporary clusters?

From the general documentation, the advised method to run jobs on such a cluster is via Ray AIR. Given that Tuner.fit() is a wrapper around Ray Tune, all of those would face similar issues. The same goes for features like Ray Jobs: you would expect long-running clusters with the ability to do different work in parallel.

What would you suggest to use in situations where the cluster is shared by multiple teams?

@xwjiang2010
Contributor

This is a limitation of Ray Tune. I am not knowledgeable enough to speak about the ideal setup for multi-tenancy in Ray. @richardliaw could you loop in the right people here? Thanks!

@walid0925

FWIW, I've recently run into the same issue even without using tune.with_parameters, just using partial.

I launched multiple processes using the same trainable (an objective function), but with a different dataset for each: one was a regression task and one was a classification task, and the results in the Tune CLI showed that these were intermingling.
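
Roughly the pattern I used (names and data below are placeholders, not my actual code):

    import functools

    from ray import tune


    def objective(config, dataset=None):
        # Placeholder objective; in my case `dataset` was a regression
        # dataset in one process and a classification dataset in the other.
        return {"score": len(dataset) * config["lr"]}


    my_dataset = list(range(100))  # placeholder data

    # The registered name is derived from `objective`, so entries from
    # different drivers can still collide.
    trainable = functools.partial(objective, dataset=my_dataset)
    tune.run(trainable, config={"lr": tune.uniform(0.001, 0.1)})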

@krfricke
Contributor

The main problem here, again, is that Ray Tune uses a global key-value store to register both the trainable and its parameters. The workaround is to rename the trainable (or override the __name__ attribute), though multi-tenancy is generally not supported at the moment - i.e. we never test it and we don't guarantee that it works. We hope to refactor some of this logic in the future, but it's lower priority.

@jbedorf
Contributor Author

jbedorf commented Dec 19, 2022

Thanks for the insight @krfricke. Is the multi-tenancy limitation specific to Tune (and thus the Ray AIR Tuner), or is it Ray-wide? Given the addition of Ray Jobs, and services like KubeRay with shareable endpoints, I would assume Ray is set up for sharing. Otherwise there is not much need for something like Ray Jobs, which makes me wonder what the supported way is to share and use a cluster among multiple users.

@krfricke
Contributor

We're working on resolving this issue and hope to include a fix for Ray 2.4.

In the meantime, for reference, this is how you can work around the issue in practice.

For generic Ray Tune trainables:

    import uuid

    from ray.tune import Tuner

    # Give the trainable a unique, per-driver name so registrations from
    # parallel runs don't collide in the global registry.
    MyTrainable.__name__ = "MyTrainable" + uuid.uuid4().hex[:8]

    tuner = Tuner(
        MyTrainable,
        # ...
    )
    tuner.fit()

For AIR trainers:

    import uuid
    from ray.train.torch import TorchTrainer

    TorchTrainer.__name__ = "TorchTrainer_" + uuid.uuid4().hex[:8]

    trainer = TorchTrainer(
        # ...
    )
    trainer.fit()

krfricke added a commit that referenced this issue Mar 7, 2023
…n multi tenancy (#33095)

In #32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in #30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from #32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
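
As a rough conceptual sketch of the fix (not the actual registry code; the key format below is illustrative), the registered entries are now scoped by the driver's job ID:

    import ray

    ray.init()

    # Each driver process has a unique job ID, so keys written by two
    # concurrent Tune runs no longer overwrite each other.
    job_id = ray.get_runtime_context().get_job_id()

    # Illustrative key format, not Tune's actual one:
    key_for_this_driver = f"{job_id}:trainable:MyTrainable"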
@anyscalesam anyscalesam removed the air label Oct 28, 2023
@justinvyu
Contributor

This has been fixed in 2.4+: #33095
