[Tune] The usage of tune.with_parameters leads to unwanted data sharing and possibly corrupted results #30091

Closed
jbedorf opened this issue Nov 8, 2022 · 12 comments
Labels
bug Something that is supposed to be working; but isn't P2 Important issue, but not time-critical tune Tune-related issues

Comments

@jbedorf
Contributor

jbedorf commented Nov 8, 2022

What happened + What you expected to happen

Using the tune.with_parameters functionality means that Trainables/runs risk receiving the wrong data during launch and/or restart when using a single Ray cluster. This leads to unexpected and possibly completely wrong results.

Imagine the scenario:

  • Launch run X, but the cluster is currently full, so the run becomes pending.
  • Launch run Y on the same cluster.

The parameters of run Y are then stored under the same key as those of run X, overwriting run X's original values. When run X eventually launches, it reads the settings of run Y. This leads to unexpected behaviour, as you can never be certain you are using the right data. Something similar happens if a run is restarted after a failure: there is a chance the original configuration has been overwritten by that of a run launched later.

This is caused by this construct, where the prefix is always the same because it is based on the name of the trainable and does not contain a unique element. However, there must be another cause, as just adding a unique value to the prefix is not enough to solve the problem. I suspect the _Inner class is not unique either, but I didn't investigate further as I used another workaround.

I solved the issue by not using the with_parameters construct and instead manually adding/removing objects to/from the object store.
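
For reference, a rough sketch of the kind of manual workaround I mean (illustrative only, not my exact code; it assumes an ObjectRef can be passed through the trial config): put the data into the object store yourself and fetch it in setup():

    import ray
    from ray import tune

    ray.init(address="auto")


    class MyTrainable(tune.Trainable):
        def setup(self, config):
            # Fetch the data that this specific driver put into the object
            # store, instead of relying on the registry entry that
            # tune.with_parameters shares across runs.
            self.data = ray.get(config["data_ref"])

        def step(self):
            return {"loss": 0.0}


    data_ref = ray.put({"name": "run-specific data"})  # done once per driver

    tune.run(
        MyTrainable,
        config={"data_ref": data_ref},
        stop={"training_iteration": 1},
    )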

Versions / Dependencies

Master branch. Python 3.8. Linux.

Reproduction script

Usage of the script below (note: set resources_per_trial to more than 50% of your CPUs so the second run cannot start and stays pending):

  • In terminal 1, launch Ray (ray start --head)
  • In terminal 2, run: python tune_test.py 1
  • In terminal 3 (within 30 seconds of the first launch), run: python tune_test.py 2

What happens:

  • The first run launches and runs for 30 seconds.
  • The second run launches and stays pending.
  • The first run crashes after 30 seconds and is automatically restarted.
  • On restart, the first run reports that the wrong configuration is used (ValueError: Config: 1 does not match data: 2).
import sys
import time

import ray
from ray import tune

ray.init(address="auto")


class MyTrainable(tune.Trainable):
    def setup(self, config, data=None):
        # `data` is injected by tune.with_parameters and should always match
        # the config of the run that created this trial.
        self.count = 0
        if config["name"] != data["name"]:
            raise ValueError(
                f"Config: {config['name']} does not match data: {data['name']}"
            )

    def step(self):
        time.sleep(1)
        self.count += 1
        # Crash after ~30 seconds so Tune restarts the trial (max_failures=2).
        if self.count > 30:
            raise ValueError("Crashing the worker")
        return {"loss": self.count, "done": self.count > 100}


data = {"name": sys.argv[1]}
config = {"name": sys.argv[1]}

# with_parameters attaches `data` to the trainable via Tune's global registry,
# which is where the cross-run collision happens.
analysis = tune.run(
    tune.with_parameters(MyTrainable, data=data),
    resources_per_trial={"cpu": 5},
    config=config,
    max_failures=2,
)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@jbedorf jbedorf added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 8, 2022
@justinvyu
Contributor

Hi @jbedorf,

Thanks for the detailed description and reproduction script. The issue here seems to be how Tune registers MyTrainable. At the start of the first experiment, the Trainable gets registered in the global key-value store here, under its name "MyTrainable". The second experiment then registers again under the same name, this time with new data attached, and this entry is used when recovering the trial from the first experiment.
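
Conceptually, the collision looks roughly like this (the registration normally happens implicitly inside tune.run, and the two calls below would live in two separate driver scripts; MyTrainable, data_x and data_y stand in for the reproduction script above):

    from ray import tune


    class MyTrainable(tune.Trainable):  # simplified version of the repro script
        def setup(self, config, data=None):
            self.data = data

        def step(self):
            return {"loss": 0}


    data_x = {"name": "1"}  # driver 1's data
    data_y = {"name": "2"}  # driver 2's data

    # Driver 1 (run X): the wrapped trainable is registered under the key
    # "MyTrainable" in the cluster-wide KV store.
    tune.register_trainable("MyTrainable", tune.with_parameters(MyTrainable, data=data_x))

    # Driver 2 (run Y), same cluster, same key: the entry is overwritten.
    tune.register_trainable("MyTrainable", tune.with_parameters(MyTrainable, data=data_y))

    # A pending or restarted trial from run X now resolves "MyTrainable"
    # to the wrapper carrying data_y.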

cc: @krfricke This seems related to adding better support for running multiple Tune experiments in parallel.

@xwjiang2010
Contributor

Could you elaborate on why you want to launch multiple Tune runs concurrently in one Ray cluster?

@jbedorf
Contributor Author

jbedorf commented Nov 9, 2022

Due to infrastructure reasons we have a single large Ray cluster that is used for multiple different tasks and by different users. This works mostly fine, except for the situation described above.

@xwjiang2010 xwjiang2010 added tune Tune-related issues air labels Nov 14, 2022
@xwjiang2010
Contributor

Got it, I am not sure that Tune is designed to support running multiple jobs in the same ray cluster.
Usually people start multiple clusters with each cluster doing one tuning job.

@hora-anyscale
Contributor

Per Triage Sync: need to confirm in the docs that this is not supported.

@hora-anyscale hora-anyscale added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 14, 2022
@jbedorf
Contributor Author

jbedorf commented Nov 15, 2022

Got it, I am not sure that Tune is designed to support running multiple jobs in the same ray cluster. Usually people start multiple clusters with each cluster doing one tuning job.

That is interesting. My understanding was that it is more efficient to have a few large clusters (e.g. less overhead from having multiple head nodes). Especially in multi-user situations with restricted access permissions to, for example, Kubernetes clusters, you would expect multiple users to share a single Ray cluster. Similarly, why would you need an HA Ray cluster if it is not meant to be shared/robust and users should just spin up small, temporary clusters?

From the general documentation, the advised method to run jobs on such a cluster is via Ray AIR. Given that Tuner.fit() is a wrapper around Ray Tune, all of those would face similar issues. The same goes for features like Ray Jobs: you would expect long-running clusters with the ability to do different work in parallel.

What would you suggest to use in situations where the cluster is shared by multiple teams?

@xwjiang2010
Contributor

This is a limitation of Ray Tune. I am not knowledgeable enough to speak about the ideal setup for multi-tenancy in Ray. @richardliaw could you loop in the right people here? Thanks!

@walid0925

FWIW, I've recently run into the same issue even without using tune.with_parameters, just using partial.

I launched multiple processes using the same trainable (an objective function), but with a different dataset for each: one was a regression task and one was a classification task, and the results in the Tune CLI showed that these were intermingling.
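
Roughly the pattern I used (names and data below are placeholders, not my actual code):

    import functools

    from ray import tune


    def objective(config, dataset=None):
        # Placeholder objective; in my case `dataset` was a regression
        # dataset in one process and a classification dataset in the other.
        return {"score": len(dataset) * config["lr"]}


    my_dataset = list(range(100))  # placeholder data

    # The registered name is derived from `objective`, so entries from
    # different drivers can still collide.
    trainable = functools.partial(objective, dataset=my_dataset)
    tune.run(trainable, config={"lr": tune.uniform(0.001, 0.1)})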

@krfricke
Contributor

The main problem here, again, is that Ray Tune uses a global key-value store to register both the trainable and its parameters. The workaround is to rename the trainable (or override the __name__ attribute), though multi-tenancy is generally not supported at the moment - i.e. we never test it and we don't guarantee that it works. We hope to refactor some of this logic in the future, but it's lower priority.

@jbedorf
Contributor Author

jbedorf commented Dec 19, 2022

Thanks for the insight @krfricke. Is the multi-tenancy limitation specific to Tune (and thus the Ray AIR Tuner), or is it Ray-wide? Given the addition of Ray Jobs, and services like KubeRay with shareable endpoints, I would assume Ray is set up for sharing. Otherwise there is not much need for something like Ray Jobs, which makes me wonder what the supported way is to share and use a cluster among multiple users.

@krfricke
Contributor

We're working on resolving this issue and hope to include a fix for Ray 2.4.

In the meantime, for reference, this is how you can work around the issue in practice.

For generic Ray Tune trainables:

    import uuid

    from ray.tune import Tuner

    # Give the trainable a unique, per-driver name so registrations from
    # parallel runs don't collide in the global registry.
    MyTrainable.__name__ = "MyTrainable" + uuid.uuid4().hex[:8]

    tuner = Tuner(
        MyTrainable,
        # ...
    )
    tuner.fit()

For AIR trainers:

    import uuid
    from ray.train.torch import TorchTrainer

    TorchTrainer.__name__ = "TorchTrainer_" + uuid.uuid4().hex[:8]

    trainer = TorchTrainer(
        # ...
    )
    trainer.fit()

krfricke added a commit that referenced this issue Mar 7, 2023
…n multi tenancy (#33095)

In #32560, we documented a workaround for the multi tenancy issues in Ray Tune, e.g. described in #30091.

This PR fixes the root issue by prefixing the global registry with the core worker job ID, which is unique per driver process. This will avoid conflicts between parallel running tune trials.

To prove that it works, we modify the fix from #32560 to not require a workaround anymore.

To avoid cluttering the global key-value store with stale objects, we also de-register objects from the global KV store after finishing a Ray Tune run.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
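
As a rough conceptual sketch of the fix (not the actual registry code; the key format below is illustrative), the registered entries are now scoped by the driver's job ID:

    import ray

    ray.init()

    # Each driver process has a unique job ID, so keys written by two
    # concurrent Tune runs no longer overwrite each other.
    job_id = ray.get_runtime_context().get_job_id()

    # Illustrative key format, not Tune's actual one:
    key_for_this_driver = f"{job_id}:trainable:MyTrainable"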
@anyscalesam anyscalesam removed the air label Oct 28, 2023
@justinvyu
Contributor

This has been fixed in 2.4+: #33095
