[Core|Serve]: Ray 2.39.0/2.40.0 keep runtime_env state between multiple ray.init() #49074

Closed
Martin4R opened this issue Dec 4, 2024 · 4 comments · Fixed by #49697
Labels
bug Something that is supposed to be working; but isn't P0 Issues that should be fixed in short order serve Ray Serve Related Issue

Comments

Martin4R commented Dec 4, 2024

What happened + What you expected to happen

We upgraded to Ray 2.39.0 and noticed our tests were failing. We tracked it down to Ray carrying over runtime_env settings (specifically, environment variables) from one ray.init(...) call to the next, even though ray.shutdown() was called properly in between.
We did not find a workaround other than staying on Ray 2.38.0 for now.

Versions / Dependencies

  • the issue happens on Linux but not on MacOS
  • the issue happens on Linux in Ray 2.39.0 and 2.40.0, but not in 2.38.0
  • In all cases we used Python 3.12

Reproduction script

The issue can be easily reproduced by running the provided reproduction script within Ray CPU docker images of platform linux/amd64 and Python 3.12.

Script explanation

  • The script calls ray.init() with a runtime environment where the env-var "MY_ENV_VAR" is set to "my_value1".
  • It starts a MyServeDeployment within the Ray cluster, which prints out this env-var (should print "my_value1").
  • Then ray.shutdown() is called, which according to its docs is supposed to "cleanup state between tests".
  • The script then calls ray.init() a second time with a runtime environment where the env-var "MY_ENV_VAR" is set to "my_value2".
  • It starts MyServeDeployment again, which should now print "my_value2"; on Ray 2.39.0 and Ray 2.40.0, however, it prints "my_value1", which is not the expected value.

Script for Python 3.12

import logging
import os
from time import sleep

import ray
from ray import runtime_context, serve


@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 0.1})
class MyServeDeployment:
    def __init__(self) -> None:
        # Log the value of MY_ENV_VAR that the replica sees when it is constructed.
        env_var = os.environ.get("MY_ENV_VAR", "not found")
        logger = logging.getLogger("ray.serve")
        logger.info(f"MY_ENV_VAR within serve deployment INIT is: {env_var}")

    async def __call__(self) -> str:
        return ""


# The application is bound once and reused for both sessions below.
app = MyServeDeployment.bind()

# First session: the replica should log "my_value1".
env_vars_1: dict[str, str] = {
    'MY_ENV_VAR': 'my_value1'
}
ray.init(include_dashboard=False, runtime_env=runtime_context.RuntimeEnv(env_vars=env_vars_1))
serve.start()
serve.run(app)
sleep(5)  # give the replica time to start and log
serve.shutdown()
ray.shutdown()

# Second session: the replica should log "my_value2".
env_vars_2: dict[str, str] = {
    'MY_ENV_VAR': 'my_value2'
}
ray.init(include_dashboard=False, runtime_env=runtime_context.RuntimeEnv(env_vars=env_vars_2))
serve.start()
serve.run(app)
sleep(5)  # give the replica time to start and log
serve.shutdown()
ray.shutdown()

Issue Severity

High: It blocks me from completing my task.

@Martin4R Martin4R added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 4, 2024
@jcotant1 jcotant1 added serve Ray Serve Related Issue core Issues that should be addressed in Ray Core labels Dec 4, 2024
@jjyao jjyao added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 9, 2024

jjyao commented Dec 9, 2024

@MortalHappiness can you take a look at this one?

MortalHappiness commented

Found the PR that caused this problem: #48218. Trying to fix it now.

@akshay-anyscale akshay-anyscale removed the serve Ray Serve Related Issue label Dec 10, 2024

MortalHappiness commented Dec 11, 2024

I haven't found the root cause yet, but reverting the changes to these 5 files from #48218 fixes the issue.

[Image: screenshot of the 5 files referred to above]

@jjyao jjyao added serve Ray Serve Related Issue and removed core Issues that should be addressed in Ray Core labels Dec 11, 2024

jjyao commented Dec 11, 2024

@akshay-anyscale can we have a Serve team member look at this? @MortalHappiness has already found the offending commit.

@edoakes edoakes closed this as completed in 0c4f6ac Jan 7, 2025
roshankathawate pushed a commit to roshankathawate/ray that referenced this issue Jan 9, 2025
…project#49697)

## Why are these changes needed?

State was being leaked across calls to `serve.run` due to in-place
mutation within `get_deploy_args`.

I've moved the logic into `build_app` and added associated unit tests.
Also added an integration test matching the original bug report.

## Related issue number

Closes ray-project#49074

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Roshan Kathawate <roshankathawate@gmail.com>
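
To illustrate the kind of leak the commit message describes, here is a minimal pure-Python sketch of how in-place mutation of a long-lived object can carry one session's settings into the next. This is not Ray's actual code: `APP_CONFIG`, `deploy_args_mutating`, and `deploy_args_copying` are hypothetical names, and the dictionary layout is an assumption chosen only to mirror the reproduction script, where `app` is bound once and reused for both sessions.

```python
import copy

# Hypothetical stand-in for an application object that is created once
# and reused across sessions (like `app = MyServeDeployment.bind()`).
APP_CONFIG = {"name": "MyServeDeployment", "ray_actor_options": {"num_cpus": 0.1}}


def deploy_args_mutating(app_config, session_runtime_env):
    """Buggy pattern: writes the session's runtime_env into the shared object."""
    opts = app_config["ray_actor_options"]
    # setdefault only fills the key if it is absent, so the value written
    # during the first session is still there in the second one.
    opts.setdefault("runtime_env", session_runtime_env)
    return app_config


def deploy_args_copying(app_config, session_runtime_env):
    """Safer pattern: build the arguments on a deep copy of the shared object."""
    args = copy.deepcopy(app_config)
    args["ray_actor_options"].setdefault("runtime_env", session_runtime_env)
    return args


env1 = {"env_vars": {"MY_ENV_VAR": "my_value1"}}
env2 = {"env_vars": {"MY_ENV_VAR": "my_value2"}}

# Two "sessions" against the same object, using the mutating helper.
buggy_cfg = copy.deepcopy(APP_CONFIG)
deploy_args_mutating(buggy_cfg, env1)
second_buggy = deploy_args_mutating(buggy_cfg, env2)
leaked_value = second_buggy["ray_actor_options"]["runtime_env"]["env_vars"]["MY_ENV_VAR"]

# The same two "sessions", using the copying helper.
safe_cfg = copy.deepcopy(APP_CONFIG)
deploy_args_copying(safe_cfg, env1)
second_safe = deploy_args_copying(safe_cfg, env2)
fresh_value = second_safe["ray_actor_options"]["runtime_env"]["env_vars"]["MY_ENV_VAR"]

print(leaked_value)  # my_value1
print(fresh_value)   # my_value2
```

With the mutating helper the second call still sees the first session's value, which matches the symptom in the report; with the copying helper the shared object is never modified, so each call picks up its own session's value.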
HYLcool pushed a commit to HYLcool/ray that referenced this issue Jan 13, 2025
…project#49697)

Signed-off-by: lielin.hyl <lielin.hyl@alibaba-inc.com>