[Ray Tune & RLlib] Unable to Run RLlib Tune Experiment with Ray Client: AttributeError: 'ForwardRef' object has no attribute '__forward_module__' #28461

Open
peterghaddad opened this issue Sep 13, 2022 · 12 comments
Labels: bug, P1, rllib

Comments

@peterghaddad
Contributor

peterghaddad commented Sep 13, 2022

What happened + What you expected to happen

Running Ray Tune from a Jupyter Notebook with tune.Tuner produces the stack trace below.

I am using the Ray Client to submit the job. I found a similar issue, now closed, that involved Ray Core: #20012

See Reproduction Script Number 1 below.

Put failed:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [6], line 4
      1 from ray.rllib.algorithms.ppo import PPO
      2 from ray import air, tune
----> 4 tuner = tune.Tuner(
      5     PPO,
      6     run_config=air.RunConfig(
      7         name="pbt_humanoid_test",
      8     ),
      9     tune_config=tune.TuneConfig(
     10         num_samples=8,
     11         metric="episode_reward_mean",
     12         mode="max",
     13     ),
     14     param_space={
     15         "env": "Humanoid-v1",
     16         "kl_coeff": 1.0,
     17         "num_workers": 8,
     18         "num_gpus": 1,
     19         "model": {"free_log_std": True},
     20         # These params are tuned from a fixed starting value.
     21         "lambda": 0.95,
     22         "clip_param": 0.2,
     23         "lr": 1e-4,
     24         # These params start off randomly drawn from a set.
     25         "num_sgd_iter": tune.choice([10, 20, 30]),
     26         "sgd_minibatch_size": tune.choice([128, 512, 2048]),
     27         "train_batch_size": tune.choice([10000, 20000, 40000]),
     28     },
     29 )
     30 results = tuner.fit()

File ~/.local/lib/python3.9/site-packages/ray/tune/tuner.py:146, in Tuner.__init__(self, trainable, param_space, tune_config, run_config, _tuner_kwargs, _tuner_internal)
    144     self._local_tuner = TunerInternal(**kwargs)
    145 else:
--> 146     self._remote_tuner = _force_on_current_node(
    147         ray.remote(num_cpus=0)(TunerInternal)
    148     ).remote(**kwargs)

File ~/.local/lib/python3.9/site-packages/ray/actor.py:637, in ActorClass.options.<locals>.ActorOptionWrapper.remote(self, *args, **kwargs)
    636 def remote(self, *args, **kwargs):
--> 637     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)

File ~/.local/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py:387, in _tracing_actor_creation.<locals>._invocation_actor_class_remote_span(self, args, kwargs, *_args, **_kwargs)
    385 if not _is_tracing_enabled():
    386     assert "_ray_trace_ctx" not in kwargs
--> 387     return method(self, args, kwargs, *_args, **_kwargs)
    389 class_name = self.__ray_metadata__.class_name
    390 method_name = "__init__"

File ~/.local/lib/python3.9/site-packages/ray/actor.py:765, in ActorClass._remote(self, args, kwargs, **actor_options)
    762     actor_options["max_concurrency"] = 1000 if is_asyncio else 1
    764 if client_mode_should_convert(auto_init=True):
--> 765     return client_mode_convert_actor(self, args, kwargs, **actor_options)
    767 # fill actor required options
    768 for k, v in ray_option_utils.actor_options.items():

File ~/.local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:198, in client_mode_convert_actor(actor_cls, in_args, in_kwargs, **kwargs)
    196     setattr(actor_cls, RAY_CLIENT_MODE_ATTR, key)
    197 client_actor = ray._get_converted(key)
--> 198 return client_actor._remote(in_args, in_kwargs, **kwargs)

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:396, in ClientActorClass._remote(self, args, kwargs, **option_args)
    394 if kwargs is None:
    395     kwargs = {}
--> 396 return self.options(**option_args).remote(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:593, in ActorOptionWrapper.remote(self, *args, **kwargs)
    591 def remote(self, *args, **kwargs):
    592     self._remote_stub._init_signature.bind(*args, **kwargs)
--> 593     futures = ray.call_remote(self, *args, **kwargs)
    594     assert len(futures) == 1
    595     actor_class = None

File ~/.local/lib/python3.9/site-packages/ray/util/client/api.py:100, in _ClientAPI.call_remote(self, instance, *args, **kwargs)
     86 def call_remote(self, instance: "ClientStub", *args, **kwargs) -> List[Future]:
     87     """call_remote is called by stub objects to execute them remotely.
     88 
     89     This is used by stub objects in situations where they're called
   (...)
     98         kwargs: opaque keyword arguments
     99     """
--> 100     return self.worker.call_remote(instance, *args, **kwargs)

File ~/.local/lib/python3.9/site-packages/ray/util/client/worker.py:545, in Worker.call_remote(self, instance, *args, **kwargs)
    544 def call_remote(self, instance, *args, **kwargs) -> List[Future]:
--> 545     task = instance._prepare_client_task()
    546     # data is serialized tuple of (args, kwargs)
    547     task.data = dumps_from_client((args, kwargs), self._client_id)

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:578, in OptionWrapper._prepare_client_task(self)
    577 def _prepare_client_task(self):
--> 578     task = self._remote_stub._prepare_client_task()
    579     set_task_options(task, self._options)
    580     return task

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:407, in ClientActorClass._prepare_client_task(self)
    406 def _prepare_client_task(self) -> ray_client_pb2.ClientTask:
--> 407     self._ensure_ref()
    408     task = ray_client_pb2.ClientTask()
    409     task.type = ray_client_pb2.ClientTask.ACTOR

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:377, in ClientActorClass._ensure_ref(self)
    374 # Check pickled size before sending it to server, which is more
    375 # efficient and can be done synchronously inside remote() call.
    376 check_oversized_function(data, self._name, "actor", None)
--> 377 self._ref = ray.worker._put_pickled(
    378     data, client_ref_id=self._client_side_ref.id
    379 )

File ~/.local/lib/python3.9/site-packages/ray/util/client/worker.py:499, in Worker._put_pickled(self, data, client_ref_id)
    497 if not resp.valid:
    498     try:
--> 499         raise cloudpickle.loads(resp.error)
    500     except (pickle.UnpicklingError, TypeError):
    501         logger.exception("Failed to deserialize {}".format(resp.error))

AttributeError: 'ForwardRef' object has no attribute '__forward_module__'

Versions / Dependencies

Running Ray 2.0 with Kuberay v0.3.0

pip list

absl-py                      0.11.0
aiohttp                      3.8.1
aiohttp-cors                 0.7.0
aiorwlock                    1.3.0
aiosignal                    1.2.0
anyio                        3.6.1
argon2-cffi                  21.3.0
argon2-cffi-bindings         21.2.0
asgiref                      3.5.2
asttokens                    2.0.8
astunparse                   1.6.3
async-timeout                4.0.2
attrs                        22.1.0
Babel                        2.10.3
backcall                     0.2.0
beautifulsoup4               4.11.1
bleach                       5.0.1
blessed                      1.19.1
Box2D-kengz                  2.3.3
box2d-py                     2.3.8
cachetools                   4.2.4
certifi                      2022.6.15
cffi                         1.15.1
charset-normalizer           2.1.1
click                        7.1.2
cloudpickle                  1.6.0
colorful                     0.5.4
cycler                       0.11.0
debugpy                      1.6.3
decorator                    5.1.1
defusedxml                   0.7.1
Deprecated                   1.2.13
distlib                      0.3.6
dm-tree                      0.1.7
docstring-parser             0.15
entrypoints                  0.4
enum34                       1.1.8
executing                    1.0.0
fastapi                      0.82.0
fastjsonschema               2.16.1
ffmpeg-python                0.2.0
filelock                     3.8.0
fire                         0.4.0
flatbuffers                  2.0.7
fonttools                    4.37.1
frozenlist                   1.3.1
future                       0.18.2
gast                         0.5.3
google-api-core              2.8.2
google-api-python-client     1.12.11
google-auth                  1.35.0
google-auth-httplib2         0.1.0
google-auth-oauthlib         0.4.6
google-cloud-core            2.3.2
google-cloud-storage         1.44.0
google-crc32c                1.5.0
google-pasta                 0.2.0
google-resumable-media       2.3.3
googleapis-common-protos     1.56.4
gpustat                      1.0.0
GPUtil                       1.4.0
grpcio                       1.43.0
gym                          0.23.1
gym-notices                  0.0.8
h11                          0.13.0
h5py                         3.7.0
httplib2                     0.20.4
idna                         3.3
imageio                      2.21.2
importlib-metadata           4.12.0
ipykernel                    6.15.2
ipython                      8.5.0
ipython-genutils             0.2.0
jedi                         0.18.1
Jinja2                       3.1.2
json5                        0.9.10
jsonschema                   3.2.0
jupyter_client               7.3.5
jupyter-core                 4.11.1
jupyter-server               1.18.1
jupyter-server-proxy         3.2.1
jupyterlab                   3.4.4
jupyterlab-pygments          0.2.2
jupyterlab_server            2.15.1
keras                        2.8.0
Keras-Preprocessing          1.1.2
kfp                          1.7.2
kfp-pipeline-spec            0.1.16
kfp-server-api               1.8.5
kfp-tekton                   1.1.0
kiwisolver                   1.4.4
kubernetes                   12.0.1
libclang                     14.0.6
lxml                         4.9.1
lz4                          4.0.2
Markdown                     3.4.1
MarkupSafe                   2.1.1
matplotlib                   3.5.3
matplotlib-inline            0.1.6
mistune                      2.0.4
msgpack                      1.0.4
multidict                    6.0.2
nbclassic                    0.4.3
nbclient                     0.6.7
nbconvert                    7.0.0
nbformat                     5.4.0
nest-asyncio                 1.5.5
networkx                     2.8.6
notebook                     6.4.12
notebook-shim                0.1.0
numpy                        1.23.2
nvidia-ml-py                 11.495.46
oauthlib                     3.2.0
opencensus                   0.11.0
opencensus-context           0.1.3
opencv-python                4.6.0.66
opt-einsum                   3.3.0
packaging                    21.3
pandas                       1.4.4
pandocfilters                1.5.0
parso                        0.8.3
pexpect                      4.8.0
pickleshare                  0.7.5
Pillow                       9.2.0
pip                          22.2.2
platformdirs                 2.5.2
prometheus-client            0.13.1
prompt-toolkit               3.0.31
protobuf                     3.19.4
psutil                       5.9.2
ptyprocess                   0.7.0
pure-eval                    0.2.2
py-spy                       0.3.14
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pycparser                    2.21
pydantic                     1.10.2
pygame                       2.1.2
Pygments                     2.13.0
pyparsing                    3.0.9
pyrsistent                   0.18.1
python-dateutil              2.8.2
pytz                         2022.2.1
PyWavelets                   1.3.0
PyYAML                       5.4.1
pyzmq                        23.2.1
ray                          2.0.0
requests                     2.28.1
requests-oauthlib            1.3.1
requests-toolbelt            0.9.1
rsa                          4.9
scikit-image                 0.19.3
scipy                        1.9.1
Send2Trash                   1.8.0
setuptools                   57.5.0
simpervisor                  0.4
six                          1.16.0
smart-open                   6.1.0
sniffio                      1.3.0
soupsieve                    2.3.2.post1
stack-data                   0.5.0
starlette                    0.19.1
strip-hints                  0.1.10
tabulate                     0.8.10
tensorboard                  2.8.0
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorboardX                 2.5.1
tensorflow                   2.8.3
tensorflow-estimator         2.8.0
tensorflow-io-gcs-filesystem 0.27.0
termcolor                    1.1.0
terminado                    0.15.0
tifffile                     2022.8.12
tinycss2                     1.1.1
tornado                      6.2
traitlets                    5.3.0
typing_extensions            4.3.0
uritemplate                  3.0.1
urllib3                      1.26.12
uvicorn                      0.16.0
virtualenv                   20.16.5
wcwidth                      0.2.5
webencodings                 0.5.1
websocket-client             1.4.1
Werkzeug                     2.2.2
wheel                        0.37.0
wrapt                        1.14.1
yarl                         1.8.1
zipp                         3.8.1

A separate side issue: I receive the following warnings when running the example script through the Job SDK. This only occurs if I set the working directory to a directory other than my current $PWD.

"/home/ray/anaconda3/lib/python3.9/site-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n  import imp\n/home/ray/anaconda3/lib/python3.9/site-packages/botocore/vendored/requests/packages/urllib3/_collections.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working\n  from collections import Mapping, MutableMapping\n"

This is the specific example I am following.

Reproduction script

Reproduction Script Number 1

import ray

runtime_env = {"pip": ["awscli", "tensorflow", "gym", "torch"]}

# ray_init_url is the Ray Client address of the cluster (a ray://<host>:<port> URL).
info = ray.init(
    address=ray_init_url,
    runtime_env=runtime_env)

# Start Ray Tune


from ray import tune, air
from ray.tune.syncer import (
    SyncConfig
)
from ray.rllib.algorithms.ppo import PPO

tuner = tune.Tuner(
    PPO,
    run_config=air.RunConfig(
        name="pbt_humanoid_test",
    ),
    tune_config=tune.TuneConfig(
        num_samples=8,
        metric="episode_reward_mean",
        mode="max",
    ),
    param_space={
        "env": "Humanoid-v1",
        "kl_coeff": 1.0,
        "num_workers": 8,
        "num_gpus": 1,
        "model": {"free_log_std": True},
        # These params are tuned from a fixed starting value.
        "lambda": 0.95,
        "clip_param": 0.2,
        "lr": 1e-4,
        # These params start off randomly drawn from a set.
        "num_sgd_iter": tune.choice([10, 20, 30]),
        "sgd_minibatch_size": tune.choice([128, 512, 2048]),
        "train_batch_size": tune.choice([10000, 20000, 40000]),
    },
)
results = tuner.fit()


I have also tried a simpler script, which gives exactly the same error:

```python

analysis = tune.Tuner(
    PPO,
    run_config=air.config.RunConfig(
        local_dir="/storage/",
        sync_config=tune.SyncConfig(
            syncer=None
        ),
    ),
    tune_config=tune.TuneConfig(
        metric="episode_reward_mean",
        mode="max"
    ),
    param_space={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 1,
        "lr": tune.grid_search([0.01]),
    }
)
```

Issue Severity

High: It blocks me from completing my task.
peterghaddad added the bug and triage labels on Sep 13, 2022
peterghaddad changed the title from "[Ray Tune] Unable to Run Any Tune Experiment with Ray Client: AttributeError: 'ForwardRef' object has no attribute '__forward_module__'" to "[Ray Tune & RLlib] Unable to Run RLlib Tune Experiment with Ray Client: AttributeError: 'ForwardRef' object has no attribute '__forward_module__'" on Sep 13, 2022
@DmitriGekhtman
Contributor

Tagging @ckw017.

I'd lean towards using Ray job submission, rather than the Ray Client, for submitting jobs.

@peterghaddad
Contributor Author

peterghaddad commented Sep 13, 2022

@DmitriGekhtman Using the Job SDK is fine, but is there a way to return a value after the job completes? For example, returning the best checkpoint at the end of the job script would be useful. We can call get_job_logs, but I don't see a way to return a specific value after job completion.

@DmitriGekhtman
Contributor

Getting a Python object out of a job isn't currently supported, but I think it could make sense and might not be too bad to implement.
From a Ray developer perspective, it would be great if we can slightly enrich job submission to achieve parity with the relevant Ray client features.

If you open a Ray feature request issue, we could get more discussion going.
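
Until something like that exists, one possible workaround is to pass a small value back through the job logs. The sketch below is only an illustration under stated assumptions: the marker string, the helper names `emit_result` and `extract_result`, and the checkpoint path are all hypothetical, and the log text is assumed to come from something like the get_job_logs call mentioned above. It works only for JSON-serializable values, not arbitrary Python objects.

```python
import json
from typing import Any, Optional

# Hypothetical marker; any string unlikely to appear in normal logs would do.
RESULT_MARKER = "JOB_RESULT_JSON:"


def emit_result(value: Any) -> str:
    """Run at the end of the job script: print the result on one marked line."""
    line = RESULT_MARKER + json.dumps(value)
    print(line)
    return line


def extract_result(logs: str) -> Optional[Any]:
    """Run on the submitting side: return the last marked value found in the logs.

    Returns None when no marked line is present.
    """
    result = None
    for line in logs.splitlines():
        idx = line.find(RESULT_MARKER)
        if idx != -1:
            # Everything after the marker is expected to be a JSON document.
            result = json.loads(line[idx + len(RESULT_MARKER):])
    return result
```

For example, the job script could end with `emit_result({"best_checkpoint": "/storage/ckpt_10"})`, and the submitter would then call `extract_result` on the fetched log text.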

@peterghaddad
Contributor Author

I agree, and I created the following issue. The script above doesn't take long to execute, so I would think the Ray Client is fine to use. I do believe this issue is still relevant, and any insight would be helpful.

@peterghaddad
Contributor Author

peterghaddad commented Sep 15, 2022

@DmitriGekhtman @ckw017 FYI, I get the same error when using Ray Serve with the Ray Client.

Put failed:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [7], line 34
     30         obs = json_input["observation"]
     32         action = self.trainer.compute_s
---> 34 serve.start(
     35     http_options={"host": "0.0.0.0"}
     36 )
     38 ServePPOModel.deploy(best_checkpoint_path)

File ~/.local/lib/python3.9/site-packages/ray/serve/api.py:99, in start(detached, http_options, dedicated_cpu, **kwargs)
     53 @guarded_deprecation_warning(instructions=MIGRATION_MESSAGE)
     54 @Deprecated(message=MIGRATION_MESSAGE)
     55 def start(
   (...)
     59     **kwargs,
     60 ) -> ServeControllerClient:
     61     """Initialize a serve instance.
     62 
     63     By default, the instance will be scoped to the lifetime of the returned
   (...)
     97           Serve controller actor.  Defaults to False.
     98     """
---> 99     client = _private_api.serve_start(detached, http_options, dedicated_cpu, **kwargs)
    101     # Record after Ray has been started.
    102     record_extra_usage_tag(TagKey.SERVE_API_VERSION, "v1")

File ~/.local/lib/python3.9/site-packages/ray/serve/_private/api.py:196, in serve_start(detached, http_options, dedicated_cpu, **kwargs)
    193 # Used for scheduling things to the head node explicitly.
    194 # Assumes that `serve.start` runs on the head node.
    195 head_node_id = ray.get_runtime_context().node_id.hex()
--> 196 controller = ServeController.options(
    197     num_cpus=1 if dedicated_cpu else 0,
    198     name=controller_name,
    199     lifetime="detached" if detached else None,
    200     max_restarts=-1,
    201     max_task_retries=-1,
    202     # Schedule the controller on the head node with a soft constraint. This
    203     # prefers it to run on the head node in most cases, but allows it to be
    204     # restarted on other nodes in an HA cluster.
    205     scheduling_strategy=NodeAffinitySchedulingStrategy(head_node_id, soft=True)
    206     if RAY_INTERNAL_SERVE_CONTROLLER_PIN_ON_NODE
    207     else None,
    208     namespace=SERVE_NAMESPACE,
    209     max_concurrency=CONTROLLER_MAX_CONCURRENCY,
    210 ).remote(
    211     controller_name,
    212     http_config=http_options,
    213     head_node_id=head_node_id,
    214     detached=detached,
    215 )
    217 proxy_handles = ray.get(controller.get_http_proxies.remote())
    218 if len(proxy_handles) > 0:

File ~/.local/lib/python3.9/site-packages/ray/actor.py:637, in ActorClass.options.<locals>.ActorOptionWrapper.remote(self, *args, **kwargs)
    636 def remote(self, *args, **kwargs):
--> 637     return actor_cls._remote(args=args, kwargs=kwargs, **updated_options)

File ~/.local/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py:387, in _tracing_actor_creation.<locals>._invocation_actor_class_remote_span(self, args, kwargs, *_args, **_kwargs)
    385 if not _is_tracing_enabled():
    386     assert "_ray_trace_ctx" not in kwargs
--> 387     return method(self, args, kwargs, *_args, **_kwargs)
    389 class_name = self.__ray_metadata__.class_name
    390 method_name = "__init__"

File ~/.local/lib/python3.9/site-packages/ray/actor.py:765, in ActorClass._remote(self, args, kwargs, **actor_options)
    762     actor_options["max_concurrency"] = 1000 if is_asyncio else 1
    764 if client_mode_should_convert(auto_init=True):
--> 765     return client_mode_convert_actor(self, args, kwargs, **actor_options)
    767 # fill actor required options
    768 for k, v in ray_option_utils.actor_options.items():

File ~/.local/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:198, in client_mode_convert_actor(actor_cls, in_args, in_kwargs, **kwargs)
    196     setattr(actor_cls, RAY_CLIENT_MODE_ATTR, key)
    197 client_actor = ray._get_converted(key)
--> 198 return client_actor._remote(in_args, in_kwargs, **kwargs)

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:396, in ClientActorClass._remote(self, args, kwargs, **option_args)
    394 if kwargs is None:
    395     kwargs = {}
--> 396 return self.options(**option_args).remote(*args, **kwargs)

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:593, in ActorOptionWrapper.remote(self, *args, **kwargs)
    591 def remote(self, *args, **kwargs):
    592     self._remote_stub._init_signature.bind(*args, **kwargs)
--> 593     futures = ray.call_remote(self, *args, **kwargs)
    594     assert len(futures) == 1
    595     actor_class = None

File ~/.local/lib/python3.9/site-packages/ray/util/client/api.py:100, in _ClientAPI.call_remote(self, instance, *args, **kwargs)
     86 def call_remote(self, instance: "ClientStub", *args, **kwargs) -> List[Future]:
     87     """call_remote is called by stub objects to execute them remotely.
     88 
     89     This is used by stub objects in situations where they're called
   (...)
     98         kwargs: opaque keyword arguments
     99     """
--> 100     return self.worker.call_remote(instance, *args, **kwargs)

File ~/.local/lib/python3.9/site-packages/ray/util/client/worker.py:545, in Worker.call_remote(self, instance, *args, **kwargs)
    544 def call_remote(self, instance, *args, **kwargs) -> List[Future]:
--> 545     task = instance._prepare_client_task()
    546     # data is serialized tuple of (args, kwargs)
    547     task.data = dumps_from_client((args, kwargs), self._client_id)

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:578, in OptionWrapper._prepare_client_task(self)
    577 def _prepare_client_task(self):
--> 578     task = self._remote_stub._prepare_client_task()
    579     set_task_options(task, self._options)
    580     return task

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:407, in ClientActorClass._prepare_client_task(self)
    406 def _prepare_client_task(self) -> ray_client_pb2.ClientTask:
--> 407     self._ensure_ref()
    408     task = ray_client_pb2.ClientTask()
    409     task.type = ray_client_pb2.ClientTask.ACTOR

File ~/.local/lib/python3.9/site-packages/ray/util/client/common.py:377, in ClientActorClass._ensure_ref(self)
    374 # Check pickled size before sending it to server, which is more
    375 # efficient and can be done synchronously inside remote() call.
    376 check_oversized_function(data, self._name, "actor", None)
--> 377 self._ref = ray.worker._put_pickled(
    378     data, client_ref_id=self._client_side_ref.id
    379 )

File ~/.local/lib/python3.9/site-packages/ray/util/client/worker.py:499, in Worker._put_pickled(self, data, client_ref_id)
    497 if not resp.valid:
    498     try:
--> 499         raise cloudpickle.loads(resp.error)
    500     except (pickle.UnpicklingError, TypeError):
    501         logger.exception("Failed to deserialize {}".format(resp.error))

AttributeError: 'ForwardRef' object has no attribute '__forward_module__'
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-09-15 18:00:45,873	WARNING dataclient.py:395 -- Encountered connection issues in the data channel. Attempting to reconnect.

@ckw017
Member

ckw017 commented Sep 15, 2022

Will take a look today. What version of Ray did you spot this on?

@peterghaddad
Contributor Author

peterghaddad commented Sep 15, 2022

@ckw017 This is on Ray 2.0. My full pip list is above. I'm seeing this both when running locally and when running in a Jupyter Notebook in the cloud.

@ckw017
Member

ckw017 commented Sep 15, 2022

Ah hmm, found this issue: #26443

Can you double-check the Python version where you're running the script and the Python version where the cluster is running?
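
A quick way to collect this is to run the same small snippet in both places (in the notebook and in a Python shell on the head node) and compare the output. This is a minimal sketch using only the standard library; the function name `python_version_report` is made up for illustration.

```python
import platform
import sys


def python_version_report() -> dict:
    """Collect the interpreter details that matter when comparing client and cluster."""
    return {
        "version": platform.python_version(),            # full version string
        "version_info": tuple(sys.version_info[:3]),     # (major, minor, patch)
        "implementation": platform.python_implementation(),
    }


print(python_version_report())
```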

@peterghaddad
Contributor Author

peterghaddad commented Sep 15, 2022

@ckw017 Server is running 3.9.5 and client is running 3.9.7.

The Ray image I am using is ray:2.0.0-py39. I will do some additional testing on my side. If you get time, can you try using a later version of Python to see if you run into this issue as well? Might be a good test case.

@peterghaddad
Contributor Author

Is the hope to enable Ray to work with 3.9.7? Will the ray:2.0.0-py39 image ever move from Python 3.9.5 to a later patch release? The problem is definitely with 3.9.7 or greater.

@ckw017
Member

ckw017 commented Sep 15, 2022

I'll check if I run into the same issue. The easiest fix I can think of here would be to switch your local version to 3.9.5 (if you're using conda, you can do something like conda create -n py395 python=3.9.5).

I suspect that Ray will work on 3.9.7; it would just need a compatible version of Python (i.e. the same patch version) on the client side as well.
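
One plausible explanation for why the patch version matters, offered as an assumption rather than something confirmed in this thread: `typing.ForwardRef` appears to carry a `__forward_module__` attribute on newer 3.9 patch releases but not on older ones, so an object pickled on the newer client cannot be restored on the older cluster. The probe below only reports whether the running interpreter has the attribute; the function name is hypothetical.

```python
import sys
import typing


def forward_ref_has_module_attr() -> bool:
    """Report whether this interpreter's ForwardRef exposes __forward_module__."""
    ref = typing.ForwardRef("SomeType")  # "SomeType" is an arbitrary placeholder name
    return hasattr(ref, "__forward_module__")


# Running this on both the client and the cluster shows whether they differ.
print(sys.version_info[:3], forward_ref_has_module_attr())
```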

@ckw017
Member

ckw017 commented Sep 15, 2022

^Was able to reproduce with 3.9.7 client and 3.9.5 cluster. I'll add a warning if the user's local patch version mismatches the cluster's patch version.

I wanted to double-check whether the workaround of using 3.9.5 locally is possible for you.
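
A patch-version warning of the kind mentioned above might look roughly like this. It is only a sketch, not Ray's actual implementation: `parse_version` and `check_python_versions` are hypothetical names, and the version strings are assumed to be plain "X.Y.Z" values with no pre-release suffix.

```python
import warnings
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a plain 'X.Y.Z' version string into a tuple of integers."""
    major, minor, patch = version.split(".")[:3]
    return int(major), int(minor), int(patch)


def check_python_versions(client: str, cluster: str) -> bool:
    """Return True if the versions match exactly; otherwise warn and return False."""
    client_v = parse_version(client)
    cluster_v = parse_version(cluster)
    if client_v == cluster_v:
        return True
    # Distinguish a major/minor mismatch from a patch-only mismatch in the message.
    kind = "patch" if client_v[:2] == cluster_v[:2] else "major/minor"
    warnings.warn(
        f"Python {kind} version mismatch: client is {client}, cluster is {cluster}. "
        "Objects pickled on one side may fail to load on the other.",
        RuntimeWarning,
    )
    return False
```

With the versions reported in this thread, `check_python_versions("3.9.7", "3.9.5")` would emit the warning and return False.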

richardliaw added the rllib label on Oct 7, 2022
kouroshHakha added the P1 label and removed the triage label on Oct 26, 2022
5 participants