[Core] Observing Multiple Exceptions When Using Different Python Patch Versions #26443

Open
johnwalz97 opened this issue Jul 11, 2022 · 6 comments
Labels: bug (Something that is supposed to be working; but isn't), core-client (ray client related issues), P2 (Important issue, but not time-critical)

Comments

johnwalz97 commented Jul 11, 2022

What happened + What you expected to happen

I am using a Ray cluster running on Kubernetes (managed by the KubeRay operator). The head and worker pods use the official rayproject/ray:1.13.1-py39 Docker image. The Ray client is started on another pod running the Jupyter Docker Stacks image, which provides a JupyterLab server, also with Python 3.9. I'm running into quite a few scheduler exceptions when trying to run examples from the Ray docs. At first I suspected the problem was the Ray client being on a separate machine from the Ray head, or mismatched versions of the Ray Python library, but after testing with clients running different patch versions of Python, the Python patch version appears to be the cause. Everything works fine when connecting with a client using Python 3.9.5 or 3.9.6 (3.9.5 is what the Ray head and worker pods are using), but I start seeing bugs when using a client with Python 3.9.13. I haven't narrowed down exactly which versions in between do or don't work, but I tried the different versions on multiple client machines (running both macOS and Linux) with the same results. See below for two of the bugs I was getting:
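For reference, a minimal sketch of how the client attaches from the JupyterLab pod (the head-service address below is a placeholder; Ray Client's server listens on port 10001 by default):

import ray

# Placeholder address for the KubeRay head service; substitute your own.
ray.init("ray://raycluster-head-svc:10001")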

Trying to run a very simple example from the docs using ray.data.range always gives me the same error on a client running 3.9.13:

ds = ray.data.range(10000)

outputs the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_13038/1519848610.py in <cell line: 7>()
      5 # ])
      6 
----> 7 ds = ray.data.range(10000)
      8 
      9 # or lets load all data for 2009

/opt/conda/lib/python3.9/site-packages/ray/data/read_api.py in range(n, parallelism)
    130         Dataset holding the integers.
    131     """
--> 132     return read_datasource(
    133         RangeDatasource(), parallelism=parallelism, n=n, block_format="list"
    134     )

/opt/conda/lib/python3.9/site-packages/ray/data/read_api.py in read_datasource(datasource, parallelism, ray_remote_args, **read_args)
    238             _prepare_read, retry_exceptions=False, num_cpus=0
    239         )
--> 240         read_tasks = ray.get(
    241             prepare_read.remote(
    242                 datasource,

/opt/conda/lib/python3.9/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    102             # we only convert init function if RAY_CLIENT_MODE=1
    103             if func.__name__ != "init" or is_client_mode_enabled_by_default:
--> 104                 return getattr(ray, func.__name__)(*args, **kwargs)
    105         return func(*args, **kwargs)
    106 

/opt/conda/lib/python3.9/site-packages/ray/util/client/api.py in get(self, vals, timeout)
     42             timeout: Optional timeout in milliseconds
     43         """
---> 44         return self.worker.get(vals, timeout=timeout)
     45 
     46     def put(self, *args, **kwargs):

/opt/conda/lib/python3.9/site-packages/ray/util/client/worker.py in get(self, vals, timeout)
    436                 op_timeout = max_blocking_operation_time
    437             try:
--> 438                 res = self._get(to_get, op_timeout)
    439                 break
    440             except GetTimeoutError:

/opt/conda/lib/python3.9/site-packages/ray/util/client/worker.py in _get(self, ref, timeout)
    480         except grpc.RpcError as e:
    481             raise decode_exception(e)
--> 482         return loads_from_server(data)
    483 
    484     def put(self, val, *, client_ref_id: bytes = None):

/opt/conda/lib/python3.9/site-packages/ray/util/client/client_pickler.py in loads_from_server(data, fix_imports, encoding, errors)
    174         raise TypeError("Can't load pickle from unicode string")
    175     file = io.BytesIO(data)
--> 176     return ServerUnpickler(
    177         file, fix_imports=fix_imports, encoding=encoding, errors=errors
    178     ).load()

/opt/conda/lib/python3.9/typing.py in inner(*args, **kwds)
    272         def inner(*args, **kwds):
    273             try:
--> 274                 return cached(*args, **kwds)
    275             except TypeError:
    276                 pass  # All real errors (not unhashable args) are raised below.

/opt/conda/lib/python3.9/typing.py in __hash__(self)
    573 
    574     def __hash__(self):
--> 575         return hash((self.__forward_arg__, self.__forward_module__))
    576 
    577     def __repr__(self):

AttributeError: __forward_module__

Running a more complex NYC Taxi example from the Ray docs:

# What's the longets trip distance, largest tip amount, and most number of passengers?
ds.max(["trip_distance", "tip_amount", "passenger_count"])

Gives the following error:

Read:   0%|          | 0/2 [00:00<?, ?it/s]Caught schedule exception
Caught schedule exception
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_13000/2423519160.py in <cell line: 2>()
      1 # What's the longets trip distance, largest tip amount, and most number of passengers?
----> 2 ds.max(["trip_distance", "tip_amount", "passenger_count"])

/opt/conda/lib/python3.9/site-packages/ray/data/dataset.py in max(self, on, ignore_nulls)
   1399             AND ``ignore_nulls`` is ``False``, then the output will be None.
   1400         """
-> 1401         ret = self._aggregate_on(Max, on, ignore_nulls)
   1402         return self._aggregate_result(ret)
   1403 

/opt/conda/lib/python3.9/site-packages/ray/data/dataset.py in _aggregate_on(self, agg_cls, on, *args, **kwargs)
   3290         """
   3291         aggs = self._build_multicolumn_aggs(agg_cls, on, *args, **kwargs)
-> 3292         return self.aggregate(*aggs)
   3293 
   3294     def _build_multicolumn_aggs(

/opt/conda/lib/python3.9/site-packages/ray/data/dataset.py in aggregate(self, *aggs)
   1222             If the dataset is empty, return ``None``.
   1223         """
-> 1224         ret = self.groupby(None).aggregate(*aggs).take(1)
   1225         return ret[0] if len(ret) > 0 else None
   1226 

/opt/conda/lib/python3.9/site-packages/ray/data/grouped_dataset.py in aggregate(self, *aggs)
    133 
    134         plan = self._dataset._plan.with_stage(AllToAllStage("aggregate", None, do_agg))
--> 135         return Dataset(
    136             plan,
    137             self._dataset._epoch,

/opt/conda/lib/python3.9/site-packages/ray/data/dataset.py in __init__(self, plan, epoch, lazy)
    175 
    176         if not lazy:
--> 177             self._plan.execute(allow_clear_input_blocks=False)
    178 
    179     @staticmethod

/opt/conda/lib/python3.9/site-packages/ray/data/impl/plan.py in execute(self, allow_clear_input_blocks, force_read)
    255                     clear_input_blocks = False
    256                 stats_builder = stats.child_builder(stage.name)
--> 257                 blocks, stage_info = stage(blocks, clear_input_blocks)
    258                 if stage_info:
    259                     stats = stats_builder.build_multistage(stage_info)

/opt/conda/lib/python3.9/site-packages/ray/data/impl/plan.py in __call__(self, blocks, clear_input_blocks)
    434     ) -> Tuple[BlockList, dict]:
    435         compute = get_compute(self.compute)
--> 436         blocks = compute._apply(
    437             self.block_fn, self.ray_remote_args, blocks, clear_input_blocks, self.name
    438         )

/opt/conda/lib/python3.9/site-packages/ray/data/impl/compute.py in _apply(self, fn, remote_args, block_list, clear_input_blocks, name)
     70         # Common wait for non-data refs.
     71         try:
---> 72             results = map_bar.fetch_until_complete(refs)
     73         except (ray.exceptions.RayTaskError, KeyboardInterrupt) as e:
     74             # One or more mapper tasks failed, or we received a SIGINT signal

/opt/conda/lib/python3.9/site-packages/ray/data/impl/progress_bar.py in fetch_until_complete(self, refs)
     72         t = threading.current_thread()
     73         while remaining:
---> 74             done, remaining = ray.wait(remaining, fetch_local=True, timeout=0.1)
     75             for ref, result in zip(done, ray.get(done)):
     76                 ref_to_result[ref] = result

/opt/conda/lib/python3.9/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    102             # we only convert init function if RAY_CLIENT_MODE=1
    103             if func.__name__ != "init" or is_client_mode_enabled_by_default:
--> 104                 return getattr(ray, func.__name__)(*args, **kwargs)
    105         return func(*args, **kwargs)
    106 

/opt/conda/lib/python3.9/site-packages/ray/util/client/api.py in wait(self, *args, **kwargs)
     61             kwargs: opaque keyword arguments
     62         """
---> 63         return self.worker.wait(*args, **kwargs)
     64 
     65     def remote(self, *args, **kwargs):

/opt/conda/lib/python3.9/site-packages/ray/util/client/worker.py in wait(self, object_refs, num_returns, timeout, fetch_local)
    527                 )
    528         data = {
--> 529             "object_ids": [object_ref.id for object_ref in object_refs],
    530             "num_returns": num_returns,
    531             "timeout": timeout if (timeout is not None) else -1,

/opt/conda/lib/python3.9/site-packages/ray/util/client/worker.py in <listcomp>(.0)
    527                 )
    528         data = {
--> 529             "object_ids": [object_ref.id for object_ref in object_refs],
    530             "num_returns": num_returns,
    531             "timeout": timeout if (timeout is not None) else -1,

/opt/conda/lib/python3.9/site-packages/ray/util/client/common.py in id(self)
    139     @property
    140     def id(self):
--> 141         return self.binary()
    142 
    143     def future(self) -> Future:

/opt/conda/lib/python3.9/site-packages/ray/util/client/common.py in binary(self)
    118 
    119     def binary(self):
--> 120         self._wait_for_id()
    121         return super().binary()
    122 

/opt/conda/lib/python3.9/site-packages/ray/util/client/common.py in _wait_for_id(self, timeout)
    195             with self._mutex:
    196                 if self._id_future:
--> 197                     self._set_id(self._id_future.result(timeout=timeout))
    198                     self._id_future = None
    199 

/opt/conda/lib/python3.9/concurrent/futures/_base.py in result(self, timeout)
    444                     raise CancelledError()
    445                 elif self._state == FINISHED:
--> 446                     return self.__get_result()
    447                 else:
    448                     raise TimeoutError()

/opt/conda/lib/python3.9/concurrent/futures/_base.py in __get_result(self)
    389         if self._exception:
    390             try:
--> 391                 raise self._exception
    392             finally:
    393                 # Break a reference cycle with the exception in self._exception

AttributeError: 'ForwardRef' object has no attribute '__forward_is_class__'

There were other similar errors as well, but I imagine they are all caused by the same thing. To work around this for now, I am going to try building the Ray worker/head image with Python 3.9.13 and see if that works. If it doesn't, I will switch to Python 3.9.6 on the JupyterLab server image. But I wanted to get this on the radar in case anyone else is running into similar issues.

Versions / Dependencies

Ray Head/Worker

OS:

$ cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Python:

$ python --version
Python 3.9.5

Ray:

$ pip list | grep ray
ray                                    1.13.0

Client

OS:

$ cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Python:

$ python --version
Python 3.9.13

Ray:

$ pip list | grep ray
ray                                    1.13.0

Reproduction script

To reproduce:

  1. Create a Ray cluster that is running on Python 3.9.5
  2. Connect to that Ray cluster on a client running Python 3.9.13
  3. Run the following snippets of code:
import ray

ds = ray.data.range(10000)
ds = ray.data.read_parquet([
    "s3://ursa-labs-taxi-data/2009/01/data.parquet",
    "s3://ursa-labs-taxi-data/2009/02/data.parquet",
])
  4. You should see the same errors as above

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@johnwalz97 added the bug and triage labels on Jul 11, 2022
@jdonzallaz

Hi,
I have the same error (AttributeError: __forward_module__) when calling serve.start(detached=True) on Python 3.9.13, 3.9.12, and 3.9.11.
It works fine with Python 3.9.9 and 3.9.10.
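For reference, the call pattern in question is roughly the following (a sketch; the ray:// address is a placeholder):

import ray
from ray import serve

# Connect via Ray Client (placeholder address), then start Serve detached.
ray.init("ray://raycluster-head-svc:10001")
serve.start(detached=True)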

@johnwalz97

Yeah, it seems like there are some compatibility issues between different Python patch versions. We just had to experiment with different ones until it worked... Not a great solution, so hopefully this gets some attention and maybe gets fixed soon 🤞

@pauleikis

It seems that in this particular case the problem is caused by bpo-41249 and this particular commit, python/cpython@fa674bd, released in 3.9.7, which introduced the __forward_module__ attribute.

It seems that Ray is serializing (pickling?) Python objects on the client side and deserializing them on the head/worker.
If my hunch is true, this will be a common issue going forward: any client on a newer patch version will risk causing similar issues.
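For illustration, a minimal sketch (not Ray code) of the failure mode on Python 3.9.7 or newer: ForwardRef.__hash__ reads __forward_module__, so a ForwardRef that is missing the attribute, as when it was reconstructed from a pre-3.9.7 interpreter's pickle, raises exactly the AttributeError above. Deleting the attribute here is just a stand-in for that cross-version round trip:

import typing

# Create a ForwardRef, then strip the attribute that 3.9.7 introduced,
# mimicking an object pickled by an older interpreter and unpickled here.
ref = typing.ForwardRef("SomeType")
object.__delattr__(ref, "__forward_module__")

hash(ref)  # AttributeError: __forward_module__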

Currently, pinning the client's Python version to exactly the same version as the head/worker is the safest bet. This could mean that one would need to build the Ray Docker image oneself.
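Until the images and clients line up, a defensive check one could run by hand after connecting (a sketch, not a built-in Ray feature; the address is a placeholder) to confirm the client and cluster interpreters match down to the patch level:

import sys
import ray

ray.init("ray://raycluster-head-svc:10001")  # placeholder address

@ray.remote
def cluster_python():
    import sys
    return sys.version_info[:3]

remote_version = ray.get(cluster_python.remote())
local_version = sys.version_info[:3]
if remote_version != local_version:
    raise RuntimeError(
        f"Python patch mismatch: client {local_version}, cluster {remote_version}"
    )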

zoltan-fedor commented Aug 9, 2022

"Currently syncing client version to be exactly the same as the head/worker version is the safest bet. This could mean, that one would need to build ray docker image oneself." >> Or alternatively we could get the images built by the Ray team not only with a OLD patch version of Python 3.9 (currently being built with 3.9.5), but also with the LATEST patch version (like 3.9.13 currently). That way anybody would want to stay on the latest patch version of a main version could use that 'latest' image.

But yes, not foolproof solution for sure, as the client would need to upgrade to the 'latest' patch version at the same time as the Ray image is being updated to. That maybe harder than it sounds.

@zhe-thoughts added the P2 and core-client labels and removed the triage label on Oct 26, 2022
@zhe-thoughts self-assigned this on Oct 26, 2022
@zhe-thoughts

Gonna put this as P2 since the particular issue is mitigated.

skabbit commented Jan 20, 2023

Confirming that I have a similar issue with 3.9.8 on the client and 3.9.6 on the cluster when using read_parquet().
