[Core] Observing Multiple Exceptions When Using Different Python Patch Versions #26443
Comments
Hi,
Yeah, it seems like there are some compatibility issues with different Python versions. We just had to experiment with different ones until it worked... Not a great solution, so hopefully this gets some attention and maybe gets fixed soon 🤞
It seems that in this particular case the problem is caused by bpo-41249 and this particular commit, python/cpython@fa674bd, which was released in 3.9.7, where attribute [...]. It seems that Ray is serializing (pickling?) Python objects on the client side and deserializing them on the head/worker. Currently, syncing the client version to be exactly the same as the head/worker version is the safest bet. This could mean that one would need to build the Ray Docker image oneself.
"Currently, syncing the client version to be exactly the same as the head/worker version is the safest bet. This could mean that one would need to build the Ray Docker image oneself." >> Or alternatively, the Ray team could build the images not only with an OLD patch version of Python 3.9 (currently 3.9.5), but also with the LATEST patch version (currently 3.9.13). That way, anybody who wants to stay on the latest patch version of a given minor version could use that 'latest' image. But yes, it is not a foolproof solution, as the client would need to upgrade to the 'latest' patch version at the same time as the Ray image is updated. That may be harder than it sounds.
Gonna put this as P2 since the particular issue is mitigated.
I can confirm that I have a similar issue with 3.9.8 on the client and 3.9.6 on the cluster when using read_parquet().
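Since the comments above point to a client/cluster Python patch mismatch as the trigger, a small pre-flight check may help catch it early. The following is only a sketch: the helper names (`parse_version`, `describe_mismatch`, `cluster_python_version`) are made up for illustration and are not part of Ray. The remote lookup assumes an already connected Ray session, and it may itself fail if the mismatch already breaks task submission, in which case the cluster version has to be read from the image tag or the pod instead.

```python
import platform
import re
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '3.9.13'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def describe_mismatch(client: str, cluster: str) -> Optional[str]:
    """Return a warning message if the two versions differ, else None."""
    c, s = parse_version(client), parse_version(cluster)
    if c == s:
        return None
    # Same major.minor but different patch is the case reported in this issue.
    level = "patch" if c[:2] == s[:2] else "major/minor"
    return f"Python {level} version mismatch: client {client}, cluster {cluster}"


def cluster_python_version() -> str:
    """Ask a worker for its Python version (assumes ray.init() was already called)."""
    import ray  # imported lazily so the helpers above work without Ray installed

    @ray.remote
    def _version() -> str:
        return platform.python_version()

    return ray.get(_version.remote())
```

On the client, something like `describe_mismatch(platform.python_version(), cluster_python_version())` could be called right after connecting, and the result logged or raised before any real work is submitted.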
What happened + What you expected to happen
I am using a Ray cluster running on Kubernetes (managed by the KubeRay operator). The head and worker pods are using the official rayproject/ray:1.13.1-py39 Docker image. The Ray client is started on another pod, which runs the Jupyter Docker Stacks image and provides a JupyterLab server, also with Python 3.9. I'm running into quite a few scheduler exceptions when trying to run examples from the Ray docs. At first I thought the cause might be that the Ray client is on a separate machine from the Ray head, or that the versions of the Ray Python library differ, but after some testing, connecting with clients running different patch versions of Python, it seems the issue is caused by the Python patch version. For context, everything works fine when connecting with a client using Python 3.9.5 or 3.9.6 (3.9.5 is what the Ray head and worker pods are using), but I start seeing bugs when using a client with Python 3.9.13. I haven't tested which versions between those work or don't work, but I tried the different versions on multiple client machines (running both macOS and Linux) with the same results. See below for two of the bugs I was getting:
Trying to run a very simple example from the docs using ray.data.range always gives me the same error on a client running 3.9.13:
outputs the following error:
Running a more complex NYC Taxi example from the Ray docs:
Gives the following error:
There were other similar issues as well, but I imagine they are all caused by the same thing. To work around this for now, I am going to try building the Ray worker/head image with Python 3.9.13 and see if that works. If it doesn't, I will switch to using 3.9.6 on the JupyterLab server image. But I wanted to get this on the radar in case anyone else is running into similar issues.
Versions / Dependencies
Ray Head/Worker
OS:
Python:
Ray:
$ pip list | grep ray
ray 1.13.0
Client
OS:
Python:
Ray:
$ pip list | grep ray
ray 1.13.0
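For anyone comparing environments the same way, the OS, Python and Ray fields above can be collected with a short script run on both the client and a head/worker pod. This is a minimal sketch using only the standard library; `package_version` and `environment_report` are illustrative names, not Ray utilities, and the Ray entry is simply `None` when the package is not installed.

```python
import platform
from importlib import metadata
from typing import Dict, Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> Dict[str, Optional[str]]:
    """Collect the fields asked for in the Versions / Dependencies section."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),  # full version, including patch level
        "ray": package_version("ray"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Diffing the two outputs makes a patch-level difference such as 3.9.5 versus 3.9.13 easy to spot, which a check of only the minor version (3.9) would miss.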
Reproduction script
To reproduce:
Issue Severity
Medium: It is a significant difficulty but I can work around it.