[core] out of band serialization exception #47544

Merged

@rkooo567 rkooo567 commented Sep 6, 2024

Why are these changes needed?

This PR is backward compatible. It makes the following changes:

  • Introduce an env var to raise an exception when an object ref is serialized out of band.
  • Improve the error message for out-of-band serialization issues. There are two types of issues: (1) an explicit cloudpickle.dumps(ref) and (2) implicit capture. See below for more details.
  • Update the anti-pattern doc.
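To make the gating idea concrete, here is a minimal stdlib-only sketch. It does not use Ray and is not Ray's implementation: `FakeRef`, `OutOfBandRefError`, `in_band()`, and `try_dumps()` are hypothetical names, and only the env var name is taken from this PR. The assumption is that a ref may be pickled while the framework's own serializer is active (in band) and is rejected otherwise, unless the env var is set to 1.

```python
import contextlib
import os
import pickle
import threading

ENV_VAR = "RAY_allow_out_of_band_object_ref_serialization"  # name from this PR

_state = threading.local()


class OutOfBandRefError(Exception):
    """Hypothetical stand-in for the exception this PR introduces."""


@contextlib.contextmanager
def in_band():
    """Mark the current thread as running the framework's own serializer."""
    _state.active = True
    try:
        yield
    finally:
        _state.active = False


class FakeRef:
    """Hypothetical stand-in for ray.ObjectRef."""

    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __reduce__(self):
        # pickle calls this hook for every FakeRef it meets, including
        # refs nested inside other objects.
        in_band_now = getattr(_state, "active", False)
        allowed = os.environ.get(ENV_VAR) == "1"
        if not in_band_now and not allowed:
            raise OutOfBandRefError(
                f"It is not allowed to serialize FakeRef {self.hex_id}. "
                f"Set {ENV_VAR}=1 to allow it."
            )
        return (FakeRef, (self.hex_id,))


def try_dumps(obj):
    """Return 'ok', or the exception class name if pickling is refused."""
    try:
        pickle.dumps(obj)
        return "ok"
    except OutOfBandRefError as exc:
        return type(exc).__name__
```

In this sketch, `try_dumps(FakeRef("00ff"))` and `try_dumps({"nested": FakeRef("00ff")})` are both refused when the env var is unset, which loosely mirrors the explicit and implicit cases described above.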

cloudpickle.dumps error message

E                       ray.exceptions.RayTaskError(OufOfBandRefSerializationException): ray::f() (pid=61703, ip=127.0.0.1)
E                         File "/Users/sangcho/work/ray/python/ray/tests/test_serialization.py", line 751, in f
E                           cloudpickle.dumps(ray.put(1))
E                         File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
E                           cp.dump(obj)
E                         File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1245, in dump
E                           return super().dump(obj)
E                       ray.exceptions.OufOfBandRefSerializationException: It is not allowed to serialize ray.ObjectRef 00ef45ccd0112571ffffffffffffffffffffffff0100000002e1f505. If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks. See the callsite and trace to find where the serialization occurs.
E                       Callsite: Disabled. Set RAY_record_ref_creation_sites=1

implicit capture error

2024-09-06 17:47:28,806	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
Traceback (most recent call last):
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 73, in pickle_dumps
    return pickle.dumps(obj)
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 152, in object_ref_reducer
    self.add_contained_object_ref(obj, obj.call_site())
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 221, in add_contained_object_ref
    raise ray.exceptions.OufOfBandRefSerializationException(
ray.exceptions.OufOfBandRefSerializationException: It is not allowed to serialize ray.ObjectRef 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505. If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks. See the callsite and trace to find where the serialization occurs.
Callsite: Disabled. Set RAY_record_ref_creation_sites=1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/sangcho/work/ray/a.py", line 11, in <module>
    ray.get(f.remote())
  File "/Users/sangcho/work/ray/python/ray/remote_function.py", line 139, in _remote_proxy
    return self._remote(args=args, kwargs=kwargs, **self._default_options)
  File "/Users/sangcho/work/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/sangcho/work/ray/python/ray/util/tracing/tracing_helper.py", line 310, in _invocation_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/Users/sangcho/work/ray/python/ray/remote_function.py", line 304, in _remote
    self._pickled_function = pickle_dumps(
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 81, in pickle_dumps
    raise ray.exceptions.OufOfBandRefSerializationException(msg)
ray.exceptions.OufOfBandRefSerializationException: Could not serialize the function a.f:
=======================================================
Checking Serializability of <function f at 0x10bceb1f0>
=======================================================
!!! FAIL serialization: It is not allowed to serialize ray.ObjectRef 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505. If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks. See the callsite and trace to find where the serialization occurs.
Callsite: Disabled. Set RAY_record_ref_creation_sites=1
Detected 1 global variables. Checking serializability...
    Serializing 'ref' ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001e1f505)...
    !!! FAIL serialization: It is not allowed to serialize ray.ObjectRef 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505. If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks. See the callsite and trace to find where the serialization occurs.
Callsite: Disabled. Set RAY_record_ref_creation_sites=1
    WARNING: Did not find non-serializable object in ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001e1f505). This may be an oversight.
=======================================================
Variable: 

	FailTuple(ref [obj=ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001e1f505), parent=<function f at 0x10bceb1f0>])

was found to be non-serializable. There may be multiple other undetected variables that were non-serializable. 
Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class. 
=======================================================
Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information.
If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/
=======================================================
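The "Detected 1 global variables" step in the report above can be approximated without Ray. The sketch below is an assumption-laden illustration, not the checker Ray ships: `FakeRef` and `captured_globals_of_type` are hypothetical names, and the scan only looks at names referenced by the function's bytecode.

```python
import types


class FakeRef:
    """Hypothetical stand-in for ray.ObjectRef (not the real class)."""


def captured_globals_of_type(fn, kind):
    """Return the sorted names of module-level globals that fn's code
    references and whose current value is an instance of kind.

    co_names also lists attribute names, so a false positive is possible
    when an attribute shares its name with a global of that type.
    """
    names = set()

    def walk(code):
        names.update(code.co_names)
        # Nested functions and lambdas are stored as code objects in co_consts.
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                walk(const)

    walk(fn.__code__)
    return sorted(
        name for name in names
        if isinstance(fn.__globals__.get(name), kind)
    )


ref = FakeRef()


def bad():
    # Implicit capture: `ref` is resolved from module globals at call time.
    return ref


def good(ref_arg):
    # The ref arrives as an argument, so nothing is captured.
    return ref_arg
```

Here `captured_globals_of_type(bad, FakeRef)` reports `ref`, while `good` reports nothing. Passing the ref explicitly as an argument, as `good` does, is the shape the error message's advice points toward.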

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 requested a review from a team as a code owner September 6, 2024 22:58
"If you set the env var, the object is pinned forever in the "
"lifetime of the worker process and can cause Ray object leaks."
"See the trace to find where the serialization occurs: "
f"{''.join(traceback.format_stack())}"
Contributor

just to confirm, we should be able to tell the containing object from this stack?

Contributor Author

We can't, but we can see which line triggers the serialization. This is the error message with your repro script.

ray::f() (pid=37282, ip=127.0.0.1)
  File "/Users/sangcho/work/ray/a.py", line 20, in f
    return ArrowBlockAccessor.numpy_to_block(batch)
  File "/Users/sangcho/work/ray/python/ray/data/_internal/arrow_block.py", line 249, in numpy_to_block
    col = ArrowPythonObjectArray.from_objects(col)
  File "/Users/sangcho/work/ray/python/ray/air/util/object_extensions/arrow.py", line 100, in from_objects
    dumped_bytes = pickle_dumps(
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
ray._private.serialization.OufOfBandRefSerializationException: It is not allowed to serialize ray.ObjectRef 00d950ec0ccf9d2affffffffffffffffffffffff0100000002e1f505.If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks.See the trace to find where the serialization occurs:   File "/Users/sangcho/work/ray/python/ray/_private/workers/default_worker.py", line 289, in <module>
    worker.main_loop()
  File "/Users/sangcho/work/ray/a.py", line 20, in f
    return ArrowBlockAccessor.numpy_to_block(batch)
  File "/Users/sangcho/work/ray/python/ray/data/_internal/arrow_block.py", line 249, in numpy_to_block
    col = ArrowPythonObjectArray.from_objects(col)
  File "/Users/sangcho/work/ray/python/ray/air/util/object_extensions/arrow.py", line 100, in from_objects
    dumped_bytes = pickle_dumps(
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
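The trace appended to the message above comes from `traceback.format_stack()`, as in the snippet under review. A minimal stdlib sketch of that pattern follows; `SerializationNotAllowed`, `refuse`, `dump_something`, and `user_task` are hypothetical names used only for illustration. It shows why the stack identifies the line that triggered serialization but says nothing about which object contained the ref.

```python
import traceback


class SerializationNotAllowed(Exception):
    """Hypothetical exception used only for this sketch."""


def refuse(ref_hex):
    # format_stack() returns one string per frame, oldest frame first,
    # ending with the frame that called format_stack().
    stack = "".join(traceback.format_stack())
    raise SerializationNotAllowed(
        f"It is not allowed to serialize ref {ref_hex}. "
        "See the trace to find where the serialization occurs:\n"
        f"{stack}"
    )


def dump_something():
    # Stands in for the library call that ends up pickling a ref.
    refuse("00d950ec")


def user_task():
    # Stands in for the user code whose line we want to see in the message.
    dump_something()
```

Calling `user_task()` raises an exception whose message lists the `user_task` frame before the `dump_something` frame, so the user-level call site is visible even when the exception is re-wrapped elsewhere.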

Contributor Author

Also, here is the new error when an object ref is implicitly captured:

2024-09-06 17:47:28,806	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
Traceback (most recent call last):
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 73, in pickle_dumps
    return pickle.dumps(obj)
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/Users/sangcho/work/ray/python/ray/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 152, in object_ref_reducer
    self.add_contained_object_ref(obj, obj.call_site())
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 221, in add_contained_object_ref
    raise ray.exceptions.OufOfBandRefSerializationException(
ray.exceptions.OufOfBandRefSerializationException: It is not allowed to serialize ray.ObjectRef 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505. If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks. See the callsite and trace to find where the serialization occurs.
Callsite: Disabled. Set RAY_record_ref_creation_sites=1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/sangcho/work/ray/a.py", line 11, in <module>
    ray.get(f.remote())
  File "/Users/sangcho/work/ray/python/ray/remote_function.py", line 139, in _remote_proxy
    return self._remote(args=args, kwargs=kwargs, **self._default_options)
  File "/Users/sangcho/work/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/sangcho/work/ray/python/ray/util/tracing/tracing_helper.py", line 310, in _invocation_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/Users/sangcho/work/ray/python/ray/remote_function.py", line 304, in _remote
    self._pickled_function = pickle_dumps(
  File "/Users/sangcho/work/ray/python/ray/_private/serialization.py", line 81, in pickle_dumps
    raise ray.exceptions.OufOfBandRefSerializationException(msg)
ray.exceptions.OufOfBandRefSerializationException: Could not serialize the function a.f:
=======================================================
Checking Serializability of <function f at 0x10bceb1f0>
=======================================================
!!! FAIL serialization: It is not allowed to serialize ray.ObjectRef 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505. If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks. See the callsite and trace to find where the serialization occurs.
Callsite: Disabled. Set RAY_record_ref_creation_sites=1
Detected 1 global variables. Checking serializability...
    Serializing 'ref' ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001e1f505)...
    !!! FAIL serialization: It is not allowed to serialize ray.ObjectRef 00ffffffffffffffffffffffffffffffffffffff0100000001e1f505. If you want to allow serialization, set `RAY_allow_out_of_band_object_ref_serialization=1.` If you set the env var, the object is pinned forever in the lifetime of the worker process and can cause Ray object leaks. See the callsite and trace to find where the serialization occurs.
Callsite: Disabled. Set RAY_record_ref_creation_sites=1
    WARNING: Did not find non-serializable object in ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001e1f505). This may be an oversight.
=======================================================
Variable: 

	FailTuple(ref [obj=ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000001e1f505), parent=<function f at 0x10bceb1f0>])

was found to be non-serializable. There may be multiple other undetected variables that were non-serializable. 
Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class. 
=======================================================
Check https://docs.ray.io/en/master/ray-core/objects/serialization.html#troubleshooting for more information.
If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/
=======================================================

@rkooo567 rkooo567 assigned rkooo567 and unassigned jjyao, raulchen and angelinalg Sep 9, 2024
Signed-off-by: Hao Chen <chenh1024@gmail.com>
python/ray/_private/serialization.py (outdated)
@@ -127,7 +136,7 @@ def actor_handle_reducer(obj):
     serialized, actor_handle_id, weak_ref = obj._serialization_helper()
     # Update ref counting for the actor handle
     if not weak_ref:
-        self.add_contained_object_ref(actor_handle_id)
+        self.add_contained_object_ref(actor_handle_id, True)
Collaborator

Why is it always True in this case?

Contributor Author

Without it, many tests crash right now. I also think the leak is probably minimal for actor handles. I will add comments.

Contributor Author

(it doesn't leak actual actors)
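The always-`True` second argument discussed in this thread can be read as a per-call override of the env var check. The sketch below illustrates that reading only: `SketchContext`, `add_contained_ref`, `OutOfBandError`, and the `pinned` list are hypothetical, and the real method in `serialization.py` may differ in signature and behavior. Only the env var name comes from this PR.

```python
import os

ENV_VAR = "RAY_allow_out_of_band_object_ref_serialization"  # name from this PR


class OutOfBandError(Exception):
    """Hypothetical stand-in for the exception this PR raises."""


class SketchContext:
    """Hypothetical, simplified serialization context (not Ray's class)."""

    def __init__(self, in_band=False):
        self.in_band = in_band
        self.contained = []  # refs tracked for normal ref counting
        self.pinned = []     # refs kept alive because they left out of band

    def add_contained_ref(self, ref_id, allow_out_of_band=False):
        if self.in_band:
            # In band: the framework sees the ref and can count it.
            self.contained.append(ref_id)
            return
        env_allows = os.environ.get(ENV_VAR) == "1"
        if not (allow_out_of_band or env_allows):
            raise OutOfBandError(f"cannot serialize {ref_id} out of band")
        # Out of band but allowed: the ref is pinned, which is the
        # leak the error message warns about.
        self.pinned.append(ref_id)
```

With this shape, an actor handle reducer that always passes `True` never raises, while plain object refs still depend on the env var.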

@rkooo567 rkooo567 enabled auto-merge (squash) September 12, 2024 23:47
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 12, 2024
@rkooo567 rkooo567 enabled auto-merge (squash) September 14, 2024 08:11
@rkooo567
Contributor Author

This needs approval from the data team. cc @raulchen

@rkooo567 rkooo567 merged commit d07975e into ray-project:master Sep 14, 2024
6 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Introduce an env var to raise an exception when an object ref is serialized out of band.
Improve the error message for out-of-band serialization issues. There are two types of issues: (1) an explicit cloudpickle.dumps(ref) and (2) implicit capture. See below for more details.
Update the anti-pattern doc.

Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>