shallow copies become deep copies when pickling #1058
Comments
The plan is to stop making default indexes with … I'm not confident that your workaround will work properly. At the very least, you should check … If it would really help, I'm open to making …
If I'm understanding you correctly @crusaderky, I think this is a tough problem, and one much broader than xarray. When pickling something with a reference, do you want to save the object, or the reference? If you pickle the reference, how can you guarantee the object will be available when unpickling? How would you codify the reference (a memory location?)? Is that right, or am I misunderstanding your problem? On this narrow case, I think not having indexes at all should solve it, though.
@MaximilianR, if you pickle two plain Python objects A and B together, and one of the attributes of B is a reference to A, A does not get duplicated. In this case there must be some specific `__getstate__` code that prevents this, and/or something in the C implementation of the class.
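The reference-sharing behaviour for plain Python objects is easy to verify; pickle's memo table serializes each object only once within a single dump (a minimal sketch):

```python
import pickle

a = list(range(5))
b = {'ref': a}  # b holds a reference to a

# Pickling A and B together preserves the shared reference:
a2, b2 = pickle.loads(pickle.dumps((a, b)))
print(b2['ref'] is a2)  # True: a was not duplicated
```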
@crusaderky Right, I see. All those views are in the same pickle object, and so shouldn't be duplicated. That is frustrating. As per @shoyer, the easiest fix is to not have the data in the first place, so not needing indexes at all should solve your case.
I answered the StackOverflow question: … This was a tricky puzzle to figure out!
Confirmed that #1017 fixes my specific issue, thanks! |
I think #1128 fixes this about as well as we can hope, given how pickle works for NumPy. So I'm closing this now, but feel free to open another issue for any follow-up concerns.
Actually, I am very much still facing this problem.
What `broadcast()` does is transform the scalar array into a numpy array of 2**19 elements. This is actually a view on the original 0-d array, so it has negligible RAM requirements. But after pickling and unpickling, it has become a real 2**19-element array. Add up a few hundred of them, and I am facing GBs of wasted RAM. A solution would be to change `broadcast()` to convert to dask before broadcasting, and then broadcast directly to the proper chunk size.
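A plain-numpy sketch of the effect described above, with `np.broadcast_to` standing in for the broadcasting xarray performs internally (sizes are illustrative):

```python
import pickle
import numpy as np

scalar = np.array(1.0)                    # 0-d array: 8 bytes of data
view = np.broadcast_to(scalar, (2**19,))  # stride-0 view: still 8 bytes of real data
print(view.strides)                       # (0,)

# Round-tripping through pickle materializes the full buffer:
print(len(pickle.dumps(view)))            # ~4 MB
restored = pickle.loads(pickle.dumps(view))
print(restored.strides)                   # (8,): a real 2**19-element array
```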
@crusaderky Yes, I think it could be reasonable to unify array types when you call `broadcast()`. If your scalar array is the result of an expensive dask calculation, this also might be a good use case for dask's new …
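A sketch of what converting to dask before broadcasting could look like with the public xarray API (the shapes and chunk size are illustrative, not from the original thread):

```python
import pickle
import numpy as np
import xarray as xr

scalar = xr.DataArray(1.0)                       # 0-d array
big = xr.DataArray(np.zeros(2**19), dims=['x'])

# Eager broadcast: the scalar becomes a stride-0 numpy view,
# which pickles as a full ~4 MB copy.
eager, _ = xr.broadcast(scalar, big)
print(len(pickle.dumps(eager)))

# Convert to dask first: the broadcast stays lazy, and pickle
# stores only the small task graph plus the 8-byte source.
lazy, _ = xr.broadcast(scalar.chunk(), big.chunk({'x': 2**19}))
print(len(pickle.dumps(lazy)))
```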
Alternatively, it could make sense to change pickling upstream in NumPy to special-case arrays with a stride of 0 along some dimension.
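That kind of special-casing can be prototyped outside NumPy with `copyreg`; a minimal sketch, assuming the view starts at the first element of its base array (which holds for simple broadcasts):

```python
import copyreg
import pickle
import numpy as np

def _rebuild_stride0(base, shape, strides):
    # Recreate the broadcast view on top of the small base array.
    # Treat the result as read-only: writes would alias every element.
    return np.lib.stride_tricks.as_strided(base, shape=shape, strides=strides)

def _reduce_ndarray(arr):
    if arr.base is not None and 0 in arr.strides:
        # Serialize only the small base plus the view metadata.
        return _rebuild_stride0, (np.ascontiguousarray(arr.base), arr.shape, arr.strides)
    return arr.__reduce__()  # fall back to NumPy's default behaviour

copyreg.pickle(np.ndarray, _reduce_ndarray)

view = np.broadcast_to(np.array(1.0), (2**19,))
print(len(pickle.dumps(view)))  # now a few hundred bytes instead of ~4 MB
```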
Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays.
This design fails when the object is pickled.
Whenever a numpy view is pickled, it becomes a regular array:
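A minimal reproduction (the array size and dtype are illustrative):

```python
import pickle
import numpy as np

base = np.arange(500_000, dtype='int32')  # ~2 MB
view = base[:]                            # shallow copy: a view, no data duplicated
print(view.base is base)                  # True
print(len(pickle.dumps(view)))            # ~2 MB: the view pickles as a full copy
```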
This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so the coord is auto-assigned by xarray as an incremental integer.
Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I also have to dump all intermediate steps for audit purposes. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2 MB worth of coord, then creates 3000 views of it, which the moment they're pickled expand into 3000 independent copies, several GBs in total.
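Scaled down to 100 intermediate steps, the blow-up looks roughly like this (plain numpy standing in for the xarray coords):

```python
import pickle
import numpy as np

coord = np.arange(500_000, dtype='int32')   # the shared ~2 MB index
views = [coord[:] for _ in range(100)]      # 100 shallow copies: no extra RAM

# Pickled together, every view serializes as an independent full copy:
print(len(pickle.dumps(views)) / 2**20)     # ~200 MB; with 3000 views it's GBs
```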
I see a few possible solutions to this:
I implemented (5) as a workaround in my `__getstate__` method.
Before:
Workaround:
After:
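For illustration, a minimal hypothetical sketch of one way such a `__getstate__` workaround could look, assuming the idea is to drop auto-generated default-range coords before pickling so they aren't serialized as deep copies (the `AuditStep` wrapper and its logic are made up, not the original code):

```python
import numpy as np
import xarray as xr

class AuditStep:
    """Hypothetical wrapper around one intermediate result."""

    def __init__(self, data: xr.DataArray):
        self.data = data

    def __getstate__(self):
        # Drop coords that are just the default 0..n-1 range: they can be
        # regenerated cheaply instead of being pickled as full deep copies.
        default = [
            name for name, coord in self.data.coords.items()
            if coord.dims == (name,)
            and np.array_equal(coord.values, np.arange(coord.size))
        ]
        return {'data': self.data.drop_vars(default)}

    def __setstate__(self, state):
        self.data = state['data']
```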