
Performance: speeding up pickling of cftime arrays #253

Open
aulemahal opened this issue Aug 10, 2021 · 7 comments

@aulemahal

I used cftime 1.4.1 and 1.5.0 when exploring this.

My workflows involve large datasets and complex functions. I use xarray, backed by dask. In one of the more complex processing steps, I use xarray's map_blocks and a handful of other dask-lazy methods on a large dataset that uses the NoLeap calendar. The dataset is large, with 950 chunks and a 55114-element time coordinate, and it seems a lot of time is spent pickling the latter.

More precisely, this line of dask: https://github.com/dask/dask/blob/1c4a84225d1bd26e58d716d2844190cc23ebcfec/dask/base.py#L1028 calls pickle.dumps on the object-dtype numpy array that stores the cftime.datetime objects.

When profiling the graph creation (no computation triggered yet), I can see that this step is the one that takes the most time, slightly more than another function involved in xarray's CFTimeIndex creation.

MWE:

import pickle
import numpy as np
import pandas as pd
import xarray as xr


cft = xr.cftime_range('1950-01-01', '2100-01-01', freq='D')  # CFTimeIndex of cftime objects
npy = pd.date_range('1950-01-01', '2100-01-01', freq='D')  # same length, but numpy datetime64 values
oar = np.array([1] * npy.size, dtype='O')  # sanity check: object-dtype array with builtin element type

timeit calls in a notebook:
[screenshot of %timeit results for the three arrays]
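The calls behind the screenshot were presumably along these lines (a reconstruction, exact numbers not reproduced; np.asarray is used here to compare the underlying arrays rather than the index objects):

%timeit pickle.dumps(np.asarray(cft))  # object array of cftime datetimes -- slowest
%timeit pickle.dumps(np.asarray(npy))  # datetime64 array -- fast
%timeit pickle.dumps(oar)              # object array of plain ints -- the baseline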

So even if it is expected that pickling an object array is slower, the cftime array is still two orders of magnitude slower than a basic object array. I am not very knowledgeable about how pickle works, but I believe something could be done to speed this up.

Any ideas?

@spencerkclark
Collaborator

Very interesting. I agree it seems there is significant room for improvement. I think we may be able to follow a line of thought similar to the one that led to speedups in the creation of cftime objects (see discussion in pangeo-data/pangeo#764 (comment) and #158).

Pickling in cftime calls the _getstate method on each cftime.datetime instance. The kwargs dictionary it returns still includes the _dayofwk and _dayofyr attributes, which are slow to compute and are not needed to reconstruct a datetime instance. I wonder whether simply deleting those from the kwargs dictionary would improve your performance metrics?

cftime/src/cftime/_cftime.pyx

Lines 1249 to 1250 in 4f28eb6

'dayofwk': self._dayofwk,
'dayofyr': self._dayofyr,

@spencerkclark
Collaborator

Oops, _dayofwk and _dayofyr are the cached values, so they do not trigger computation. Therefore I suspect my proposed strategy will not make a difference.

@spencerkclark
Collaborator

spencerkclark commented Aug 11, 2021

A fairer comparison might be to look at the performance of pickling an array of datetime.datetime objects. There we find that pickling an array of cftime objects is ~3x slower -- still a meaningful difference, but not extreme:

In [1]: import pickle; import pandas as pd; import xarray as xr

In [2]: times = pd.date_range("1950-01-01", "2100-01-01", freq="D")

In [3]: datetimes = times.to_pydatetime()

In [4]: %timeit serialized = pickle.dumps(datetimes)
36.1 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: cftimes = xr.cftime_range("1950-01-01", "2100-01-01", freq="D").values

In [6]: %timeit serialized = pickle.dumps(cftimes)
102 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Unfortunately nothing obvious now sticks out to me to close this gap.

np.datetime64 values are stored in memory as single integers -- I think a np.datetime64 array is basically a view of an integer dtype array -- so it is not a surprise that pickling an array of those is much faster.
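A minimal illustration of that point, using only NumPy -- the datetime64 buffer can be reinterpreted as int64 directly:

import numpy as np

days = np.arange('2000-01-01', '2000-01-11', dtype='datetime64[D]')
print(days.view('int64'))  # the same buffer, seen as integer days since the epoch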

@aulemahal
Author

aulemahal commented Aug 11, 2021

Indeed, I am not surprised that it takes more time than a datetime64 array.

While this might not be the best place, @spencerkclark, you might be able to give an opinion. This problem happens in xarray workflows when using map_blocks because of dask.base.tokenize. The latter tries to obtain a deterministic hash of the array, and as it is an 'O'-dtype ndarray, dask's current approach is to pickle it. Thus, speeding up the pickling would be the most general way of optimizing this, but the use case is rather narrow. Do you think we could add something specific in xarray.map_blocks to work around the performance issue instead? You can look at the xclim issue I linked above for an idea (transform cftime arrays to integers and then back to cftime within the wrapper); a rough sketch is below.
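For concreteness, a minimal sketch of that idea -- the units and calendar here are arbitrary choices for the example, not something prescribed by xarray or xclim:

import cftime
import xarray as xr

calendar = 'noleap'
units = 'days since 1950-01-01'

times = xr.cftime_range('1950-01-01', '2100-01-01', freq='D', calendar=calendar).values
encoded = cftime.date2num(times, units, calendar=calendar)    # numeric ndarray, cheap to tokenize
decoded = cftime.num2date(encoded, units, calendar=calendar)  # cftime objects restored inside the wrapped function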

@spencerkclark
Collaborator

spencerkclark commented Aug 11, 2021

I'm somewhat surprised that encoding a cftime array is faster than pickling it (encoding requires repeated timedelta arithmetic, which is not needed for pickling). Have you done timing experiments to demonstrate this?

Or is the issue that dask.base.tokenize is called many times and it's better to take the up-front cost of converting to integers first?

@aulemahal
Author

Exactly, the difference is in scale. Let's say we have a 3D (spatial + temporal) array divided into 100 spatial chunks (chunks={'time': -1}). Currently a call to map_blocks will create 100 tasks, and for each task the full time coordinate is tokenized. That's 100 calls to pickle.dumps().
When encoding the time coordinate beforehand, that's only one call to cftime.date2num and 100 calls to tokenize on an integer ndarray. Above I calculated that tokenizing a cftime array is 1000x slower than tokenizing an integer array; a small comparison sketch is below.
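Something along these lines, with the same date2num encoding as above (the exact ratio will depend on the machine):

from dask.base import tokenize
import cftime
import xarray as xr

times = xr.cftime_range('1950-01-01', '2100-01-01', freq='D', calendar='noleap').values
ints = cftime.date2num(times, 'days since 1950-01-01', calendar='noleap')

%timeit tokenize(times)  # falls back to pickling the object array
%timeit tokenize(ints)   # hashes the numeric buffer directly, much faster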

@spencerkclark
Collaborator

Ah...that makes perfect sense now, thanks. Indeed it does seem like the optimization might best take place before cftime is involved. If you can put together a simple example that demonstrates this performance bottleneck it might be interesting to get folks' thoughts in an xarray issue.
