Pandas 1.3.0 compatibility #48

Closed
jrbourbeau opened this issue Apr 2, 2021 · 1 comment · Fixed by #49

Comments

@jrbourbeau
Member

The partd test suite fails with the nightly version of pandas, which you can install with

python -m pip install --no-deps --pre -i https://pypi.anaconda.org/scipy-wheels-nightly/simple pandas

I've included the test failure tracebacks below.

Test failures:
================================================================================== FAILURES ===================================================================================
____________________________________________________________________________ test_serialize[base0] ____________________________________________________________________________

base = Timestamp('1987-03-03 01:01:01+0001', tz='pytz.FixedOffset(1)')

    @pytest.mark.parametrize('base', [
        pd.Timestamp('1987-03-3T01:01:01+0001'),
        pd.Timestamp('1987-03-03 01:01:01-0600', tz='US/Central'),
    ])
    def test_serialize(base):
        df = pd.DataFrame({'x': [
            base + pd.Timedelta(seconds=i)
            for i in np.random.randint(0, 1000, size=10)],
                           'y': list(range(10)),
                           'z': pd.date_range('2017', periods=10)})
>       df2 = deserialize(serialize(df))

partd/tests/test_pandas.py:110:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
partd/pandas.py:175: in serialize
    h, b = block_to_header_bytes(block)
partd/pandas.py:141: in block_to_header_bytes
    bytes = pnp.compress(pnp.serialize(values), values.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = <DatetimeArray>
[
['2017-01-01 00:00:00', '2017-01-02 00:00:00', '2017-01-03 00:00:00',
 '2017-01-04 00:00:00', '2017-...0:00:00', '2017-01-08 00:00:00', '2017-01-09 00:00:00',
 '2017-01-10 00:00:00']
]
Shape: (1, 10), dtype: datetime64[ns]

    def serialize(x):
        if x.dtype == 'O':
            l = x.flatten().tolist()
            with ignoring(Exception):  # Try msgpack (faster on strings)
                return frame(msgpack.packb(l, use_bin_type=True))
            return frame(pickle.dumps(l, protocol=pickle.HIGHEST_PROTOCOL))
        else:
>           return x.tobytes()
E           AttributeError: 'DatetimeArray' object has no attribute 'tobytes'

partd/numpy.py:101: AttributeError
____________________________________ test_serialize[base1] ____________________________________

base = Timestamp('1987-03-03 01:01:01-0600', tz='US/Central')

(The traceback for this parametrization is identical to the base0 failure above, ending in the same error at partd/numpy.py:101: AttributeError: 'DatetimeArray' object has no attribute 'tobytes'.)

From digging around a bit, it looks like we've been relying on some pandas internals, specifically DataFrame._data.blocks, which have changed upstream (xref pandas-dev/pandas#39146).
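
To illustrate, here's a minimal sketch of the internals involved (note that _data/_mgr are pandas-private, so the exact behavior depends on the pandas version):

    import pandas as pd

    df = pd.DataFrame({'z': pd.date_range('2017', periods=10)})

    # partd walks the blocks of the internal block manager and serializes
    # each block's values. On the pandas nightlies, datetime blocks hold a
    # DatetimeArray instead of a datetime64[ns] ndarray, so ndarray-only
    # methods like .tobytes() are gone:
    block = df._data.blocks[0]
    type(block.values)       # DatetimeArray on the 1.3 nightlies
    block.values.tobytes()   # AttributeError: ... has no attribute 'tobytes'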

cc @jorisvandenbossche in case you have any thoughts. I've started making a few changes locally, similar to dask/dask#7318, but haven't gotten things fully working yet.
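
A sketch of one possible direction (hypothetical helper name, not necessarily what #49 ends up doing): coerce non-ndarray block values to a plain ndarray before the byte-level serialization:

    import numpy as np

    def as_serializable(values):
        # Hypothetical helper: partd/numpy.py's serialize() ultimately calls
        # .tobytes(), which only a plain np.ndarray has. Coerce tz-naive
        # DatetimeArray values back to a datetime64[ns] ndarray first
        # (tz-aware data would still need separate handling).
        if not isinstance(values, np.ndarray):
            values = np.asarray(values)
        return values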

@jorisvandenbossche
Member

So what changed here in pandas is that, for datetime data, Block.values is no longer a datetime64[ns] np.ndarray but a pandas DatetimeArray. The only problem with that is that the dtype attribute of the DatetimeArray class is still a datetime64[ns] np.dtype, so the is_extension_array_dtype check doesn't work as expected:

partd/pandas.py, lines 132 to 135 at 4621f94:

    elif is_extension_array_dtype(block.dtype):
        extension = ("other", ())
    else:
        extension = ('numpy_type', ())

I'm not immediately sure what the best (public) way to check this is instead. Probably checking directly whether block.values is an ExtensionArray instance.
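
A quick sketch of what that could look like (the block access via _data is pandas-internal and shown here only to demonstrate the behavior):

    import pandas as pd
    from pandas.api.types import is_extension_array_dtype

    df = pd.DataFrame({'z': pd.date_range('2017', periods=10)})
    block = df._data.blocks[0]

    # The dtype-based check misses tz-naive datetime blocks, because a
    # DatetimeArray's dtype is still the plain numpy datetime64[ns] dtype:
    is_extension_array_dtype(block.dtype)   # False

    # Checking the values themselves does what we want:
    isinstance(block.values, pd.api.extensions.ExtensionArray)  # True on the nightlies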
