REF: use public pandas API in dataframe.empty #571

Merged: 3 commits, Mar 16, 2021

42 changes: 22 additions & 20 deletions fastparquet/dataframe.py
@@ -163,31 +163,33 @@ def set_cats(values, i=i, col=col, **kwargs):

axes = [df._data.axes[0], index]

# allocate and create blocks
blocks = []
for block in df._data.blocks:
if block.is_categorical:
categories = block.values.categories
code = np.zeros(shape=size, dtype=block.values.codes.dtype)
values = Categorical(values=code, categories=categories,
# Patch our blocks with desired-length arrays. Kids: don't try this at home.
mgr = df._data
for block in mgr.blocks:
bvalues = block.values
shape = list(bvalues.shape)
shape[-1] = size

if isinstance(bvalues, Categorical):
categories = bvalues.categories
code = np.zeros(shape=shape, dtype=bvalues.codes.dtype)

values = Categorical(values=code, dtype=bvalues.dtype,
fastpath=True)
new_block = block.make_block_same_class(values=values)
elif getattr(block.dtype, 'tz', None):
new_shape = (size, )
values = np.empty(shape=new_shape, dtype='M8[ns]')
new_block = block.make_block_same_class(
type(block.values)(values, dtype=block.values.dtype)
)

elif getattr(bvalues.dtype, 'tz', None):
values = np.empty(shape=shape, dtype='M8[ns]')
values = type(bvalues)(values, dtype=bvalues.dtype)
else:
new_shape = (block.values.shape[0], size)
values = np.empty(shape=new_shape, dtype=block.values.dtype)
new_block = block.make_block_same_class(values=values)
# Note: this will break on any ExtensionDtype other than
# Categorical and DatetimeTZ
values = np.empty(shape=shape, dtype=bvalues.dtype)
Author

@jbrockmendel, May 7, 2021

I think the fix for pandas-dev/pandas#41366 (comment) is to add the following here:

            if not isinstance(bvalues, np.ndarray):
                # e.g. DatetimeLikeBlock backed by DatetimeArray/TimedeltaArray
                values = type(bvalues)._from_sequence(values)

I'll try to reproduce the problem locally after some caffeine.

Member

Thanks for looking into it. Happy to push a point release to fix pandas' CI, if needed.

I hope the line above doesn't make a copy!

Author

If it's a 3rd-party EA then there's no telling, but for the dt64 case this should be copy-free.
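
For anyone who wants to check the dt64 claim on their own pandas version, here is a minimal sketch. It assumes `pd.arrays.DatetimeArray` stands in for `type(bvalues)` and uses the private `_from_sequence` constructor quoted above, whose copy behaviour is not a documented guarantee. The helper name `wraps_without_copy` is made up for illustration.

```python
import numpy as np
import pandas as pd


def wraps_without_copy(size: int = 4) -> bool:
    """Report whether wrapping an M8[ns] buffer in a DatetimeArray shares memory."""
    # Raw datetime buffer, zero-filled here so the contents are well defined.
    values = np.zeros(size, dtype="M8[ns]")
    # Private constructor from the suggestion above (an assumption that it is
    # the right stand-in for type(bvalues)._from_sequence in the tz-naive case).
    wrapped = pd.arrays.DatetimeArray._from_sequence(values)
    # np.asarray exposes the array's data as an ndarray; shares_memory then
    # tells us whether that data overlaps the original buffer.
    return bool(np.shares_memory(values, np.asarray(wrapped)))


print(wraps_without_copy())
```

The result may differ between pandas releases, so treat it as a local check rather than a guarantee.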

Author

Reproduced the failure locally, and the edit above fixes 2 of the 8 tests. The others look pytz-related.

Member

If it's a copy, fastparquet will be filling out an array that's no longer connected to the actual dataframe storage?

Author

> If it's a copy, fastparquet will be filling out an array that's no longer connected to the actual dataframe storage?

No, correctness wouldn't be affected.
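
One possible reading of this answer, assuming (as the `# create views` loop in the diff suggests) that views are taken from `block.values` only after the new array has been assigned: a copy made while wrapping would leave the scratch buffer orphaned, but the views would still point at whatever ended up stored. The sketch below is dependency-free and uses the standard library's `array` and `memoryview` as stand-ins for the block values and the views. The function name and the `copy_on_wrap` flag are hypothetical.

```python
from array import array


def allocate_store_view(size, copy_on_wrap):
    """Allocate a buffer, 'wrap' it, store it, then view what was stored."""
    scratch = array("q", [0] * size)  # stand-in for the np.empty(...) result
    # Stand-in for the wrapping step: either reuse the buffer or copy it.
    stored = array("q", scratch) if copy_on_wrap else scratch
    # The view is taken from the stored object, not from the scratch buffer.
    view = memoryview(stored)
    return scratch, stored, view


for copy_on_wrap in (False, True):
    scratch, stored, view = allocate_store_view(3, copy_on_wrap)
    for i in range(len(view)):
        view[i] = i + 1  # fill through the view, as a reader would
    print(copy_on_wrap, stored.tolist(), scratch.tolist())
    view.release()
```

In both cases the stored data receives the writes. Only the scratch buffer is left untouched when a copy was made, which costs an extra allocation but not correctness.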


blocks.append(new_block)
block.values = values

# create block manager
df = DataFrame(BlockManager(blocks, axes))
mgr.axes[-1] = index

# create block manager
# create views
for block in df._data.blocks:
dtype = block.dtype