Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: canonicalization of chunked ExtensionArray #499

Merged
merged 10 commits into from
Jul 24, 2024

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Jul 22, 2024

Since #346, we've been canonicalizing chunked ExtensionArray's incorrectly. Rather than unwrapping each chunk to its backing storage array we've just been building a new chunkedarray of the ExtensionArray chunks.

Before #480 , this was actually causing DateTimePartsCompressor to fail in can_compress, so we weren't going down the compression codepath.

The Fix

Fix into_canonical so that when we encounter a ChunkedArray(ExtensionArray(storage)) to yield an ExtensionArray(ChunkedArray(storage)) where each chunk is canonicalized first (e.g. if you have chunks of DateTimePartsArray you will end up with chunks of ExtensionArray(i64)).

Fix the DateTimePartsCompressor to canonicalize its input, to handle that case where it may be a ChunkedArray.

@a10y a10y force-pushed the aduffy/compress-ext-chunks branch from 991a48e to ef6e6d9 Compare July 23, 2024 19:38
@a10y a10y marked this pull request as ready for review July 23, 2024 19:47
@a10y a10y changed the title Fix some things with extension arrays fix: canonicalization of chunked ExtensionArray Jul 23, 2024
encodings/datetime-parts/src/array.rs Outdated Show resolved Hide resolved
vortex-array/src/array/chunked/canonical.rs Show resolved Hide resolved
@@ -28,6 +28,10 @@ pub(crate) fn try_canonicalize_chunks(
chunks: Vec<Array>,
dtype: &DType,
) -> VortexResult<Canonical> {
if chunks.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add a FLUP to create Canonical::empty(dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we end up redesigning extension arrays to contain their storage_dtype as part of the type definition, i'm not sure we'd need it

vortex-array/src/array/chunked/canonical.rs Show resolved Hide resolved
@a10y
Copy link
Contributor Author

a10y commented Jul 24, 2024

Per discussion on Slack yesterday, I updated the branch to have canonicalization behavior Chunked(Extension(storage)) -> Extension(Chunked(storage)). As part of that, I added call to .into_canonical() for the DateTimePartsCompressor.

@robert3005
Copy link
Member

Should change the commit message since the behaviour is different than what we started with

@a10y a10y merged commit 71e446f into develop Jul 24, 2024
2 checks passed
@a10y a10y deleted the aduffy/compress-ext-chunks branch July 24, 2024 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants