Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove AsContiguousFn #346

Merged
merged 14 commits into from
Jun 11, 2024
Merged

Remove AsContiguousFn #346

merged 14 commits into from
Jun 11, 2024

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Jun 10, 2024

Per the comments on #341 , this PR removes the AsContiguousFn trait and as_contiguous function entirely, and instead pushes the relevant details into the implementation of ArrayFlatten::flatten for the ChunkedArray type.

A few questions I've bumped into while doing this refactor:

  • Can DType::List only reference primitive types? I see ListScalar, but it’s not clear to me how Vortex would encode e.g. List<Struct> or List<List>
    • Answer: Deferring for now as List DTypes not fully support yet anyway
  • How should this work for ExtensionArray?
    • Answer: ExtensionArray can contain a ChunkedArray for its internal storage Array

vortex-array/src/array/chunked/flatten.rs Outdated Show resolved Hide resolved
vortex-array/src/array/chunked/flatten.rs Show resolved Hide resolved
vortex-array/src/array/chunked/flatten.rs Outdated Show resolved Hide resolved
vortex-array/src/array/chunked/flatten.rs Outdated Show resolved Hide resolved
for (field_idx, field_dtype) in field_dtypes.iter().enumerate() {
let mut field_chunks = Vec::new();
for chunk in &chunks {
field_chunks.push(chunk.field(field_idx).expect("structarray should contain field"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expect is only really valid if you assert all dtypes are equal at the start of this function. Maybe you check that elsewhere, in which case just document the assumed precondition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by document do you mean in the expect() string, or document as a comment and then call unwrap instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assumed you meant the former since it seems there's a general preference for expect over unwrap, in which case this is done

vortex-array/src/array/chunked/flatten.rs Outdated Show resolved Hide resolved
vortex-array/src/array/chunked/flatten.rs Show resolved Hide resolved
vortex-array/src/array/chunked/flatten.rs Outdated Show resolved Hide resolved
vortex-array/src/array/chunked/mod.rs Outdated Show resolved Hide resolved
// Else, construct the boolean buffer
let mut buffer = BooleanBufferBuilder::new(validities.iter().map(|v| v.len()).sum());
for validity in validities {
let present = validity.to_present_null_buffer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not a fan of this to_present_null_buffer function, it can very easily lead to unnecessary allocations. Can just match on the returned optional instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so if I remove the use of to_present_null_buffer, I pretty much have to write

            let present = match validity {
                LogicalValidity::AllValid(count) => {
                    BooleanBuffer::new_set(count)
                }
                LogicalValidity::AllInvalid(count) => {
                    BooleanBuffer::new_unset(count)
                }
                LogicalValidity::Array(array) => {
                    array.into_array().flatten_bool().expect("validity must flatten to BoolArray").boolean_buffer()
                }
            };
            buffer.append_buffer(&present);

Which is basically identical to the implementation of to_present_null_buffer, except without the wrapping NullBuffer. I'm not sure where the unnecessary allocations happen

@CLAassistant
Copy link

CLAassistant commented Jun 11, 2024

CLA assistant check
All committers have signed the CLA.

@@ -29,6 +30,11 @@ impl EncodingCompression for FoREncoding {
return None;
}

// For all-null, cannot encode.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, arrays that have all-nulls would fail on the compute_as_cast a few lines below

@a10y a10y changed the title [WIP] Replace AsContiguous Remove AsContiguousFn Jun 11, 2024
@a10y
Copy link
Contributor Author

a10y commented Jun 11, 2024

Just pushed one more patch

  • Fix SparseArray's flatten implementation. Previously it only worked for primitive types, but SparseArray's logicalvalidity is a bool sparsearray. Flattening on that works as expected now
  • Add AsArrowArray impl for ConstantArray so that we can send a flattened null ConstantArray to arrow like the other Flattened variants

@@ -11,6 +12,7 @@ use crate::{Array, IntoArray};

/// The set of encodings that can be converted to Arrow with zero-copy.
pub enum Flattened {
Null(ConstantArray),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for now, we may just want a FLUP issue to add a NullArray so we're 1:1 with Arrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: #348

@a10y a10y merged commit e8e5557 into aduffy/fix-bench Jun 11, 2024
2 checks passed
@a10y a10y deleted the aduffy/contiguous-af branch June 11, 2024 17:46
a10y added a commit that referenced this pull request Jul 24, 2024
Since #346, we've been canonicalizing chunked ExtensionArray's
incorrectly. Rather than unwrapping each chunk to its backing storage
array we've just been building a new chunkedarray of the ExtensionArray
chunks.

Before #480 , this was actually causing DateTimePartsCompressor to fail
in can_compress, so we weren't going down the compression codepath.

## The Fix

Fix `into_canonical` so that when we encounter a
`ChunkedArray(ExtensionArray(storage))` to yield an
`ExtensionArray(ChunkedArray(storage))` where each chunk is
canonicalized first (e.g. if you have chunks of DateTimePartsArray you
will end up with chunks of ExtensionArray(i64)).

Fix the DateTimePartsCompressor to canonicalize its input, to handle
that case where it may be a ChunkedArray.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants