Support Parquet `BYTE_STREAM_SPLIT` for INT32, INT64, and FIXED_LEN_BYTE_ARRAY primitive types #6159

etseidl · 2024-07-30T17:56:20Z

Which issue does this PR close?

Closes #6048.

Rationale for this change

BYTE_STREAM_SPLIT encoding was recently expanded to include all fixed-width primitive types (primarily to support the Float16 logical type, but it has been found to be beneficial for integer types as well).

What changes are included in this PR?

The biggest change is adding the type_length from the Parquet schema to the encoder and decoder interface. This is necessary to handle FIXED_LEN_BYTE_ARRAY data.

Are there any user-facing changes?

Adds new data types to an existing encoding. May require additional documentation.

etseidl · 2024-07-30T18:03:55Z

Note to reviewers: I could use some help for the following: https://github.com/etseidl/arrow-rs/blob/1c1af32c097df124d00e0ecf84ae72fdc629e250/parquet/src/encodings/decoding/byte_stream_split_decoder.rs#L151-L158

IIUC, slice_as_bytes cannot be used for the FLBA data because the bytes may not all be contiguous in memory. So on decode this was the only way I could come up with to get the decoded bytes into their proper location in the output.

etseidl · 2024-07-30T19:56:41Z

I have verified that https://github.com/apache/parquet-testing/blob/master/data/byte_stream_split_extended.gzip.parquet can be read properly by both parquet-read and parquet-rewrite (and a modified parquet-rewrite can round trip properly).

alamb · 2024-07-30T20:01:49Z

I have verified that https://github.com/apache/parquet-testing/blob/master/data/byte_stream_split_extended.gzip.parquet can be read properly by both parquet-read and parquet-rewrite (and a modified parquet-rewrite can round trip properly).

🤔 it would be great to add some sort of test that shows this -- I was hoping we already had tests that read parquet files and verified the results, but sadly it appears we do not.

I suppose this is one of the things that the parquet compatibility tests I proposed on apache/parquet-format#441 would handle

alamb · 2024-07-30T20:02:27Z

parquet/src/arrow/arrow_reader/mod.rs

@@ -1641,6 +1643,86 @@ mod tests {
        assert_eq!(row_count, 300);
    }

+    #[test]
+    fn test_read_extended_byte_stream_split() {
+        let path = format!(


👍 I see here we did have the test 👍

Yes, that tests one path, but this bypasses the BSS decoder in encodings::decoding::byte_stream_split_decoder. parquet-read exercises that path, so I hope to recreate that path (goes through serialized file reader) in an additional test.

I see this test implements the suggestion from https://github.com/apache/parquet-testing/blob/master/data/README.md#additional-types

To check conformance of a BYTE_STREAM_SPLIT decoder, read each BYTE_STREAM_SPLIT-encoded column and compare the decoded values against the values from the corresponding PLAIN-encoded column. The values should be equal.

However, when I double checked the values against what pyarrow reports, they didn't seem to match 🤔

I printed out the f16 column:

f16_col: PrimitiveArray<Float16> [ 10.3046875, 8.9609375, 10.75, 10.9375, 8.046875, 8.6953125, 10.125, 9.6875, 9.984375, 9.1484375, ...108 elements..., 11.6015625, 9.7578125, 8.9765625, 10.1796875, 10.21875, 11.359375, 10.8359375, 10.359375, 11.4609375, 8.8125, ]

f32_col: PrimitiveArray<Float32> [ 8.827992, 9.48172, 11.511229, 10.637534, 9.301069, 8.986282, 10.032783, 8.78344, 9.328859, 10.31201, ...52 elements..., 7.6898966, 10.054354, 9.528224, 10.459386, 10.701954, 10.138242, 10.760133, 10.229212, 10.530065, 9.295327, ]

Here is what python told me:

Python 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('byte_stream_split_extended.gzip.parquet')
>>> table
pyarrow.Table
float16_plain: halffloat
float16_byte_stream_split: halffloat
float_plain: float
float_byte_stream_split: float
double_plain: double
double_byte_stream_split: double
int32_plain: int32
int32_byte_stream_split: int32
int64_plain: int64
int64_byte_stream_split: int64
flba5_plain: fixed_size_binary[5]
flba5_byte_stream_split: fixed_size_binary[5]
decimal_plain: decimal128(7, 3)
decimal_byte_stream_split: decimal128(7, 3)
----
float16_plain: [[18727,18555,18784,18808,18438,...,18573,18770,18637,18687,18667]]
float16_byte_stream_split: [[18727,18555,18784,18808,18438,...,18573,18770,18637,18687,18667]]
float_plain: [[10.337575,11.407482,10.090585,10.643939,7.9498277,...,10.138242,10.760133,10.229212,10.530065,9.295327]]
float_byte_stream_split: [[10.337575,11.407482,10.090585,10.643939,7.9498277,...,10.138242,10.760133,10.229212,10.530065,9.295327]]
double_plain: [[9.82038858616854,10.196776096656958,10.820528475417419,9.606258827775427,10.521167255732113,...,9.576393393539162,9.47941158714459,10.812601287753644,10.241659719820838,8.225037940357872]]
double_byte_stream_split: [[9.82038858616854,10.196776096656958,10.820528475417419,9.606258827775427,10.521167255732113,...,9.576393393539162,9.47941158714459,10.812601287753644,10.241659719820838,8.225037940357872]]
int32_plain: [[24191,41157,7403,79368,64983,...,3584,93802,95977,73925,10300]]
int32_byte_stream_split: [[24191,41157,7403,79368,64983,...,3584,93802,95977,73925,10300]]
int64_plain: [[293650000000,41079000000,51248000000,246066000000,572141000000,...,294755000000,343501000000,663621000000,976709000000,836245000000]]
int64_byte_stream_split: [[293650000000,41079000000,51248000000,246066000000,572141000000,...,294755000000,343501000000,663621000000,976709000000,836245000000]]

the pyarrow output looks like my parquet-read output (with the exception of the f16 columns). I'm not sure what happened with the f32_col above, but I did find those values further down in the output. Weird batching?

The fact that the existing, non-BSS columns (not changed by this PR) come back the same gives me confidence that the code is doing the right thing. I just found it strange that python seemed to give me a different result

etseidl · 2024-07-30T20:06:16Z

I suppose this is one of the things that the parquet compatibility tests I proposed on apache/parquet-format#441 would handle

Yes! ❤️

alamb

Thank you @etseidl

I got some wonky results of reading, which I don't really understand (maybe I did something wrong)

I am also not sure about the changes to impl<T: DataType> Decoder<T> for ByteStreamSplitDecoder<T> { but I left some suggestions / comments.

alamb · 2024-07-30T20:16:28Z

parquet/src/arrow/arrow_reader/mod.rs

@@ -1641,6 +1643,86 @@ mod tests {
        assert_eq!(row_count, 300);
    }

+    #[test]
+    fn test_read_extended_byte_stream_split() {
+        let path = format!(


I see this test implements the suggestion from https://github.com/apache/parquet-testing/blob/master/data/README.md#additional-types

To check conformance of a BYTE_STREAM_SPLIT decoder, read each BYTE_STREAM_SPLIT-encoded column and compare the decoded values against the values from the corresponding PLAIN-encoded column. The values should be equal.

However, when I double checked the values against what pyarrow reports, they didn't seem to match 🤔

I printed out the f16 column:

f16_col: PrimitiveArray<Float16> [ 10.3046875, 8.9609375, 10.75, 10.9375, 8.046875, 8.6953125, 10.125, 9.6875, 9.984375, 9.1484375, ...108 elements..., 11.6015625, 9.7578125, 8.9765625, 10.1796875, 10.21875, 11.359375, 10.8359375, 10.359375, 11.4609375, 8.8125, ]

f32_col: PrimitiveArray<Float32> [ 8.827992, 9.48172, 11.511229, 10.637534, 9.301069, 8.986282, 10.032783, 8.78344, 9.328859, 10.31201, ...52 elements..., 7.6898966, 10.054354, 9.528224, 10.459386, 10.701954, 10.138242, 10.760133, 10.229212, 10.530065, 9.295327, ]

Here is what python told me:

Python 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('byte_stream_split_extended.gzip.parquet')
>>> table
pyarrow.Table
float16_plain: halffloat
float16_byte_stream_split: halffloat
float_plain: float
float_byte_stream_split: float
double_plain: double
double_byte_stream_split: double
int32_plain: int32
int32_byte_stream_split: int32
int64_plain: int64
int64_byte_stream_split: int64
flba5_plain: fixed_size_binary[5]
flba5_byte_stream_split: fixed_size_binary[5]
decimal_plain: decimal128(7, 3)
decimal_byte_stream_split: decimal128(7, 3)
----
float16_plain: [[18727,18555,18784,18808,18438,...,18573,18770,18637,18687,18667]]
float16_byte_stream_split: [[18727,18555,18784,18808,18438,...,18573,18770,18637,18687,18667]]
float_plain: [[10.337575,11.407482,10.090585,10.643939,7.9498277,...,10.138242,10.760133,10.229212,10.530065,9.295327]]
float_byte_stream_split: [[10.337575,11.407482,10.090585,10.643939,7.9498277,...,10.138242,10.760133,10.229212,10.530065,9.295327]]
double_plain: [[9.82038858616854,10.196776096656958,10.820528475417419,9.606258827775427,10.521167255732113,...,9.576393393539162,9.47941158714459,10.812601287753644,10.241659719820838,8.225037940357872]]
double_byte_stream_split: [[9.82038858616854,10.196776096656958,10.820528475417419,9.606258827775427,10.521167255732113,...,9.576393393539162,9.47941158714459,10.812601287753644,10.241659719820838,8.225037940357872]]
int32_plain: [[24191,41157,7403,79368,64983,...,3584,93802,95977,73925,10300]]
int32_byte_stream_split: [[24191,41157,7403,79368,64983,...,3584,93802,95977,73925,10300]]
int64_plain: [[293650000000,41079000000,51248000000,246066000000,572141000000,...,294755000000,343501000000,663621000000,976709000000,836245000000]]
int64_byte_stream_split: [[293650000000,41079000000,51248000000,246066000000,572141000000,...,294755000000,343501000000,663621000000,976709000000,836245000000]]

parquet/src/encodings/decoding.rs

alamb · 2024-07-30T20:22:18Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

@@ -76,11 +94,32 @@ impl<T: DataType> Decoder<T> for ByteStreamSplitDecoder<T> {
        let num_values = buffer.len().min(total_remaining_values);
        let buffer = &mut buffer[..num_values];

+        let type_size = match T::get_physical_type() {


We can probably figure out some way to encode this in the trait -- either Decoder directly or maybe some function

What if we made ByteStreamSplitDecoder also be parameterized on the width in bytes:

impl<T: DataType, const W: usize> Decoder<T> for ByteStreamSplitDecoder<T, W> { ...

That would require knowing all the possible sizes (are they known a priori?)

For FIXED_LEN_BYTE_ARRAY the size can be anything :( That's why I had to add non-parameterized versions of the split_streams and join_streams. I added some additional likely cases (2 for FLOAT16, 16 for UUID), but for FLBA(5) there's not much you can do.

🤔 yeah -- maybe that is argument enough for creating a different decoder VariableSizedByteStreamSplitDecoder 🤔

Oh, that's a good idea. I was trying to specialize a ByteStreamSplitDecoder for FixedLenByteArrayType but that didn't work so well 😅. An entirely new decoder would get rid of all the janky casting and such.
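For readers following along, here is a minimal sketch of what non-parameterized split and join routines could look like. The names `split_streams_variable` and `join_streams_variable` follow the naming suggested later in this review; the signatures and bodies are assumptions for illustration, not the code in this PR.

```rust
/// Encode direction: scatter byte `j` of every value into stream `j`.
/// `src` holds values of `type_size` bytes each, back to back (PLAIN layout).
/// Illustrative sketch only; not the PR's implementation.
fn split_streams_variable(src: &[u8], dst: &mut [u8], type_size: usize) {
    assert_eq!(src.len(), dst.len());
    assert!(type_size > 0 && src.len() % type_size == 0);
    let stride = src.len() / type_size; // number of values
    for i in 0..stride {
        for j in 0..type_size {
            dst[j * stride + i] = src[i * type_size + j];
        }
    }
}

/// Decode direction: gather byte `j` of value `i` back from stream `j`.
fn join_streams_variable(src: &[u8], dst: &mut [u8], type_size: usize) {
    assert_eq!(src.len(), dst.len());
    assert!(type_size > 0 && src.len() % type_size == 0);
    let stride = src.len() / type_size;
    for i in 0..stride {
        for j in 0..type_size {
            dst[i * type_size + j] = src[j * stride + i];
        }
    }
}
```

With two FLBA(5) values, the ten PLAIN bytes are interleaved into five two-byte streams, and joining restores the original order.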

alamb · 2024-07-30T20:22:53Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

        }
        self.values_decoded += num_values;

+        // FIXME(ets): there's got to be a better way to do this


Yeah, I agree this is the thing we should try and figure out

alamb · 2024-07-31T21:33:27Z

I ran out of time to give this another look today but will try to get it tomorrow

etseidl · 2024-08-01T16:26:28Z

I ran out of time to give this another look today but will try to get it tomorrow

Thanks @alamb, but no hurry. I'm still thinking about additional tests. Once you have a look at the decoder changes, let me know if you want the same split done for the encoding side (i.e. add a VariableWidthByteStreamSplitEncoder). I tried this out, but that wound up being a pretty big change because it requires passing a column descriptor when getting an Encoder.


alamb

Thank you @etseidl -- I went over this again and I think it looks very nice and I think we could merge it as is

The only thing I want to do before doing so is run some benchmarks to make sure it doesn't have some unexpected performance ramifications. I have started these off and will report back.

I tried this out, but that wound up being a pretty big change because it requires passing a column descriptor when getting an Encoder.

I don't understand this comment -- it looks like you implemented VariableWidthByteStreamSplitEncoder

alamb · 2024-08-02T16:25:54Z

parquet/src/arrow/arrow_reader/mod.rs

@@ -1641,6 +1643,86 @@ mod tests {
        assert_eq!(row_count, 300);
    }

+    #[test]
+    fn test_read_extended_byte_stream_split() {
+        let path = format!(


The fact that the existing, non-BSS columns (not changed by this PR) come back the same gives me confidence that the code is doing the right thing. I just found it strange that python seemed to give me a different result

alamb · 2024-08-02T16:41:12Z

parquet/src/encodings/decoding.rs

@@ -27,6 +27,9 @@ use super::rle::RleDecoder;
 use crate::basic::*;
 use crate::data_type::private::ParquetValueType;
 use crate::data_type::*;
+use crate::encodings::decoding::byte_stream_split_decoder::{
+    ByteStreamSplitDecoder, VariableWidthByteStreamSplitDecoder,


I think this is a good model -- to have a ByteStreamSplitDecoder and a VariableWidthByteStreamSplitDecoder (I realize I also partly suggested it but I like how it looks)

Thanks again! It was a good idea 😄

alamb · 2024-08-02T16:41:56Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

@@ -62,6 +63,22 @@ fn join_streams_const<const TYPE_SIZE: usize>(
    }
 }

+// Like the above, but type_size is not known at compile time.


maybe it could be called join_streams_variable to match the name of the decoder

alamb · 2024-08-02T16:43:12Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

+    fn set_data(&mut self, data: Bytes, num_values: usize) -> Result<()> {
+        // Rough check that all data elements are the same length
+        if data.len() % self.type_width != 0 {
+            return Err(general_err!("Input data is not of fixed length"));


Could we please make this slightly more informative -- something like data length {} is not a multiple of type width {}
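A minimal sketch of the more informative check, assuming a hypothetical stand-in struct (`WidthCheckedInput`) and a plain `String` error in place of the crate's `general_err!` macro and `Result` type:

```rust
/// Hypothetical stand-in for the decoder state that `set_data` touches.
struct WidthCheckedInput {
    type_width: usize,
    encoded: Vec<u8>,
}

impl WidthCheckedInput {
    /// Reject input whose length cannot hold a whole number of values,
    /// reporting both numbers so the failure is easy to diagnose.
    fn set_data(&mut self, data: &[u8]) -> Result<(), String> {
        // The zero check runs first so the modulo below cannot panic.
        if self.type_width == 0 || data.len() % self.type_width != 0 {
            return Err(format!(
                "data length {} is not a multiple of type width {}",
                data.len(),
                self.type_width
            ));
        }
        self.encoded = data.to_vec();
        Ok(())
    }
}
```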

alamb · 2024-08-02T16:45:00Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

+
+        let stride = self.encoded_bytes.len() / type_size;
+        match type_size {
+            2 => join_streams_const::<2>(


since this is the variable length decoder, is there any reason to generate code for these special case lengths (2, 4, ...)? As in it could simply call join_streams directly 🤔

I don't think it would be all that bad but it also may be unnecessary

My assumption (unproven) is that the parameterized join_streams is faster. So the special cases are for known logical types that use FLBA as the physical type (although I should probably remove 4 and 8). If there is no advantage, then yes, the variable width decoder should just use the non-parameterized version (and perhaps the parameterized version could just go away).

I think this is fine to keep
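The dispatch being discussed can be sketched as follows. The simplified signatures, the `join_dispatch` helper, and the particular widths in the `match` are assumptions for illustration; the real functions in the PR take additional arguments.

```rust
/// Width known at compile time, so the compiler may unroll the inner loop.
fn join_streams_const<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8]) {
    let stride = src.len() / TYPE_SIZE;
    for i in 0..stride {
        for j in 0..TYPE_SIZE {
            dst[i * TYPE_SIZE + j] = src[j * stride + i];
        }
    }
}

/// Same operation with the width only known at run time.
fn join_streams_variable(src: &[u8], dst: &mut [u8], type_size: usize) {
    let stride = src.len() / type_size;
    for i in 0..stride {
        for j in 0..type_size {
            dst[i * type_size + j] = src[j * stride + i];
        }
    }
}

/// Hypothetical dispatcher: use the const variant for widths tied to known
/// logical types (2 for Float16, 16 for UUID), otherwise fall back.
fn join_dispatch(src: &[u8], dst: &mut [u8], type_size: usize) {
    assert!(type_size > 0 && src.len() % type_size == 0);
    assert_eq!(src.len(), dst.len());
    match type_size {
        2 => join_streams_const::<2>(src, dst),
        16 => join_streams_const::<16>(src, dst),
        _ => join_streams_variable(src, dst, type_size),
    }
}
```

Both paths produce identical output; the only difference is whether the inner loop bound is a compile-time constant.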

I modified the benchmark to work with f16 as FixedLenByteArray(2). Good news is that using the templated _static variants is significantly faster for both encode and decode.

The bad news is that FixedLenByteArray is very slow. This is not shocking due to the need for so many buffer copies. Get it working, then get it working fast, right? 😉

% cargo bench -p parquet --bench encoding --all-features -- --baseline bssopt
   Compiling parquet v52.2.0 (/Users/seidl/src/arrow-rs/parquet)
    Finished `bench` profile [optimized] target(s) in 45.92s
     Running benches/encoding.rs (target/release/deps/encoding-534b69246994059e)

encoding: dtype=parquet::data_type::FixedLenByteArray, encoding=BYTE_STREAM_SPLIT
                        time:   [142.70 µs 144.02 µs 145.71 µs]
                        change: [+26.586% +27.751% +29.264%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

dtype=parquet::data_type::FixedLenByteArray, encoding=BYTE_STREAM_SPLIT encoded as 32768 bytes
decoding: dtype=parquet::data_type::FixedLenByteArray, encoding=BYTE_STREAM_SPLIT
                        time:   [392.38 µs 393.06 µs 393.77 µs]
                        change: [+2.0067% +2.6708% +3.2941%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

encoding: dtype=f32, encoding=BYTE_STREAM_SPLIT
                        time:   [44.729 µs 46.314 µs 49.430 µs]
                        change: [-4.2202% -1.5434% +2.7807%] (p = 0.56 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

dtype=f32, encoding=BYTE_STREAM_SPLIT encoded as 65536 bytes
decoding: dtype=f32, encoding=BYTE_STREAM_SPLIT
                        time:   [38.613 µs 38.697 µs 38.784 µs]
                        change: [-0.0019% +0.5500% +1.0931%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

encoding: dtype=f64, encoding=BYTE_STREAM_SPLIT
                        time:   [108.86 µs 109.25 µs 109.66 µs]
                        change: [-4.3488% -3.0332% -1.8343%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

dtype=f64, encoding=BYTE_STREAM_SPLIT encoded as 131072 bytes
decoding: dtype=f64, encoding=BYTE_STREAM_SPLIT
                        time:   [81.127 µs 81.343 µs 81.566 µs]
                        change: [-3.6443% -2.9616% -2.2527%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

The bad news is that FixedLenByteArray is very slow. This is not shocking due to the need for so many buffer copies. Get it working, then get it working fast, right? 😉

Yes, I think that is right

Given what I saw of the code, decoding FixedLengthByteArray as individual Buffers (which each have an offset + length + arc) is going to be pretty slow.

In other words, I don't think the FixedLengthByteArray slowness is due to anything specific to BYTE_STREAM_SPLIT. If we wanted to make it faster we would likely have to change how the parquet decoder represents the type

For example, we would likely not use this type: https://docs.rs/parquet/latest/parquet/data_type/struct.FixedLenByteArray.html

The ArrowReader has a bunch of specialized implementations for various array types that write directly into the arrow implementation (rather than into the parquet types and then into the arrow types).

If anyone cares about reading fixed length binary more quickly from parquet, it is probably good to take a look at how to optimize that path (FYI @samuelcolvin and @westonpace who I think have been looking at FixedWidthBinary recently)

Plus the numbers here are for the row-based reader IIUC, so there shouldn't be much expectation of high performance anyway. And hopefully users won't be picking this encoding for FLBA very often. The motivating use case for this was Float16, but perhaps some small decimals encoded with FLBA would benefit as well.

One other in-the-weeds consideration is how cache unfriendly this encoding is. If you think of PLAIN data as a num_vals X type_width matrix, BSS is transposing that matrix. If type_width gets too large, there will be cache misses galore without some type of blocking during the transpose operation.
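To make the blocking idea concrete, here is one possible sketch of a tiled split. The function name `split_streams_blocked` and the tile size of 64 values are hypothetical and untuned; whether this actually reduces cache misses would need to be benchmarked.

```rust
/// Split `src` (a num_vals x type_width row-major matrix of bytes) into
/// `type_width` streams, handling a tile of `BLOCK` values at a time.
/// Illustrative sketch only; not the PR's implementation.
fn split_streams_blocked(src: &[u8], dst: &mut [u8], type_width: usize) {
    const BLOCK: usize = 64; // hypothetical tile size, not tuned
    assert_eq!(src.len(), dst.len());
    assert!(type_width > 0 && src.len() % type_width == 0);
    let num_vals = src.len() / type_width;
    let mut start = 0;
    while start < num_vals {
        let end = (start + BLOCK).min(num_vals);
        // Within a tile, finish one output stream before moving to the next:
        // writes are sequential, and the tile's source bytes
        // (BLOCK * type_width of them) are reused across all streams.
        for j in 0..type_width {
            for i in start..end {
                dst[j * num_vals + i] = src[i * type_width + j];
            }
        }
        start = end;
    }
}
```

The output is byte-for-byte the same as an untiled transpose; only the order in which memory is visited changes.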

row-based reader IIUC

This is the key observation here, the row based reader is forced to perform an allocation for every value read and this is absolutely catastrophic from a performance standpoint. As @alamb points out the arrow readers have optimised codepaths for these types that avoid this issue, and should be preferred in use-cases that care about performance. We could probably document this more aggressively tbh...

alamb · 2024-08-02T16:50:54Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

+
+        // FIXME(ets): there's got to be a better way to do this
+        for i in 0..num_values {
+            if let Some(bi) = buffer[i].as_mut_any().downcast_mut::<FixedLenByteArray>() {


Shouldn't this error/panic if the value isn't actually a FixedLenByteArray?

Something like

let bi = buffer[i]
    .as_mut_any()
    .downcast_mut::<FixedLenByteArray>()
    .expect("Decoding fixed length byte array");

hang on, I have a better idea...

So this is what I came up with:

// create a buffer from the vec so far (and leave a new Vec in its place)
let vec_with_data = std::mem::take(&mut tmp_vec);
// convert Vec to Bytes (which is a ref counted wrapper)
let bytes_with_data = Bytes::from(vec_with_data);
for (i, bi) in buffer.iter_mut().enumerate().take(num_values) {
    // Get a view into the data, without also copying the bytes
    let data = bytes_with_data.slice(i * type_size..(i + 1) * type_size);
    let bi = bi.as_mut_any()
        .downcast_mut::<FixedLenByteArray>()
        .expect("Decoding fixed length byte array");
    bi.set_data(data);
}

I think it avoids a bunch of allocations (only does one allocation for each batch) but it is still pretty bad in terms of the downcast_mut stuff 🤮. I suspect we would need to add some other trait method to DataType (like set_from_bytes or something) to make it work
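The "one allocation per batch, one view per value" idea can be illustrated without the `bytes` crate. The sketch below is a std-only analogue built on `Arc<[u8]>`; `SharedSlice` and `views_into_batch` are hypothetical names, and unlike `Bytes::from(Vec<u8>)` the `Arc::from` conversion here copies the data once into the shared allocation.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical std-only analogue of a `Bytes` slice: a shared, reference
/// counted buffer plus the byte range this value occupies in it.
#[derive(Clone)]
struct SharedSlice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

/// Hand out one view per decoded value. Creating a view only bumps the
/// reference count; no value gets its own heap buffer.
fn views_into_batch(decoded: Vec<u8>, type_size: usize) -> Vec<SharedSlice> {
    assert!(type_size > 0 && decoded.len() % type_size == 0);
    let num_values = decoded.len() / type_size;
    let buf: Arc<[u8]> = Arc::from(decoded); // single shared allocation
    (0..num_values)
        .map(|i| SharedSlice {
            buf: Arc::clone(&buf),
            range: i * type_size..(i + 1) * type_size,
        })
        .collect()
}
```

Each returned view keeps the whole batch buffer alive, which is the same trade-off the `Bytes::slice` approach makes.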

alamb · 2024-08-02T16:51:23Z

parquet/src/encodings/encoding/byte_stream_split_encoder.rs

@@ -53,13 +52,24 @@ fn split_streams_const<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8]) {
    }
 }

+// Like above, but type_size is not known at compile time.
+fn split_streams(src: &[u8], dst: &mut [u8], type_size: usize) {


ditto here -- maybe

Suggested change

fn split_streams(src: &[u8], dst: &mut [u8], type_size: usize) {

fn split_streams_variable(src: &[u8], dst: &mut [u8], type_size: usize) {

etseidl · 2024-08-02T16:57:57Z

I tried this out, but that wound up being a pretty big change because it requires passing a column descriptor when getting an Encoder.

I don't understand this comment -- it looks like you implemented VariableWidthByteStreamSplitEncoder

Should have gone back and edited...after making some changes for the test code, I figured most of the changes needed were already present, so I went ahead and removed the byte width from the Encoder interface. Sorry about that.

alamb

I ran the benchmarks and I see this which suggests that this branch is slower than master somehow for f32. I will rerun to see if I can reproduce the results

group                                              bss                                    master
-----                                              ---                                    ------
decoding: dtype=f32, encoding=BYTE_STREAM_SPLIT    1.00     30.5±0.03µs        ? ?/sec    1.00     30.5±0.04µs        ? ?/sec
decoding: dtype=f64, encoding=BYTE_STREAM_SPLIT    1.00     65.3±0.06µs        ? ?/sec    1.00     65.3±0.10µs        ? ?/sec
encoding: dtype=f32, encoding=BYTE_STREAM_SPLIT    1.22     41.6±0.03µs        ? ?/sec    1.00     34.2±0.05µs        ? ?/sec
encoding: dtype=f64, encoding=BYTE_STREAM_SPLIT    1.01     95.1±0.79µs        ? ?/sec    1.00     93.8±0.72µs        ? ?/sec

And the next run shows the same result somehow 🤔

group                                              bss                                    master
-----                                              ---                                    ------
decoding: dtype=f32, encoding=BYTE_STREAM_SPLIT    1.00     30.5±0.13µs        ? ?/sec    1.00     30.5±0.04µs        ? ?/sec
decoding: dtype=f64, encoding=BYTE_STREAM_SPLIT    1.00     65.2±0.05µs        ? ?/sec    1.00     65.3±0.05µs        ? ?/sec
encoding: dtype=f32, encoding=BYTE_STREAM_SPLIT    1.23     41.6±0.09µs        ? ?/sec    1.00     33.7±0.25µs        ? ?/sec
encoding: dtype=f64, encoding=BYTE_STREAM_SPLIT    1.01     94.7±0.81µs        ? ?/sec    1.00     93.4±0.32µs        ? ?/sec

alamb · 2024-08-02T17:12:18Z

I ran this command to benchmark, btw:

cargo bench -p parquet --bench encoding --all-features -- --save-baseline master

etseidl · 2024-08-02T18:10:55Z

I ran the benchmarks and I see this which suggests that this branch is slower than master somehow for f32. I will rerun to see if I can reproduce the results

Odd. I'll play around some more, but on my laptop I'm not seeing the discrepancy. I'll try on some different hardware.

alamb · 2024-08-02T18:23:21Z

Odd. I'll play around some more, but on my laptop I'm not seeing the discrepancy. I'll try on some different hardware.

I'll double check too

etseidl · 2024-08-02T18:51:59Z

On my workstation I'm seeing pretty consistent numbers for f32, but the new f64 encode is around 1-2% slower and f64 decode is pretty consistently 5% faster. I wonder if the benchmarks are just really sensitive to architecture.

alamb · 2024-08-02T19:45:08Z

On my workstation I'm seeing pretty consistent numbers for f32, but the new f64 encode is around 1-2% slower and f64 decode is pretty consistently 5% faster. I wonder if the benchmarks are just really sensitive to architecture.

Given the numbers are reported in usec and we didn't really change anything related to f32 decoding, I would tend to agree

alamb

I think this looks good to me. Thank you @etseidl

alamb · 2024-08-02T19:46:13Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

+
+        let stride = self.encoded_bytes.len() / type_size;
+        match type_size {
+            2 => join_streams_const::<2>(


I think this is fine to keep

alamb · 2024-08-02T19:46:42Z

parquet/src/encodings/decoding/byte_stream_split_decoder.rs

+        for (i, bi) in buffer.iter_mut().enumerate().take(num_values) {
+            // Get a view into the data, without also copying the bytes
+            let data = bytes_with_data.slice(i * type_size..(i + 1) * type_size);
+            // TODO: perhaps add a `set_from_bytes` method to `DataType` to avoid downcasting


Maybe @tustvold or @XiangpengHao has some suggestion on how to avoid this downcasting

etseidl · 2024-08-05T19:25:55Z

I've done some more performance tweaking. By reworking VariableWidthByteStreamSplitEncoder::put() I've managed to get some pretty good speed ups on the encoding side. This is comparing to a baseline of the current state of my bss branch. I've left in the float benches for comparison, and then have results for FixedLenByteArray(n) where n = 2, 4-8, 16.

encoding: dtype=f32, encoding=BYTE_STREAM_SPLIT
                        time:   [43.710 µs 43.941 µs 44.221 µs]
                        change: [-1.6776% -0.6826% +0.2648%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

encoding: dtype=f64, encoding=BYTE_STREAM_SPLIT
                        time:   [111.19 µs 111.97 µs 112.79 µs]
                        change: [-2.5753% -1.3409% -0.1116%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

encoding: dtype=parquet::data_type::FixedLenByteArray(2), encoding=BYTE_STREAM_SPLIT
                        time:   [49.573 µs 50.004 µs 50.432 µs]
                        change: [-53.988% -53.597% -53.183%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(4), encoding=BYTE_STREAM_SPLIT #2
                        time:   [84.666 µs 85.319 µs 86.056 µs]
                        change: [-44.200% -43.653% -43.183%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(5), encoding=BYTE_STREAM_SPLIT #3
                        time:   [108.97 µs 109.44 µs 110.03 µs]
                        change: [-38.164% -37.665% -37.185%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(6), encoding=BYTE_STREAM_SPLIT #4
                        time:   [128.91 µs 129.86 µs 130.99 µs]
                        change: [-32.994% -32.088% -31.191%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

encoding: dtype=parquet::data_type::FixedLenByteArray(7), encoding=BYTE_STREAM_SPLIT #5
                        time:   [157.03 µs 158.05 µs 159.18 µs]
                        change: [-29.519% -28.944% -28.346%] (p = 0.00 < 0.05)
                        Performance has improved.

encoding: dtype=parquet::data_type::FixedLenByteArray(8), encoding=BYTE_STREAM_SPLIT #6
                        time:   [168.02 µs 171.47 µs 176.56 µs]
                        change: [-6.5555% -5.5390% -4.2909%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

encoding: dtype=parquet::data_type::FixedLenByteArray(16), encoding=BYTE_STREAM_SPLIT #7
                        time:   [898.95 µs 900.20 µs 901.59 µs]
                        change: [-0.7839% -0.2549% +0.2553%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

The new code replaces the current put logic

values.iter().for_each(|x| {
    let bytes = x.as_bytes();
    ...
    self.buffer.extend(bytes)
});

with a parameterized function

fn put_fixed<T: DataType, const TYPE_SIZE: usize>(dst: &mut [u8], values: &[T::T]) {
    let mut idx = 0;
    values.iter().for_each(|x| {
        let bytes = x.as_bytes();
        ...
        for i in 0..TYPE_SIZE {
            dst[idx + i] = bytes[i]
        }
        idx += TYPE_SIZE;
    });
}

for n <= 8. Over 8 bytes it seems better to not use a loop (although the extend() is replaced with copy_from_slice()).

I'll push the new code once I have a roundtrip test to make sure it's working correctly. I also want to benchmark on a faster machine.

In a subsequent PR I think I'll try tackling a more cache friendly transpose for type_size > 8 to see if I can get the FLBA(16) numbers down some.

Edit: changing the loop to copy_from_slice knocked off another 10-20%.

fn put_fixed<T: DataType, const TYPE_SIZE: usize>(dst: &mut [u8], values: &[T::T]) {
    let mut idx = 0;
    values.iter().for_each(|x| {
        let bytes = x.as_bytes();
        ...
        dst[idx..(TYPE_SIZE + idx)].copy_from_slice(&bytes[..TYPE_SIZE]);
        idx += TYPE_SIZE;
    });
}
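For anyone who wants to experiment with this outside the crate, here is a self-contained sketch of the same pattern. The `AsBytes` trait, the `Vec<u8>` value type, and the `put` dispatcher are hypothetical stand-ins; the elided parts of the snippets above are not reproduced, and the real code is generic over `DataType`.

```rust
/// Hypothetical stand-in for the byte access the real value types provide.
trait AsBytes {
    fn as_bytes(&self) -> &[u8];
}

impl AsBytes for Vec<u8> {
    fn as_bytes(&self) -> &[u8] {
        self
    }
}

/// Copy with the width fixed at compile time.
fn put_fixed<V: AsBytes, const TYPE_SIZE: usize>(dst: &mut [u8], values: &[V]) {
    assert_eq!(dst.len(), values.len() * TYPE_SIZE);
    for (chunk, v) in dst.chunks_exact_mut(TYPE_SIZE).zip(values) {
        chunk.copy_from_slice(&v.as_bytes()[..TYPE_SIZE]);
    }
}

/// Copy with the width only known at run time.
fn put_variable<V: AsBytes>(dst: &mut [u8], values: &[V], type_width: usize) {
    assert!(type_width > 0);
    assert_eq!(dst.len(), values.len() * type_width);
    for (chunk, v) in dst.chunks_exact_mut(type_width).zip(values) {
        chunk.copy_from_slice(&v.as_bytes()[..type_width]);
    }
}

/// Hypothetical dispatcher: const path for widths 2 through 8, else fallback.
fn put<V: AsBytes>(dst: &mut [u8], values: &[V], type_width: usize) {
    match type_width {
        2 => put_fixed::<V, 2>(dst, values),
        3 => put_fixed::<V, 3>(dst, values),
        4 => put_fixed::<V, 4>(dst, values),
        5 => put_fixed::<V, 5>(dst, values),
        6 => put_fixed::<V, 6>(dst, values),
        7 => put_fixed::<V, 7>(dst, values),
        8 => put_fixed::<V, 8>(dst, values),
        _ => put_variable(dst, values, type_width),
    }
}
```

This only covers gathering the values into a contiguous buffer; the byte-stream split itself would run on that buffer afterwards.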

etseidl · 2024-08-05T23:52:44Z

I went ahead and optimized split_streams_variable. The time for FLBA(16) dropped a lot.

encoding: dtype=parquet::data_type::FixedLenByteArray, encoding=BYTE_STREAM_SPLIT #7
                        time:   [264.19 µs 267.64 µs 271.35 µs]
                        change: [-70.668% -70.084% -69.554%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

mapleFU · 2024-08-06T02:45:19Z

parquet/src/encodings/encoding/byte_stream_split_encoder.rs

+        // Now copy `values` into the buffer. For `type_width` <= 8 use a fixed size when
+        // performing the copy as it is significantly faster.
+        match self.type_width {
+            2 => put_fixed::<T, 2>(out_buf, values),


So if type_width == 1, still put_variable would be called?

Still wondering why type_width <= 8 is handled by hand

I could throw 1 in, but FLBA(1) is kind of weird. An int would be much faster to deal with for a single byte. I suppose someone might be tempted to use it for a single (ASCII) character field...UTF8 would need multiple bytes anyway.

Still wondering why type_width <= 8 is handled by hand

The reason for the special handling for 2-8 is shown by the benchmarks...those numbers are basically the current code vs using put_variable exclusively. For Float16 using put_fixed is more than 2X faster. The speed advantage pretty much goes away at type_width == 8.
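
A simplified sketch of the dispatch being discussed, assuming values are plain byte slices rather than the crate's `T::T`: the width is matched once, and widths 2 through 8 go to a const-generic copy so the compiler sees a fixed-size `copy_from_slice`. The wrapper `put` and the simplified signatures are hypothetical. Only the names `put_fixed` and `put_variable` come from the PR.

```rust
/// Illustrative sketch only: copy the first `WIDTH` bytes of each value into `dst`.
/// `WIDTH` is a compile-time constant, so each copy has a known, fixed size.
fn put_fixed<const WIDTH: usize>(dst: &mut [u8], values: &[&[u8]]) {
    for (chunk, value) in dst.chunks_exact_mut(WIDTH).zip(values) {
        chunk.copy_from_slice(&value[..WIDTH]);
    }
}

/// Same copy, but with the width only known at run time.
fn put_variable(dst: &mut [u8], values: &[&[u8]], width: usize) {
    for (chunk, value) in dst.chunks_exact_mut(width).zip(values) {
        chunk.copy_from_slice(&value[..width]);
    }
}

/// Hypothetical wrapper: pick the fixed-size path for widths 2 through 8,
/// and fall back to the run-time width otherwise (including width 1).
fn put(dst: &mut [u8], values: &[&[u8]], width: usize) {
    assert!(width > 0);
    assert_eq!(dst.len(), values.len() * width);
    match width {
        2 => put_fixed::<2>(dst, values),
        3 => put_fixed::<3>(dst, values),
        4 => put_fixed::<4>(dst, values),
        5 => put_fixed::<5>(dst, values),
        6 => put_fixed::<6>(dst, values),
        7 => put_fixed::<7>(dst, values),
        8 => put_fixed::<8>(dst, values),
        _ => put_variable(dst, values, width),
    }
}
```

Both paths produce identical output. The difference is only in what the optimizer can do with a constant length, so any speedup should be confirmed with the benchmarks rather than assumed.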

I think it is ok, even if it is unlikely that 5-byte fixed length byte arrays are an important use case.

alamb

Thanks @etseidl and @mapleFU and others.

I think this PR is good enough to merge and keep iterating, if needed, on the master branch

alamb · 2024-08-06T12:46:11Z

parquet/src/encodings/encoding/byte_stream_split_encoder.rs

+        // Now copy `values` into the buffer. For `type_width` <= 8 use a fixed size when
+        // performing the copy as it is significantly faster.
+        match self.type_width {
+            2 => put_fixed::<T, 2>(out_buf, values),


I think it is ok, even if it is unlikely that 5-byte fixed length byte arrays are an important use case.

alamb · 2024-08-06T12:46:54Z

🚀

Signed-off-by: dependabot[bot] <support@github.com> * Update example for new sysinfo API --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * No longer write Parquet column metadata after column chunks *and* in the footer (#6117) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix example tests Signed-off-by: Bugen Zhao <i@bugenzhao.com> --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy * Make display of interval types more pretty (#6006) * improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style. * Update snafu (#5930) * Update Parquet thrift generated structures (#6045) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933) This reverts commit 22e0b4432c9838f2536284015271d3de9a165135. * Revert "Update snafu (#5930)" (#6069) This reverts commit 756b1fb26d1702f36f446faf9bb40a4869c3e840. * Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075) * Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. 
- [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * remove repeated codes to make the codes more concise. (#6080) * Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085) Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData * no longer write inline column metadata * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add some more documentation * remove write_metadata from PageWriter --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: kamille <caoruiqiu.crq@antgroup.com> Co-authored-by: Jesse <github@jessebakker.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * add filter benchmark for fsb (#6186) * Add support for `StringView` and `BinaryView` statistics in `StatisticsConverter` (#6181) * Add StringView and BinaryView support for the macro `get_statistics` * Add StringView and BinaryView support for the macro `get_data_page_statistics` * add tests to cover the support for StringView and BinaryView in the macro get_data_page_statistics * found potential bugs and ignore the tests * fake alarm! 
no bugs, fix the code by initiating all batches to have 5 rows * make the get_stat StringView and BinaryView tests cover bytes greater than 12 * Benchmarks for `bool_and` (#6189) * Fix typo in documentation of Float64Array (#6188) * feat(parquet): Implement AsyncFileWriter for `object_store::buffered::BufWriter` (#6013) * feat(parquet): Implement AsyncFileWriter for obejct_store::BufWriter Signed-off-by: Xuanwo <github@xuanwo.io> * Fix build Signed-off-by: Xuanwo <github@xuanwo.io> * Bump object_store Signed-off-by: Xuanwo <github@xuanwo.io> * Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Address comments Signed-off-by: Xuanwo <github@xuanwo.io> * Add comments Signed-off-by: Xuanwo <github@xuanwo.io> * Make it better to read Signed-off-by: Xuanwo <github@xuanwo.io> * Fix docs Signed-off-by: Xuanwo <github@xuanwo.io> --------- Signed-off-by: Xuanwo <github@xuanwo.io> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Support Parquet `BYTE_STREAM_SPLIT` for INT32, INT64, and FIXED_LEN_BYTE_ARRAY primitive types (#6159) * add todos to help trace flow * add support for byte_stream_split encoding for INT32 and INT64 data * byte_stream_split encoding for fixed_len_byte_array * revert changes to Decoder and add VariableWidthByteStreamSplitDecoder * remove set_type_width as it is now unused * begin implementing roundtrip test * move test * clean up some documentation * add test of byte_stream_split with flba * add check for and test of mismatched sizes * remove type_length from Encoder and add VaribleWidthByteStreamSplitEncoder * fix clippy error * change type of argument to new() * formatting * add another test * add variable to split/join streams for FLBA * more informative error message * avoid buffer copies in decoder per suggestion from review * add roundtrip test * optimized version...but clippy complains * clippy was right...replace loop with copy_from_slice * fix test * optimize split_streams_variable for long type 
widths * Reduce bounds check in `RowIter`, add `unsafe Rows::row_unchecked` (#6142) * update * update comment * update row-iter bench * make clippy happy * Update zstd-sys requirement from >=2.0.0, <2.0.13 to >=2.0.0, <2.0.14 (#6196) Updates the requirements on [zstd-sys](https://github.com/gyscos/zstd-rs) to permit the latest version. - [Release notes](https://github.com/gyscos/zstd-rs/releases) - [Commits](https://github.com/gyscos/zstd-rs/commits) --- updated-dependencies: - dependency-name: zstd-sys dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Add `ThriftMetadataWriter` for writing Parquet metadata (#6197) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix example tests Signed-off-by: Bugen Zhao <i@bugenzhao.com> --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy * Make display of interval types more pretty (#6006) * improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style. * Update snafu (#5930) * Update Parquet thrift generated structures (#6045) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933) This reverts commit 22e0b4432c9838f2536284015271d3de9a165135. * Revert "Update snafu (#5930)" (#6069) This reverts commit 756b1fb26d1702f36f446faf9bb40a4869c3e840. 
* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075) * Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * remove repeated codes to make the codes more concise. (#6080) * Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085) Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. 
- [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * Upgrade protobuf definitions to flightsql 17.0 (#6133) * Update FlightSql.proto to version 17.0 Adds new message CommandStatementIngest and removes `experimental` from other messages. * Regenerate flight sql protocol This upgrades the file to version 17.0 of the protobuf definition. * Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata` * fix loading in test by etseidl Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * add rough equivalence test * one more check * make clippy happy * separate tests that require arrow into a separate module * add histograms to to_thrift() --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: kamille <caoruiqiu.crq@antgroup.com> Co-authored-by: Jesse <github@jessebakker.com> Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Douglas Anderson <djanderson@users.noreply.github.com> Co-authored-by: Ed Seidl <etseidl@live.com> * Add (more) Parquet Metadata 
Documentation (#6184) * Minor: Add (more) Parquet Metadata Documenation * fix clippy * fix parquet type is_optional comment (#6192) Co-authored-by: jp0317 <zjpzlz@gmail.com> * Remove duplicated statistics tests in parquet (#6190) * move all tests to parquet/tests/arrow_reader/statistics.rs, and leave a comment in original file * remove duplicated tests and adjust the empty tests * data file tests brought folders changes * fix lint * add comments Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix: interleave docs suggests itself, not take (#6210) * fix: Correctly handle take on dense union of a single selected type (#6209) * fix: use filter instead of filter_primitive * fix: remove pub(crate) from filter_primitive * fix: run cargo fmt * fix: clippy * Make it clear that StatisticsConverter can not panic (#6187) * Optimize `min_boolean` and `bool_and` (#6144) * Optimize `min_boolean` and `bool_and` Closes #https://github.com/apache/arrow-rs/issues/6103 * use any * Add benchmarks for `BYTE_STREAM_SPLIT` encoded Parquet `FIXED_LEN_BYTE_ARRAY` data (#6204) * save type_width for fixed_len_byte_array * add decimal128 and float16 byte_stream_split benches * add f16 * add decimal128 flba(16) bench * fix(arrow): restrict the range of temporal values produced via `data_gen` (#6205) * fix: random timestamp array * fix: restrict range of randomly generated temporal values * fix: exclusive range used * Support casting between BinaryView <--> Utf8 and LargeUtf8 (#6180) * support cast between binaryview and string * update impl. 
and add bench mark * Add ut for views * Apply coments * feat(object_store): add `PermissionDenied` variant to top-level error (#6194) * feat(object_store): add `PermissionDenied` variant to top-level error * Update object_store/src/lib.rs Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> * refactor: add additional error variant for unauthenticated ops * fix: include path in unauthenticated error --------- Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> * update BYTE_STREAM_SPLIT documentation (#6212) * Add time dictionary coercions (#6208) * Add time dictionary coercions * format * Pass through primitive values * use spaces not tabs everywhere (#6217) * Implement specialized filter kernel for `FixedSizeByteArray` (#6178) * refactor filter for FixedSizeByteArray * fix expect * remove benchmark code * fix * remove from_trusted_len_iter_slice_u8 * fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix: lexsort_to_indices should not fallback to non-lexical sort if the datatype is not supported (#6225) * fix: lexsort_to_indices should not fallback to non-lexical sort if the datatype is not supported * fix clippy * Check error message * Prepare for object_store `0.11.0` release (#6227) * Update version to 0.11.0 * Changelog for 0.11.0 * Remove irrelevant content from changelog * Improve interval parsing (#6211) * improve interval parsing * rename * cleanup * fix formatting * make IntervalParseConfig public * add debug to IntervalParseConfig * fmt * Add LICENSE and NOTICE files to object_store (#6234) * Add LICENSE and NOTICE files to object_store * Update object_store/NOTICE.txt Co-authored-by: Xuanwo <github@xuanwo.io> * Update object_store/LICENSE.txt --------- Co-authored-by: Xuanwo <github@xuanwo.io> * Update changelog for object_store 0.11.0 release (#6238) * Minor: Remove non standard footer from LICENSE.txt (#6237) * Minor: Improve Type documentation (#6224) * Minor: Improve 
XXXType documentation * Update arrow-array/src/types.rs Co-authored-by: Marco Neumann <marco@crepererum.net> --------- Co-authored-by: Marco Neumann <marco@crepererum.net> * Add "take" workflow for self-assigning tickets, add "how to find issues" to contributor guide (#6059) * Add "take" workflow for contributors to assign themselves to tickets * Copy datafusion Finding and Creating Issues to work on * Move `ParquetMetadataWriter` to its own module, update documentation (#6202) * Move `ThriftMetadataWriter` and `ParquetMetadataWriter` to a new module * Improve documentation, make pub(crate) * Apply suggestions from code review Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * Add comment side effect of writing column and offset indexes * Document how to write bloom filters * Update parquet/src/file/metadata/writer.rs Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> --------- Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * Modest improvement to FixedLenByteArray BYTE_STREAM_SPLIT arrow decoder (#6222) * replace reserve/push with resize/direct access * remove import * make a bit faster * Improve performance of `FixedLengthBinary` decoding (#6220) * add set_from_bytes to ParquetValueType * change naming of FLBA types so critcmp will work * minor enhance doc for ParquetField (#6239) * Remove unnecessary null buffer construction when converting arrays to a different type (#6244) * create primitive array from iter and nulls * clippy * speed up some more decimals * add optimizations for byte_stream_split * decimal256 * Revert "add optimizations for byte_stream_split" This reverts commit 5d4ae0dc09f95ee9079b46b117fb554f63157564. 
* add comments * Add examples to `StringViewBuilder` and `BinaryViewBuilder` (#6240) * Add examples to `StringViewBuilder` and `BinaryViewBuilder` * add doc link * Implement PartialEq for GenericBinaryArray (#6241) * parquet Statistics - deprecate `has_*` APIs and add `_opt` functions that return `Option<T>` (#6216) * update public api Statistics::min to return an option. I first re-named the existing method to `min_unchecked` and made it internal to the crate. I then added a `pub min(&self) -> Opiton<&T>` method. I figure we can first change the public API before deciding what to do about internal usage. Ref: https://github.com/apache/arrow-rs/issues/6093 * update public api Statistics::max to return an option. I first re-named the existing method to `max_unchecked` and made it internal to the crate. I then added a `pub max(&self) -> Opiton<&T>` method. I figure we can first change the public API before deciding what to do about internal usage. Ref: https://github.com/apache/arrow-rs/issues/6093 * cargo fmt * remove Statistics::has_min_max_set from the public api Ref: https://github.com/apache/arrow-rs/issues/6093 * update impl HeapSize for ValueStatistics to use new min and max api * migrate all tests to new Statistics min and max api * make Statistics::null_count return Option<u64> This removes ambiguity around whether the between all values are non-null or just that the null count stat is missing Ref: https://github.com/apache/arrow-rs/issues/6215 * update expected metadata memory size tests Changing null_count from u64 to Option<u64> increases the memory size and layout of the metadata. I included these tests as a separate commit to call extra attention to it. 
* add TODO question on is_min_max_backwards_compatible * Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * update ValueStatistics::max docs * rename new optional ValueStatistics::max to max_opt Per PR review, we will deprecate the old API instead of introducing a brekaing change. Ref: https://github.com/apache/arrow-rs/pull/6216#pullrequestreview-2236537291 * rename new optional ValueStatistics::min to min_opt * add Statistics:{min,max}_bytes_opt This adds the API and migrates all of the test usage. The old APIs will be deprecated next. * update make_stats_iterator macro to use *_opt methods * deprecate non *_opt Statistics and ValueStatistics methods * remove stale TODO comments * remove has_min_max_set check from make_decimal_stats_iterator The check is unnecessary now that the stats funcs return Option<T> when unset. * deprecate has_min_max_set An internal version was also created because it is used so extensively in testing. * switch to null_count_opt and reintroduce deprecated null_count and has_nulls * remove redundant test assertions of stats._internal_has_min_max_set This removes the assertion from any test that subsequently unwraps both min_opt and max_opt. * replace negated test assertions of stats._internal_has_mix_max_set with assertions on min_opt and max_opt This removes all use of Statistics::_internal_has_min_max_set from the code base, and so it is also removed. * Revert changes to parquet writing, update comments --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Minor: Update DateType::Date64 docs (#6223) * feat(object_store): add support for server-side encryption with customer-provided keys (SSE-C) (#6230) * Add support for server-side encryption with customer-provided keys (SSE-C). * Add SSE-C test using MinIO. 
* Visibility change * add nocapture to verify the test indeed runs * cargo fmt * Update object_store/src/aws/mod.rs use environment variables Co-authored-by: Will Jones <willjones127@gmail.com> * Update object_store/CONTRIBUTING.md use environment variables Co-authored-by: Will Jones <willjones127@gmail.com> * Fix api --------- Co-authored-by: Will Jones <willjones127@gmail.com> * Expose bulk ingest in flight sql client and server (#6201) * Expose CommandStatementIngest as pub in sql module * Add do_put_statement_ingest to FlightSqlService Dispatch this handler for the new CommandStatementIngest command. * Sort list * Implement stub do_put_statement_ingest in example * Refactor helper functions into tests/common/utils * Implement execute_ingest for flight sql client I referenced the C++ implementation here: https://github.com/apache/arrow/commit/0d1ea5db1f9312412fe2cc28363e8c9deb2521ba * Add integration test for sql client execute_ingest * Fix lint clippy::new_without_default * Allow streaming ingest for FlightClient::execute_ingest * Properly return client errors --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * docs: Add parquet_opendal in related projects (#6236) * docs: Add parquet_opendal in related projects * Fix spaces * Avoid infinite loop in bad parquet by checking the number of rep levels (#6232) * check the number of rep levels read from page * minor fix on typo Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add check on record_read as well --------- Co-authored-by: jp0317 <zjpzlz@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Make the bearer token visible in FlightSqlServiceClient (#6254) * Make the bearer token visible in FlightSqlServiceClient * Update client.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add tests for bad parquet files (#6262) * Add tests for bad parquet files * Reenable test * Add test for very subltley different 
file * Update parquet object_store dependency to 0.11.0 (#6264) * Implement date_part for durations (#6246) Signed-off-by: Nick Cameron <nrc@ncameron.org> * feat: further TLS options on ClientOptions: #5034 (#6148) * feat: further TLS options on ClientOptions: #5034 * Rename to Certificate and with_root_certificate, add docs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Improve documentation for MutableArrayData (#6272) * Do not print compression level in schema printer (#6271) The compression level is only used during compression, not decompression, and isn't actually stored in the metadata. Printing it is misleading. * Add `Statistics::distinct_count_opt` and deprecate `Statistics::distinct_count` (#6259) * Fix accessing name from ffi schema (#6273) * Fix accessing name from ffi schema * Add test * ci: use octokit to add assignee (#6267) * Only add encryption headers for for SSE-C in get. (#6260) * Minor: move `FallibleRequestStream` and `FallibleTonicResponseStream` to a module (#6258) * Minor: move FallibleRequestStream and FallibleTonicResponseStream to their own modules * Improve documentation and add links * Minor: `pub use ByteView` in arrow and improve documentation (#6275) * Minor: `pub use ByteView` in arrow and improve documentation * clarify docs more * ci: simplify octokit add assignee (#6280) * Update tower requirement from 0.4.13 to 0.5.0 (#6250) * Update tower requirement from 0.4.13 to 0.5.0 Updates the requirements on [tower](https://github.com/tower-rs/tower) to permit the latest version. - [Release notes](https://github.com/tower-rs/tower/releases) - [Commits](https://github.com/tower-rs/tower/compare/tower-0.4.13...tower-0.5.0) --- updated-dependencies: - dependency-name: tower dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * Add tower version --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix panic in comparison_kernel benchmarks (#6284) * Fix panic in comparison_kernel benchmarks * Add other special case equality kernels * Add other benchmarks * fix reference in doctest to size_of which is not imported by default (#6286) This corrects an issue with this doctest noticed on FreeBSD/amd64 with rustc 1.77.0 * Use `unary()` for array conversion in Parquet array readers, speed up `Decimal128`, `Decimal256` and `Float16` (#6252) * add unary to FixedSizeBinaryArray; use unary for…

etseidl added 3 commits July 30, 2024 10:28

add todos to help trace flow

71ae965

add support for byte_stream_split encoding for INT32 and INT64 data

dddf8a0

byte_stream_split encoding for fixed_len_byte_array

1c1af32

github-actions bot added the parquet Changes to the parquet crate label Jul 30, 2024

alamb reviewed Jul 30, 2024

alamb changed the title ~~Extend Parquet BYTE_STREAM_SPLIT support to INT32, INT64, and FIXED_LEN_BYTE_ARRAY primitive types~~ Support Parquet BYTE_STREAM_SPLIT for INT32, INT64, and FIXED_LEN_BYTE_ARRAY primitive types Jul 30, 2024

alamb reviewed Jul 30, 2024

etseidl added 4 commits July 30, 2024 14:59

revert changes to Decoder and add VariableWidthByteStreamSplitDecoder

75ea319

remove set_type_width as it is now unused

7ce40ae

begin implementing roundtrip test

e7829c3

move test

c0eb828

alamb mentioned this pull request Aug 1, 2024

DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 apache/datafusion#11710

Closed

8 tasks

clean up some documentation

fec8001

etseidl added 8 commits August 1, 2024 10:27

add test of byte_stream_split with flba

b9d4baf

add check for and test of mismatched sizes

f8ee320

remove type_length from Encoder and add VaribleWidthByteStreamSplitEncoder

29c5119

Merge remote-tracking branch 'origin/master' into bss

3e598be

fix clippy error

ef14c7d

change type of argument to new()

6513ffb

formatting

3a650a7

add another test

c63a1ce

alamb reviewed Aug 2, 2024

etseidl added 3 commits August 2, 2024 10:24

add variable to split/join streams for FLBA

09f467d

more informative error message

3fd6bc5

avoid buffer copies in decoder per suggestion from review

c6bb2ef

alamb approved these changes Aug 2, 2024

etseidl added 5 commits August 5, 2024 13:59

add roundtrip test

3f6d944

optimized version...but clippy complains

97d159b

clippy was right...replace loop with copy_from_slice

340eab4

fix test

b2d90ce

optimize split_streams_variable for long type widths

104b72e

mapleFU reviewed Aug 6, 2024

alamb approved these changes Aug 6, 2024

alamb merged commit 2a4f269 into apache:master Aug 6, 2024
16 checks passed

etseidl mentioned this pull request Aug 6, 2024

Add benchmarks for BYTE_STREAM_SPLIT encoded Parquet FIXED_LEN_BYTE_ARRAY data #6203

Closed

etseidl deleted the bss branch August 8, 2024 17:14

etseidl mentioned this pull request Aug 8, 2024

Update documentation for Parquet BYTE_STREAM_SPLIT encoding #6212

Merged

alamb mentioned this pull request Aug 9, 2024

Look into optimizing reading FixedSizeBinary arrays from parquet #6219

Closed

etseidl mentioned this pull request Aug 9, 2024

Improve performance of FixedLengthBinary decoding #6220

Merged

alamb mentioned this pull request Aug 31, 2024

Extend support for BYTE_STREAM_SPLIT to FIXED_LEN_BYTE_ARRAY, INT32, and INT64 primitive types #6048

Closed

	fn split_streams(src: &[u8], dst: &mut [u8], type_size: usize) {
	fn split_streams_variable(src: &[u8], dst: &mut [u8], type_size: usize) {
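
For readers unfamiliar with the encoding, the two signatures above take the value width as a runtime argument, which is what FIXED_LEN_BYTE_ARRAY needs since its width comes from the schema. The following is a minimal sketch of the byte-stream split and its inverse under that assumption. It is not the code merged in this PR: the function names `split_streams_sketch` and `join_streams_sketch` are hypothetical, and the straightforward loops make no attempt at the optimizations mentioned in the commit list.

```rust
/// Hypothetical illustration, not the arrow-rs implementation.
/// `src` holds values back to back, `type_size` bytes each. The output holds
/// `type_size` streams stored consecutively: stream j collects byte j of
/// every value.
fn split_streams_sketch(src: &[u8], dst: &mut [u8], type_size: usize) {
    assert!(type_size > 0, "type_size must be non-zero");
    assert_eq!(src.len(), dst.len(), "src and dst must have equal length");
    assert_eq!(src.len() % type_size, 0, "src must hold whole values");
    let num_values = src.len() / type_size;
    for (i, value) in src.chunks_exact(type_size).enumerate() {
        for (j, &byte) in value.iter().enumerate() {
            // Byte j of value i lands at position i of stream j.
            dst[j * num_values + i] = byte;
        }
    }
}

/// Inverse of `split_streams_sketch`: gather byte j of value i from
/// position i of stream j.
fn join_streams_sketch(src: &[u8], dst: &mut [u8], type_size: usize) {
    assert!(type_size > 0, "type_size must be non-zero");
    assert_eq!(src.len(), dst.len(), "src and dst must have equal length");
    assert_eq!(src.len() % type_size, 0, "src must hold whole values");
    let num_values = src.len() / type_size;
    for (i, value) in dst.chunks_exact_mut(type_size).enumerate() {
        for (j, byte) in value.iter_mut().enumerate() {
            *byte = src[j * num_values + i];
        }
    }
}
```

With two 3-byte values `[1, 2, 3]` and `[4, 5, 6]`, the split output is the three streams `[1, 4]`, `[2, 5]`, `[3, 6]` laid end to end, and joining that output restores the original bytes.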
