Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concat_batches panics with total_len <= bit_len assertion for records with lists #4324

Closed
joshg-ec opened this issue May 31, 2023 · 4 comments · Fixed by #4343
Closed

concat_batches panics with total_len <= bit_len assertion for records with lists #4324

joshg-ec opened this issue May 31, 2023 · 4 comments · Fixed by #4343
Assignees
Labels
arrow Changes to the arrow crate bug

Comments

@joshg-ec
Copy link

joshg-ec commented May 31, 2023

Describe the bug
concat, used by concat_batches, does not appear to allocate sufficient capacities when constructing the MutableArrayData. Concatenating records that contain lists of structs results in the following panic:

assertion failed: total_len <= bit_len
thread 'concat_test' panicked at 'assertion failed: total_len <= bit_len', /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-buffer-40.0.0/src/buffer/boolean.rs:55:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:114:5
   3: arrow_buffer::buffer::boolean::BooleanBuffer::new
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-buffer-40.0.0/src/buffer/boolean.rs:55:9
   4: arrow_data::transform::_MutableArrayData::freeze::{{closure}}
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:81:25
   5: core::bool::<impl bool>::then
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/bool.rs:71:24
   6: arrow_data::transform::_MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:80:21
   7: arrow_data::transform::MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:656:18
   8: arrow_data::transform::_MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:74:37
   9: arrow_data::transform::MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:656:18
  10: arrow_data::transform::_MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:74:37
  11: arrow_data::transform::MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:656:18
  12: arrow_data::transform::_MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:74:37
  13: arrow_data::transform::MutableArrayData::freeze
             at /Users/x/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-40.0.0/src/transform/mod.rs:656:18

To Reproduce
Call concat_batches with RecordBatchs that contain lists of structs (on the order of 20–50 structs in the list per RecordBatch). If I modify the capacity calculation in concat to add a constant factor for lists, the error does not occur:

    let capacity = match d {
        DataType::Utf8 => binary_capacity::<Utf8Type>(arrays),
        DataType::LargeUtf8 => binary_capacity::<LargeUtf8Type>(arrays),
        DataType::Binary => binary_capacity::<BinaryType>(arrays),
        DataType::LargeBinary => binary_capacity::<LargeBinaryType>(arrays),
        DataType::List(_) => {
            Capacities::Array(arrays.iter().map(|a| a.len()).sum::<usize>() + 500) // <- 500 added here
        }
        _ => Capacities::Array(arrays.iter().map(|a| a.len()).sum()),
    };

Expected behavior
No panics when concatenating RecordBatchs with lists.

Additional context
Reproduced with Arrow versions 37–40.

@joshg-ec joshg-ec added the bug label May 31, 2023
@joshg-ec joshg-ec changed the title concat_batches fails with total_len <= bit_len assertion for records with lists concat_batches panics with total_len <= bit_len assertion for records with lists May 31, 2023
@tustvold
Copy link
Contributor

tustvold commented May 31, 2023

Possibly related to #1230. The error would suggest that the validity buffer is not the correct length. I'll take a look in a bit, MutableArrayData is overdue some TLC in this regard (#1225)

@tustvold
Copy link
Contributor

tustvold commented Jun 1, 2023

Are you able to share a reproducer for this?

@joshg-ec
Copy link
Author

joshg-ec commented Jun 1, 2023

@tustvold
Copy link
Contributor

tustvold commented Jun 2, 2023

Thank you, very helpful. It looks like this is #1230, which I will fix now

As a happy accident #4333 fixed your reproducer as it removed the use of extend_nulls when appending structs

@alamb alamb assigned alamb and tustvold and unassigned alamb Jun 2, 2023
@tustvold tustvold added the arrow Changes to the arrow crate label Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants