Extract add.stats_parsed with wrong type #2312

Closed
yjshen opened this issue Mar 22, 2024 · 2 comments · Fixed by #2405
Labels
bug Something isn't working

Comments

@yjshen
Contributor

yjshen commented Mar 22, 2024

Bug

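During delta log reading, add.stats_parsed is extracted with the wrong type, StringArray (on line 82), which results in a duplicate stats_parsed column in the result batch. Because the column is actually a StructArray, the cast yields None, the early return is skipped, and stats_parsed is generated again from stats.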

fn map_batch(
    batch: RecordBatch,
    stats_schema: ArrowSchemaRef,
    config: &DeltaTableConfig,
) -> DeltaResult<RecordBatch> {
    let stats_col = ex::extract_and_cast_opt::<StringArray>(&batch, "add.stats");
    let stats_parsed_col = ex::extract_and_cast_opt::<StringArray>(&batch, "add.stats_parsed");
    if stats_parsed_col.is_some() {
        return Ok(batch);
    }
    if let Some(stats) = stats_col {
        let stats: Arc<StructArray> =
            Arc::new(json::parse_json(stats, stats_schema.clone(), config)?.into());
        let schema = batch.schema();
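
The failure mode in the snippet above can be illustrated with a small std-only model. This is a sketch under stated assumptions, not the delta-rs or Arrow API: `Column`, `Batch`, `Expected`, and `extract_opt` are hypothetical stand-ins, and the "parsed stats" are reduced to a single number per row. It only shows why an extraction helper that returns nothing on a type mismatch lets the early return be skipped, so a second stats_parsed column gets appended.

```rust
// Simplified stand-ins for Arrow columns; these are NOT the real Arrow types.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Utf8(Vec<String>), // models a StringArray holding raw stats
    Struct(Vec<u64>),  // models a StructArray holding parsed stats
}

// Hypothetical batch: an ordered list of named columns.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Column)>,
}

impl Batch {
    fn get(&self, name: &str) -> Option<&Column> {
        self.columns
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, c)| c)
    }

    // Number of columns carrying the given name.
    fn count(&self, name: &str) -> usize {
        self.columns.iter().filter(|(n, _)| n.as_str() == name).count()
    }
}

// The type the caller asks the helper to cast to.
#[derive(Clone, Copy)]
enum Expected {
    Utf8,
    Struct,
}

// Models an "extract and cast" helper: true only if the column exists
// AND has the expected type. A type mismatch looks like a missing column.
fn extract_opt(batch: &Batch, name: &str, expected: Expected) -> bool {
    matches!(
        (batch.get(name), expected),
        (Some(Column::Utf8(_)), Expected::Utf8) | (Some(Column::Struct(_)), Expected::Struct)
    )
}

// Models map_batch: return early if stats_parsed is already present,
// otherwise derive it from stats and append it.
fn map_batch(mut batch: Batch, stats_parsed_type: Expected) -> Batch {
    if extract_opt(&batch, "stats_parsed", stats_parsed_type) {
        return batch;
    }
    let parsed: Option<Vec<u64>> = match batch.get("stats") {
        Some(Column::Utf8(stats)) => {
            Some(stats.iter().map(|s| s.parse::<u64>().unwrap_or(0)).collect())
        }
        _ => None,
    };
    if let Some(parsed) = parsed {
        batch
            .columns
            .push(("stats_parsed".to_string(), Column::Struct(parsed)));
    }
    batch
}
```

In this model, calling `map_batch` with `Expected::Utf8` on a batch that already holds a struct-typed stats_parsed column appends a second one, while `Expected::Struct` returns the batch unchanged.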

Also, the newly generated stats_parsed from stats has a different array length than other columns.

source: Arrow { source: InvalidArgumentError("Incorrect array length for StructArray field \"stats_parsed\", expected 10 got 14")
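
One way such a length mismatch can arise is when a sliced string array is parsed from its whole backing buffer instead of from its logical values. The sketch below is a std-only model of that situation, assuming a slice shares the parent's bytes and only narrows the offsets; `StrArray`, `docs_from_raw`, and `docs_from_values` are hypothetical names, not Arrow or delta-rs functions, and whether this is the exact cause here is an assumption.

```rust
use std::rc::Rc;

// Simplified model of a variable-length string array: shared bytes plus offsets.
// This is NOT Arrow's API, only an illustration of slicing semantics.
struct StrArray {
    data: Rc<Vec<u8>>,   // backing bytes, shared between an array and its slices
    offsets: Vec<usize>, // len + 1 positions into `data`
}

impl StrArray {
    fn from_values(values: &[String]) -> Self {
        let mut data = Vec::new();
        let mut offsets = vec![0];
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len());
        }
        StrArray { data: Rc::new(data), offsets }
    }

    // Logical number of rows.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    // Slicing narrows the offsets but keeps the whole backing buffer.
    fn slice(&self, start: usize, len: usize) -> Self {
        StrArray {
            data: Rc::clone(&self.data),
            offsets: self.offsets[start..=start + len].to_vec(),
        }
    }

    // Logical value at row `i` of this (possibly sliced) array.
    fn value(&self, i: usize) -> &str {
        std::str::from_utf8(&self.raw_data()[self.offsets[i]..self.offsets[i + 1]]).unwrap()
    }

    // The entire backing buffer, ignoring any slice bounds.
    fn raw_data(&self) -> &[u8] {
        self.data.as_slice()
    }
}

// Splits the raw buffer into newline-delimited documents: sees every row
// of the parent array, not just the rows of the slice.
fn docs_from_raw(arr: &StrArray) -> Vec<&str> {
    std::str::from_utf8(arr.raw_data()).unwrap().lines().collect()
}

// Reads one document per logical row: respects the slice bounds.
fn docs_from_values(arr: &StrArray) -> Vec<&str> {
    (0..arr.len()).map(|i| arr.value(i).trim_end()).collect()
}
```

With 14 newline-terminated documents and a slice of 10 rows, reading the raw buffer in this model yields 14 documents while reading the logical values yields 10, which mirrors the "expected 10 got 14" shape of the error.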

What you expected to happen:

  1. add.stats_parsed should be extracted as a StructArray.
  2. The generated stats_parsed should have the same number of rows as the other columns.

How to reproduce it:

More details:

@yjshen added the bug label Mar 22, 2024
@ion-elgreco
Collaborator

@yjshen can you add some reproducible code?

@sjohnston

I've seen this error when trying to read an old table with tens of thousands of partitions. After I removed old partitions and ran vacuum/compact, the error went away. Maybe the schema changed at some point, or something was wrong with the old data. I'm not sure.

ion-elgreco pushed a commit that referenced this issue Apr 15, 2024
…2405)

# Description
- `stats_parsed` is a StructArray instead of StringArray
- Parsing the `Add` action's `stats` into `stats_parsed` would panic due to the
use of `slice.array_data()`.

# Related Issue(s)

closes #2312 

# Documentation

https://docs.rs/arrow/51.0.0/arrow/array/struct.GenericByteArray.html#method.value_data

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>