apacheGH-41317: [C++] Fix crash on invalid Parquet file (apache#41366)
### Rationale for this change

Fixes the crash described in apache#41317, which occurs in TableBatchReader::ReadNext() when reading a corrupted Parquet file.

### What changes are included in this PR?

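Adds a validation step in the Parquet reader that checks that all columns read have the same length, and returns an `Invalid` status otherwise. Also adds a debug-only `DCHECK` that the table passed to `TableBatchReader` is valid, and documents that expectation in `table.h`.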
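
The shape of the length check can be illustrated outside of Arrow. The following is only a minimal sketch under stated assumptions: `CheckSameLength` is a hypothetical helper, not an Arrow or Parquet API, and plain integer lengths stand in for the `length()` of each column that the real code compares.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): takes the length of each column
// and returns an error message if they differ, or std::nullopt if they all
// match. An empty input is treated as valid, mirroring the
// `if (!columns.empty())` guard in the patch.
std::optional<std::string> CheckSameLength(const std::vector<int64_t>& lengths) {
  if (lengths.empty()) {
    return std::nullopt;  // nothing to compare
  }
  const int64_t expected = lengths[0];
  for (std::size_t i = 1; i < lengths.size(); ++i) {
    if (lengths[i] != expected) {
      // Same message as the one used in the patch.
      return std::string("columns do not have the same size");
    }
  }
  return std::nullopt;
}
```

Calling `CheckSameLength({3, 2})` yields the error message, while `CheckSameLength({3, 3, 3})` and an empty input yield no error. Returning early on the first mismatch keeps the check linear in the number of columns.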

### Are these changes tested?

I tested with the reproducer I provided in apache#41317; it now raises a clean error:
```
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [_ for _ in parquet_file.iter_batches()]
  File "test.py", line 3, in <listcomp>
    [_ for _ in parquet_file.iter_batches()]
  File "pyarrow/_parquet.pyx", line 1587, in iter_batches
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: columns do not have the same size
```
I'm not sure whether, or how, unit tests for corrupted datasets should be added.

### Are there any user-facing changes?

No

**This PR contains a "Critical Fix".**

* GitHub Issue: apache#41317

Authored-by: Even Rouault <even.rouault@spatialys.com>
Signed-off-by: mwish <maplewish117@gmail.com>
rouault authored Apr 30, 2024
1 parent de37ee8 commit e4f3146
Showing 3 changed files with 14 additions and 0 deletions.
2 changes: 2 additions & 0 deletions cpp/src/arrow/table.cc
@@ -619,6 +619,7 @@ TableBatchReader::TableBatchReader(const Table& table)
  for (int i = 0; i < table.num_columns(); ++i) {
    column_data_[i] = table.column(i).get();
  }
  DCHECK(table_.Validate().ok());
}

TableBatchReader::TableBatchReader(std::shared_ptr<Table> table)
@@ -632,6 +633,7 @@ TableBatchReader::TableBatchReader(std::shared_ptr<Table> table)
  for (int i = 0; i < owned_table_->num_columns(); ++i) {
    column_data_[i] = owned_table_->column(i).get();
  }
  DCHECK(table_.Validate().ok());
}

std::shared_ptr<Schema> TableBatchReader::schema() const { return table_.schema(); }
2 changes: 2 additions & 0 deletions cpp/src/arrow/table.h
@@ -241,6 +241,8 @@ class ARROW_EXPORT Table {
///
/// The conversion is zero-copy: each record batch is a view over a slice
/// of the table's columns.
///
/// The table is expected to be valid prior to using it with the batch reader.
class ARROW_EXPORT TableBatchReader : public RecordBatchReader {
 public:
  /// \brief Construct a TableBatchReader for the given table
10 changes: 10 additions & 0 deletions cpp/src/parquet/arrow/reader.cc
@@ -1043,6 +1043,16 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
  }
}

// Check all columns has same row-size
if (!columns.empty()) {
  int64_t row_size = columns[0]->length();
  for (size_t i = 1; i < columns.size(); ++i) {
    if (columns[i]->length() != row_size) {
      return ::arrow::Status::Invalid("columns do not have the same size");
    }
  }
}

auto table = ::arrow::Table::Make(batch_schema, std::move(columns));
auto table_reader = std::make_shared<::arrow::TableBatchReader>(*table);
