apacheGH-37453: [C++][Parquet] Performance fix for WriteBatch (apache#37454)

### Rationale for this change

Reduces the time taken by `TypedColumnWriter::WriteBatch`, which regressed with apache#35230.

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.
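
The pattern described above can be sketched in isolation. This is a minimal illustration, not the Arrow implementation: `SketchProperties`, `SketchColumnWriter`, `DataPageVersion`, the lookup counter and `LookupsAfterBatches` are all hypothetical stand-ins, and the sketch assumes the properties do not change after the writer is constructed, so the flag can safely be evaluated once.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real Parquet types; names are illustrative only.
enum class DataPageVersion { V1, V2 };

class SketchProperties {
 public:
  SketchProperties(DataPageVersion version, std::set<std::string> indexed_paths)
      : version_(version), indexed_paths_(std::move(indexed_paths)) {}

  DataPageVersion data_page_version() const { return version_; }

  // Per-column lookup. The counter exists only to make the number of calls visible.
  bool page_index_enabled(const std::string& path) const {
    ++lookups_;
    return indexed_paths_.count(path) > 0;
  }

  int64_t lookups() const { return lookups_; }

 private:
  DataPageVersion version_;
  std::set<std::string> indexed_paths_;
  mutable int64_t lookups_ = 0;
};

class SketchColumnWriter {
 public:
  SketchColumnWriter(const SketchProperties* properties, std::string path)
      : properties_(properties), path_(std::move(path)) {
    assert(properties_ != nullptr);
    // Evaluated once here (assumption: properties are immutable afterwards).
    pages_change_on_record_boundaries_ =
        properties_->data_page_version() == DataPageVersion::V2 ||
        properties_->page_index_enabled(path_);
  }

  // Called for every batch; now a plain member read instead of a lookup.
  bool pages_change_on_record_boundaries() const {
    return pages_change_on_record_boundaries_;
  }

  // Stand-in for WriteBatch: it only consults the cached flag.
  int64_t WriteBatch(int64_t num_values) {
    if (pages_change_on_record_boundaries()) ++record_aligned_batches_;
    return num_values;
  }

  int64_t record_aligned_batches() const { return record_aligned_batches_; }

 private:
  const SketchProperties* properties_;
  std::string path_;
  bool pages_change_on_record_boundaries_ = false;
  int64_t record_aligned_batches_ = 0;
};

// Returns how many property lookups happen while writing `batches` batches.
int64_t LookupsAfterBatches(int batches) {
  SketchProperties properties(DataPageVersion::V1, {"a.b"});
  SketchColumnWriter writer(&properties, "a.b");
  for (int i = 0; i < batches; ++i) writer.WriteBatch(1);
  return properties.lookups();
}
```

In this sketch the lookup count stays at one however many batches are written, whereas evaluating the expression inside the accessor would perform one lookup per call.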

### Are these changes tested?

This doesn't change behaviour, so it should be covered by existing tests.

### Are there any user-facing changes?

No
* Closes: apache#37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
adamreeve authored and dgreiss committed Feb 17, 2024
1 parent fae14e4 commit 42a0e09
Showing 1 changed file with 5 additions and 2 deletions.
7 changes: 5 additions & 2 deletions cpp/src/parquet/column_writer.cc
@@ -1219,6 +1219,9 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
       page_statistics_ = MakeStatistics<DType>(descr_, allocator_);
       chunk_statistics_ = MakeStatistics<DType>(descr_, allocator_);
     }
+    pages_change_on_record_boundaries_ =
+        properties->data_page_version() == ParquetDataPageVersion::V2 ||
+        properties->page_index_enabled(descr_->path());
   }

   int64_t Close() override { return ColumnWriterImpl::Close(); }
@@ -1386,8 +1389,7 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
   const WriterProperties* properties() override { return properties_; }

   bool pages_change_on_record_boundaries() const {
-    return properties_->data_page_version() == ParquetDataPageVersion::V2 ||
-           properties_->page_index_enabled(descr_->path());
+    return pages_change_on_record_boundaries_;
   }

  private:
@@ -1402,6 +1404,7 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
   DictEncoder<DType>* current_dict_encoder_;
   std::shared_ptr<TypedStats> page_statistics_;
   std::shared_ptr<TypedStats> chunk_statistics_;
+  bool pages_change_on_record_boundaries_;

   // If writing a sequence of ::arrow::DictionaryArray to the writer, we keep the
   // dictionary passed to DictEncoder<T>::PutDictionary so we can check