Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Performance regression in TypedColumnWriter::WriteBatch #37453

Closed
adamreeve opened this issue Aug 30, 2023 · 4 comments · Fixed by #37454
Closed

[C++][Parquet] Performance regression in TypedColumnWriter::WriteBatch #37453

adamreeve opened this issue Aug 30, 2023 · 4 comments · Fixed by #37454

Comments

@adamreeve
Copy link
Contributor

adamreeve commented Aug 30, 2023

Describe the bug, including details regarding any error messages, version, and platform.

When running our benchmarks for ParquetSharp against the new Arrow 13.0.0 release, one of the benchmarks was twice as slow compared to Arrow 12.0.1. Digging into this further, I found that the change that caused this slow down was #35230, and it appears that the reason for the slowdown is this particular benchmark makes a lot of calls to WriteBatch with very small batches, and there is now a lot of extra time spent looking up properties_->page_index_enabled(descr_->path()) in pages_change_on_record_boundaries().

It looks to me like the column properties shouldn't change one a TypedColumnWriter is constructed, so it should be fine to compute this value once on construction to fix this performance regression.

Component(s)

Parquet

@mapleFU
Copy link
Member

mapleFU commented Aug 30, 2023

Oh I see, I will try to fix this

@adamreeve
Copy link
Contributor Author

Hi @mapleFU, no we haven't enabled the page index and we aren't using DATA_PAGE_V2, but I don't think your comment is correct, page_index_enabled is called when data page V2 is not used, as there is an OR condition:

  bool pages_change_on_record_boundaries() const {
    return properties_->data_page_version() == ParquetDataPageVersion::V2 ||
           properties_->page_index_enabled(descr_->path());
  }

I have tested a fix that moves the calculation of this to the constructor and that fixes the performance regression, so I was planning on making a PR for this soon but was just working on getting the tests to run locally.

@adamreeve
Copy link
Contributor Author

My comment looks a bit silly now after your edit 😃. I had already tested a fix for this so have opened a PR with my change at #37454, but I'm happy for you to fix this instead if you see a better way as I'm not super familiar with the codebase.

@mapleFU
Copy link
Member

mapleFU commented Aug 30, 2023

Oh you've already submit a patch, I'm just trying to do the same thing, thanks a lot!

@kou kou changed the title Performance regression in TypedColumnWriter::WriteBatch [C++][Parquet] Performance regression in TypedColumnWriter::WriteBatch Aug 30, 2023
pitrou pushed a commit that referenced this issue Aug 30, 2023
### Rationale for this change

Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with #35230 

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.

### Are these changes tested?

This doesn't change behaviour so should be covered by existing tests.

### Are there any user-facing changes?

No
* Closes: #37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 14.0.0 milestone Aug 30, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…#37454)

### Rationale for this change

Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with apache#35230 

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.

### Are these changes tested?

This doesn't change behaviour so should be covered by existing tests.

### Are there any user-facing changes?

No
* Closes: apache#37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…#37454)

### Rationale for this change

Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with apache#35230 

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.

### Are these changes tested?

This doesn't change behaviour so should be covered by existing tests.

### Are there any user-facing changes?

No
* Closes: apache#37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants