-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Parquet] Performance regression in TypedColumnWriter::WriteBatch #37453
Comments
Oh I see, I will try to fix this |
Hi @mapleFU, no we haven't enabled the page index and we aren't using DATA_PAGE_V2, but I don't think your comment is correct,
I have tested a fix that moves the calculation of this to the constructor and that fixes the performance regression, so I was planning on making a PR for this soon but was just working on getting the tests to run locally. |
My comment looks a bit silly now after your edit 😃. I had already tested a fix for this so have opened a PR with my change at #37454, but I'm happy for you to fix this instead if you see a better way as I'm not super familiar with the codebase. |
Oh you've already submit a patch, I'm just trying to do the same thing, thanks a lot! |
### Rationale for this change Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with #35230 ### What changes are included in this PR? This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`. ### Are these changes tested? This doesn't change behaviour so should be covered by existing tests. ### Are there any user-facing changes? No * Closes: #37453 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…#37454) ### Rationale for this change Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with apache#35230 ### What changes are included in this PR? This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`. ### Are these changes tested? This doesn't change behaviour so should be covered by existing tests. ### Are there any user-facing changes? No * Closes: apache#37453 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…#37454) ### Rationale for this change Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with apache#35230 ### What changes are included in this PR? This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`. ### Are these changes tested? This doesn't change behaviour so should be covered by existing tests. ### Are there any user-facing changes? No * Closes: apache#37453 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Describe the bug, including details regarding any error messages, version, and platform.
When running our benchmarks for ParquetSharp against the new Arrow 13.0.0 release, one of the benchmarks was twice as slow compared to Arrow 12.0.1. Digging into this further, I found that the change that caused this slow down was #35230, and it appears that the reason for the slowdown is this particular benchmark makes a lot of calls to
WriteBatch
with very small batches, and there is now a lot of extra time spent looking upproperties_->page_index_enabled(descr_->path())
inpages_change_on_record_boundaries()
.It looks to me like the column properties shouldn't change one a
TypedColumnWriter
is constructed, so it should be fine to compute this value once on construction to fix this performance regression.Component(s)
Parquet
The text was updated successfully, but these errors were encountered: