GH-36819: [R] Use RunWithCapturedR for reading Parquet files #37274

Merged on Sep 5, 2023 (3 commits)


paleolimbot
Member

@paleolimbot paleolimbot commented Aug 21, 2023

Rationale for this change

When we first added RunWithCapturedR to support reading files from R connections, none of the Parquet tests seemed to call R from another thread. Because RunWithCapturedR comes with some complexity, I didn't add it anywhere it wasn't strictly needed. A recent StackOverflow post exposed that reading very large Parquet files does use multiple threads and thus needs RunWithCapturedR.

What changes are included in this PR?

The two most common calls to read a Parquet file in which a user might trigger this failure are now wrapped in RunWithCapturedR.
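
For readers unfamiliar with the pattern, the following is a minimal Python sketch of the general idea as described in this thread: work that may use several threads runs in the background, while the one thread that is allowed to touch the single-threaded runtime stays available to service callbacks. It is an illustration only, not the arrow R package's C++ implementation; `MainThreadExecutor`, `call_on_main`, `run_with_captured_main`, `read_chunk`, and `demo` are hypothetical names invented for this sketch.

```python
import queue
import threading
from concurrent.futures import Future


class MainThreadExecutor:
    """Hypothetical stand-in for a single-threaded runtime such as the R interpreter."""

    def __init__(self):
        self._owner = threading.get_ident()  # only this thread may run callbacks
        self._tasks = queue.Queue()

    def call_on_main(self, fn, *args):
        """Run fn(*args) on the owning thread; safe to call from any thread."""
        if threading.get_ident() == self._owner:
            return fn(*args)
        fut = Future()
        self._tasks.put((fut, fn, args))
        return fut.result()  # block the worker until the owner has run fn

    def run_with_captured_main(self, job):
        """Run job() on a background thread while the owner services callbacks."""
        done = Future()

        def runner():
            try:
                done.set_result(job())
            except BaseException as exc:
                done.set_exception(exc)
            finally:
                self._tasks.put(None)  # sentinel: job finished, stop the loop

        threading.Thread(target=runner, daemon=True).start()
        while (item := self._tasks.get()) is not None:
            fut, fn, args = item
            try:
                fut.set_result(fn(*args))
            except BaseException as exc:
                fut.set_exception(exc)
        return done.result()


def demo():
    executor = MainThreadExecutor()
    main_id = threading.get_ident()
    seen = []  # thread ids on which the "connection" was touched

    def read_chunk(i):
        # Stands in for reading from an R connection: owner thread only.
        seen.append(threading.get_ident())
        return i * 2

    def parallel_read():
        results = [None] * 4

        def worker(i):
            results[i] = executor.call_on_main(read_chunk, i)

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

    values = executor.run_with_captured_main(parallel_read)
    return values, all(tid == main_id for tid in seen)
```

Calling `demo()` returns the four values together with a flag confirming that every `read_chunk` call ran on the owning thread, even though the requests came from worker threads. Without the wrapper, a multithreaded read would call into the single-threaded runtime from a worker thread, which is the kind of failure the rationale describes.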

Are these changes tested?

The changes are tested in the current suite.

Are there any user-facing changes?

No.

@github-actions

⚠️ GitHub issue #36819 has been automatically assigned in GitHub to PR creator.

@paleolimbot paleolimbot marked this pull request as ready for review August 23, 2023 17:18
@nealrichardson
Member

I'm not familiar enough with parquet-cpp to know which methods and under what conditions Parquet file reading is multithreaded--I'm actually surprised it's not all multithreaded. @pitrou @jorisvandenbossche what do you think?

Naively, if ReadTable needs this treatment, I'd expect that ReadRowGroup and ReadColumn would too.

@paleolimbot
Member Author

@ursabot please benchmark

@ursabot

ursabot commented Aug 28, 2023

Benchmark runs are scheduled for commit 775db90. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@paleolimbot
Member Author

I believe the fact that the parquet reader issues reads on the calling (or any non-IO) thread is considered a bug (#30496). Good catch on issuing all read calls in the same way!

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 775db90.

There were 3 benchmark results indicating a performance regression.

The full Conbench report has more details.

@jorisvandenbossche
Member

AFAIK all of the FileReader::ReadTable, ReadRowGroup, etc. methods can read columns in parallel depending on a setting on the reader object (FileReader::set_use_threads, which in pyarrow is exposed as a use_threads keyword in the single-file Parquet reader).

This is for the plain Parquet reader, not for Parquet scanning through dataset, which I think has its own threading logic.

Member

@thisisnic thisisnic left a comment

Thanks @paleolimbot !

@github-actions github-actions bot added the awaiting merge label and removed the awaiting committer review label Sep 1, 2023
@paleolimbot paleolimbot merged commit b5d36e9 into apache:main Sep 5, 2023
11 of 12 checks passed
@paleolimbot paleolimbot removed the awaiting merge label Sep 5, 2023
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b5d36e9.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#37274)

* Closes: apache#36819

Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Successfully merging this pull request may close these issues.

[R] Reading large Parquet files from a seekable connection fails