
GH-34051: [C++] GcsFileSystem lazily starts sequential reads #34052

Merged: 2 commits, Feb 7, 2023

@coryan coryan commented Feb 6, 2023

OpenInputFile() returns an io::RandomAccessFile, which supports both sequential reads and random-access reads. The previous implementation eagerly started a sequential read, even though many applications never use that part of the API. Because GCS has fairly high latency, this can slow down applications that only read data using ReadAt(), including applications that read Parquet files via PyArrow.

Fixes #34051

What changes are included in this PR?

Change the GcsFileSystem class to lazily start the download used to implement the io::InputFile APIs.
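
To illustrate the idea, here is a minimal sketch of a lazily started sequential stream. It is not the code from this PR: `FakeStream`, `LazyInputFile`, and the `Opener` callback are hypothetical names, and an in-memory string stands in for the GCS object. The point is only that `ReadAt()` never opens the stream, while the first `Read()` opens it exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a streaming download; not an Arrow or GCS type.
class FakeStream {
 public:
  explicit FakeStream(std::string data) : data_(std::move(data)) {}

  // Returns up to n bytes and advances the stream position.
  std::string Read(std::size_t n) {
    std::string out = data_.substr(pos_, n);
    pos_ += out.size();
    return out;
  }

 private:
  std::string data_;
  std::size_t pos_ = 0;
};

// Hypothetical file object that defers opening the sequential stream.
class LazyInputFile {
 public:
  using Opener = std::function<std::unique_ptr<FakeStream>()>;

  LazyInputFile(std::string data, Opener opener)
      : data_(std::move(data)), opener_(std::move(opener)) {
    assert(opener_ != nullptr);
  }

  // Random access: served independently, never touches the stream.
  std::string ReadAt(std::size_t offset, std::size_t n) const {
    if (offset >= data_.size()) return {};
    return data_.substr(offset, n);
  }

  // Sequential access: the (expensive) stream is opened on first use only.
  std::string Read(std::size_t n) {
    if (!stream_) stream_ = opener_();
    return stream_->Read(n);
  }

  bool stream_started() const { return stream_ != nullptr; }

 private:
  std::string data_;  // assumption: in-memory copy standing in for the object
  Opener opener_;
  std::unique_ptr<FakeStream> stream_;  // null until the first Read()
};
```

In this sketch, a caller that only uses `ReadAt()` never pays for opening the stream, which is the latency saving described above.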

Are these changes tested?

I think so: the existing tests cover the affected functions.

Are there any user-facing changes?

No.

github-actions bot commented Feb 6, 2023

⚠️ GitHub issue #34051 has been automatically assigned in GitHub to PR creator.

@coryan coryan marked this pull request as ready for review February 6, 2023 20:10
coryan commented Feb 6, 2023

The failure in Python / AMD64 Conda Python 3.9 Sphinx & Numpydoc seems unrelated, or at least I cannot figure out how it relates to the changes in this PR. If the failure was indeed caused by this PR I would appreciate a hint in the right direction.

@kou kou left a comment

+1

The doc CI job failure will be fixed by #34038.

@kou kou merged commit 771c37a into apache:master Feb 7, 2023
@coryan coryan deleted the feat-gcsfs-lazy-start-on-input-file branch February 7, 2023 01:35
ursabot commented Feb 7, 2023

Benchmark runs are scheduled for baseline = 7423f03 and contender = 771c37a. 771c37a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.67% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.65% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 771c37aa ec2-t3-xlarge-us-east-2
[Failed] 771c37aa test-mac-arm
[Finished] 771c37aa ursa-i9-9960x
[Finished] 771c37aa ursa-thinkcentre-m75q
[Finished] 7423f033 ec2-t3-xlarge-us-east-2
[Failed] 7423f033 test-mac-arm
[Finished] 7423f033 ursa-i9-9960x
[Finished] 7423f033 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Feb 10, 2023
…pache#34052)

`OpenInputFile()` returns a `io::RandomAccessFile` which supports sequential reads as well as random access reads. The previous implementation eagerly started a sequential read, but many applications do not use that aspect of the API. Because GCS has fairly high latency, this can slow down applications that are only going to read data using `ReadAt()`. This includes applications using Parquet files via PyArrow.

Fixes apache#34051 

### What changes are included in this PR?

Change the GcsFileSystem class to lazily start the download used to implement the `io::InputFile` APIs.

### Are these changes tested?

I think so: the existing tests cover the affected functions.

### Are there any user-facing changes?

No.

* Closes: apache#34051

Authored-by: Carlos O'Ryan <coryan@google.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this pull request Feb 17, 2023
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Feb 24, 2023
Development

Successfully merging this pull request may close these issues.

[C++] Avoid unnecessary downloads in GcsFileSystem::OpenInputFile()