GH-34051: [C++] GcsFileSystem lazily starts sequential reads #34052
`OpenInputFile()` returns an `io::RandomAccessFile`, which supports sequential reads as well as random access reads. The previous implementation eagerly started a sequential read, but many applications never use that part of the API. Because GCS has fairly high latency, the eager start can slow down applications that only read data using `ReadAt()`. This includes applications using Parquet files via PyArrow.
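To illustrate the idea, here is a minimal sketch of deferring the sequential stream until the first sequential read. It does not use the Arrow or google-cloud-cpp APIs: `FakeObject`, `FakeStream`, and `LazyInputFile` are hypothetical names invented for this example, and the "round trip" is simulated with a counter rather than a real GCS request.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-in for a remote object. It counts how many sequential
// download streams were started so the cost is observable.
struct FakeObject {
  std::string data;
  int streams_started = 0;
};

// Hypothetical sequential stream. Constructing it simulates the round trip
// that starting a download would cost against a high-latency service.
class FakeStream {
 public:
  explicit FakeStream(FakeObject& object) : object_(object) {
    ++object_.streams_started;
  }

  // Returns up to nbytes starting at the current position and advances it.
  std::string Read(std::size_t nbytes) {
    assert(position_ <= object_.data.size());
    std::string out = object_.data.substr(position_, nbytes);
    position_ += out.size();
    return out;
  }

 private:
  FakeObject& object_;
  std::size_t position_ = 0;
};

// Hypothetical file handle that offers both access styles but only pays for
// the sequential stream when Read() is actually called.
class LazyInputFile {
 public:
  explicit LazyInputFile(FakeObject& object) : object_(object) {}

  // Random access read: served as an independent ranged read, so it never
  // touches (or creates) the sequential stream.
  std::string ReadAt(std::size_t offset, std::size_t nbytes) const {
    if (offset >= object_.data.size()) return std::string();
    return object_.data.substr(offset, nbytes);
  }

  // Sequential read: the stream is created on first use and then reused.
  std::string Read(std::size_t nbytes) {
    if (!stream_) stream_ = std::make_unique<FakeStream>(object_);
    return stream_->Read(nbytes);
  }

 private:
  FakeObject& object_;
  std::unique_ptr<FakeStream> stream_;  // null until the first Read()
};
```

In this sketch, a caller that only uses `ReadAt()` leaves `streams_started` at zero, while the first `Read()` starts exactly one stream that later calls reuse. An eager design would instead construct the stream in the `LazyInputFile` constructor and pay that cost for every opened file.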
The failure in Python / AMD64 Conda Python 3.9 Sphinx & Numpydoc seems unrelated, or at least I cannot figure out how it relates to the changes in this PR. If the failure was indeed caused by this PR, I would appreciate a hint in the right direction.
+1
The doc CI job failure will be fixed by #34038.
Benchmark runs are scheduled for baseline = 7423f03 and contender = 771c37a. 771c37a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…pache#34052)

`OpenInputFile()` returns a `io::RandomAccessFile` which supports sequential reads as well as random access reads. The previous implementation eagerly started a sequential read, but many applications do not use that aspect of the API. Because GCS has fairly high latency, this can slow down applications that are only going to read data using `ReadAt()`. This includes applications using Parquet files via PyArrow.

Fixes apache#34051

### What changes are included in this PR?

Change the GcsFileSystem class to lazily start the download used to implement the `io::InputFile` APIs.

### Are these changes tested?

I think so: the existing tests cover the affected functions.

### Are there any user-facing changes?

No.

* Closes: apache#34051

Authored-by: Carlos O'Ryan <coryan@google.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Fixes #34051

### What changes are included in this PR?

Change the GcsFileSystem class to lazily start the download used to implement the `io::InputFile` APIs.

### Are these changes tested?

I think so: the existing tests cover the affected functions.

### Are there any user-facing changes?

No.
GcsFileSystem::OpenInputFile()
#34051