Enabled hive splits for uncompressed CSV files with S3 Select pushdown #13754

Merged 1 commit on Aug 30, 2022

dnanuti (Member) commented Aug 19, 2022

Description

Scan range allows S3 Select to query an uncompressed file at a finer granularity than the entire object by passing a byte range in each SelectObjectContent request. This change enables Hive internal splits for S3 Select by sending scan range requests for uncompressed CSV files.
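
As a rough illustration of why byte-range splits neither drop nor duplicate rows, the sketch below models the documented scan range rule locally: a record is processed by the range that contains its first byte, even if the record extends past the end of that range. `ScanRangeSketch` and `recordsInRange` are hypothetical names, and the in-memory newline scan is a simplification for illustration, not the S3 service or the code in this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ScanRangeSketch {
    // Hypothetical helper: returns the newline-delimited records whose
    // first byte lies in [start, end). A record that starts inside the
    // range is returned whole, even if it ends beyond `end`.
    public static List<String> recordsInRange(byte[] data, long start, long end) {
        List<String> records = new ArrayList<>();
        int recordStart = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = i == data.length;
            if (atEnd || data[i] == '\n') {
                // Skip empty records (for example, after a trailing newline).
                if (i > recordStart && recordStart >= start && recordStart < end) {
                    records.add(new String(data, recordStart, i - recordStart, StandardCharsets.UTF_8));
                }
                recordStart = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Three 4-byte records starting at offsets 0, 4, and 8.
        byte[] csv = "1,a\n2,b\n3,c\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(recordsInRange(csv, 0, 6));
        System.out.println(recordsInRange(csv, 6, 12));
    }
}
```

With the sample data, the range [0, 6) yields the records `1,a` and `2,b` (the second one crosses the boundary but starts inside the range), and [6, 12) yields only `3,c`, so adjacent ranges cover every record exactly once.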

Is this change a fix, improvement, new feature, refactoring, or other?

This PR is a performance optimization for the Hive S3 Select connector with uncompressed CSV input, leveraging the scan range feature of the service. JSON support will be added in a separate PR.
File splitting is configurable on the client side through existing session properties, such as:

SET SESSION hive.max_initial_split_size='5MB';
SET SESSION hive.max_split_size='7MB';
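
To make the effect of a split size concrete, here is a minimal sketch that cuts a file into consecutive byte ranges of at most a given size; each range could then back one scan range request. It assumes fixed-size splits and 1 MB = 1024 * 1024 bytes. `SplitRangesSketch` and `splitRanges` are hypothetical names, and this is not Trino's split generation, which is more involved (for example, it treats initial splits differently from later ones, as the two properties above suggest).

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRangesSketch {
    // Hypothetical helper: returns {start, end} pairs (end exclusive)
    // that cover [0, fileSize) in chunks of at most maxSplitBytes.
    public static List<long[]> splitRanges(long fileSize, long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < fileSize; start += maxSplitBytes) {
            long end = Math.min(start + maxSplitBytes, fileSize);
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L; // assumption: binary megabytes
        for (long[] range : splitRanges(20 * mb, 7 * mb)) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

Under these assumptions, a 20 MB file with a 7 MB split size produces three ranges: 0 to 7 MB, 7 to 14 MB, and 14 to 20 MB.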

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hive S3 Select connector

How would you describe this change to a non-technical end user or system administrator?

Trino returns results faster for uncompressed CSV files when S3 Select pushdown is enabled:
SET SESSION hive.s3_select_pushdown_enabled=true;

Related issues, pull requests, and links

The previous PR, #13417, was accidentally closed by a wrong fork sync.

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Enabled Hive splits for the S3 Select connector by leveraging the scan range feature of the service.

arhimondr (Contributor) left a comment

LGTM % comments

dnanuti requested a review from findinpath on August 24, 2022 13:50
findinpath (Contributor)

nit: Please keep the number of chars per line in the commit detail less than 80 (as described in https://github.com/trinodb/trino/blob/master/.github/DEVELOPMENT.md#format-git-commit-messages)

dnanuti (Member, Author) commented Aug 24, 2022

Totally missed that, thanks a lot for flagging this, updated!

Scan range allows S3 Select to query uncompressed files at a finer granularity
than the entire object, by providing a byte range to SelectObjectContent
requests. This change enables hive internal splits for S3 Select by sending scan
range requests for uncompressed CSV files.
arhimondr merged commit 0b8d11c into trinodb:master on Aug 30, 2022
github-actions bot added this to the 395 milestone on Aug 30, 2022