Put S3 Select for TEXTFILE behind an experimental flag #18102

Merged
merged 3 commits into from
Jul 26, 2023

alexjo2144
Member

@alexjo2144 alexjo2144 commented Jun 30, 2023

Description

Fixes some correctness issues in JSON pushdown related to quote characters, and gates pushdown for TEXTFILE (which uses CSV in S3 Select) behind a separate HiveConfig property.
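
As a rough illustration of the gating described above, the decision could be sketched as follows. The class, enum, and method names are hypothetical and are not taken from the Trino code base; treating any other format as not pushed down is also an assumption.

```java
// Hypothetical sketch: names are illustrative and not taken from Trino.
final class S3SelectGateSketch
{
    enum Format { JSON, TEXTFILE, OTHER }

    private S3SelectGateSketch() {}

    // mainEnabled: the existing S3 Select pushdown toggle (config or session property).
    // experimentalTextfileEnabled: the separate opt-in property this PR adds.
    static boolean shouldPushDown(Format format, boolean mainEnabled, boolean experimentalTextfileEnabled)
    {
        if (!mainEnabled) {
            // Nothing is pushed down unless the main toggle is on.
            return false;
        }
        switch (format) {
            case JSON:
                // JSON stays under the existing toggle only.
                return true;
            case TEXTFILE:
                // TEXTFILE additionally needs the experimental opt-in.
                return experimentalTextfileEnabled;
            default:
                // Assumption: other formats are never pushed down in this sketch.
                return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(shouldPushDown(Format.JSON, true, false));
        System.out.println(shouldPushDown(Format.TEXTFILE, true, false));
        System.out.println(shouldPushDown(Format.TEXTFILE, true, true));
    }
}
```

The point of the two-level check is that turning on the main toggle alone leaves TEXTFILE tables on the regular read path.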

Additional context and related issues

Relates to: #17775
Based on: #17563

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix correctness issue when using S3 Select and query predicates include a quote character. ({issue}`17775`)
* Fix correctness issue when using S3 Select and query predicates include a decimal column. ({issue}`17775`)
* Add an additional opt-in property to enable S3 Select for TEXTFILE tables. ({issue}`17775`)

@cla-bot cla-bot bot added the cla-signed label Jun 30, 2023
@alexjo2144 alexjo2144 requested a review from electrum June 30, 2023 16:07
@alexjo2144 alexjo2144 force-pushed the hive/s3-select-config branch 3 times, most recently from a61f404 to 8017b1f Compare June 30, 2023 19:32
@findepi
Member

findepi commented Jul 3, 2023

/test-with-secrets sha=8017b1f1e61240a2150029cdfa4ed3e907274876

@github-actions

github-actions bot commented Jul 3, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/5444035845

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-config branch from 8017b1f to 38bd9f9 Compare July 5, 2023 16:58
@findepi findepi changed the title Put s3 Select for CSV files behind an experimental flag Put S3 Select for CSV files behind an experimental flag Jul 6, 2023
@findepi
Member

findepi commented Jul 6, 2023

/test-with-secrets sha=38bd9f9a86b5a214573ff3c3359c902980e470b1

@github-actions

github-actions bot commented Jul 6, 2023

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/5472865441

Member

@electrum electrum left a comment

Capitalize "S3 Select" in commit titles

@electrum electrum changed the title Put S3 Select for CSV files behind an experimental flag Put S3 Select for TEXTFILE behind an experimental flag Jul 6, 2023
@alexjo2144 alexjo2144 force-pushed the hive/s3-select-config branch from 38bd9f9 to e72bb89 Compare July 6, 2023 20:05
Member

@findepi findepi left a comment

"Disable S3 Select pushdown on decimal columns"

This requires blind correctness tests.

Member

@findepi findepi left a comment

"Put S3 Select for CSV files behind an experimental flag"

docs/src/main/sphinx/connector/hive-s3.rst (outdated)
docs/src/main/sphinx/connector/hive-s3.rst (outdated)
docs/src/main/sphinx/connector/hive.rst (outdated)
}

@Config("hive.s3select-pushdown.experimental-pushdown-enabled")
@ConfigDescription("Enable query pushdown to TEXTFILE tables using the AWS S3 Select service. Requires 'hive.s3select-pushdown.enabled' also be set.")
Member

Omit "Requires 'hive.s3select-pushdown.enabled' also be set." here. It's kind of obvious.

Also, it's possible to enforce this using an @AssertTrue annotation (add a test, since recent changes from javax to jakarta broke some validations).

Member Author

> Omit "Requires 'hive.s3select-pushdown.enabled' also be set." here. It's kind of obvious.

Sure

> Also, it's possible to enforce using @AssertTrue annotation (add a test, since recent changes from javax to jakarta broke some validations)

I'd rather not require this, as the main toggle can also be enabled/disabled using a session property.

Member

This makes things more complicated, agreed. So in the code we cannot assume that one is enabled only if the other is.
However, that doesn't mean we should accept catalog configuration that makes no sense. There is no need to, and it is easy to address.

Member Author

@alexjo2144 alexjo2144 Jul 18, 2023

I think it still makes sense to have the main on/off toggle default to off but have textfile pushdown enabled. For example, you may want to test the experimental feature using the session property while keeping it disabled by default.
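
For reference, the consistency check the reviewer has in mind could look roughly like the plain-Java sketch below. The class is a hypothetical stand-in, not Trino's HiveConfig. In a bean-validation setup the check method would carry the @AssertTrue annotation mentioned above; that annotation is left out here so the sketch compiles with the JDK alone. The thread does not say whether such a check was adopted, and the author's objection above is that this exact combination can be legitimate.

```java
// Hypothetical sketch: a plain-Java stand-in for a config class, not Trino's HiveConfig.
final class S3SelectConfigSketch
{
    // Stands in for hive.s3select-pushdown.enabled
    private boolean pushdownEnabled;
    // Stands in for the experimental TEXTFILE pushdown property shown in the snippet above
    private boolean experimentalPushdownEnabled;

    S3SelectConfigSketch setPushdownEnabled(boolean value)
    {
        this.pushdownEnabled = value;
        return this;
    }

    S3SelectConfigSketch setExperimentalPushdownEnabled(boolean value)
    {
        this.experimentalPushdownEnabled = value;
        return this;
    }

    // With bean validation, this method would be annotated with @AssertTrue so that
    // an inconsistent catalog configuration is rejected at startup.
    boolean isExperimentalPushdownConfigValid()
    {
        // Valid unless the experimental flag is on while the main toggle is off.
        return pushdownEnabled || !experimentalPushdownEnabled;
    }
}
```

Under this rule, the configuration the author describes (main toggle off by default, experimental flag on, pushdown switched on per session) would be rejected, which is the trade-off being debated.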

docs/src/main/sphinx/connector/hive-s3.rst
Comment on lines 1938 to 1939
// These two should return a result, but incorrectly return nothing
assertThat(query(withS3SelectPushdown, "SELECT id FROM " + table.getName() + " WHERE string_t ='a,comma'")).returnsEmptyResult();
Member

  • Verify in the test that the results are incorrect: run the query without S3 Select pushdown and assert the actual results, so that the test self-validates the comment "... but incorrectly return ..."

  • We also need a blind correctness test ensuring that when S3 Select is enabled (but the experimental pushdown is not), results are correct.

    • In other words: which test would fail if I simply changed the default value of HiveConfig#s3SelectExperimentalPushdownEnabled? Which test prevents me from enabling it before the correctness problems are fixed?
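
The two requests above can be sketched with in-memory stand-ins. Everything in this example is hypothetical: the "engines" are not Trino's query runner or S3 Select, and the faulty path contains a made-up bug (truncating values at the first comma) chosen only so the checks have something to catch.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: in-memory stand-ins, not Trino's query runner or S3 Select.
final class SelfValidatingTestSketch
{
    // Each row is {id, string_t}.
    static final List<String[]> ROWS = List.of(
            new String[] {"1", "a,comma"},
            new String[] {"2", "plain"});

    // Reference path: filters rows directly, like a query without pushdown.
    static List<String> withoutPushdown(String wanted)
    {
        return ROWS.stream()
                .filter(row -> row[1].equals(wanted))
                .map(row -> row[0])
                .collect(Collectors.toList());
    }

    // Deliberately faulty stand-in for a pushdown path (made-up bug: values are cut at the first comma).
    static List<String> withFaultyPushdown(String wanted)
    {
        return ROWS.stream()
                .filter(row -> row[1].split(",")[0].equals(wanted))
                .map(row -> row[0])
                .collect(Collectors.toList());
    }

    // The faulty path is only taken when the experimental flag is on.
    static List<String> run(String wanted, boolean experimentalEnabled)
    {
        return experimentalEnabled ? withFaultyPushdown(wanted) : withoutPushdown(wanted);
    }

    static void check(boolean condition, String message)
    {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args)
    {
        String predicate = "a,comma";
        // 1. Pin down the expected answer using the path without pushdown.
        check(withoutPushdown(predicate).equals(List.of("1")), "baseline should return id 1");
        // 2. Record the known-incorrect behavior, so the "incorrectly returns nothing" comment is verified.
        check(run(predicate, true).isEmpty(), "faulty pushdown is expected to return nothing");
        // 3. Blind correctness check: with the experimental flag off, results match the baseline.
        check(run(predicate, false).equals(withoutPushdown(predicate)), "flag off must match baseline");
    }
}
```

In this sketch, check 3 is the one that would start failing if the flag's default were flipped to on while the bug is still present.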

Member Author

Thanks. Adding the check from the first bullet exposed a bug in the native textfile reader implementation, so I submitted a separate issue: #18215

Member Author

> We also need a blind correctness test ensuring that when S3Select is enabled (but the experimental pushdown is not), results are correct.

Added a separate test class for this, since the experimental flag is not configurable by a session property.

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-config branch from e72bb89 to 7d4dad8 Compare July 7, 2023 19:49
MinIO tests produced the correct results; however, tests against a real
S3 bucket did not.
@alexjo2144 alexjo2144 force-pushed the hive/s3-select-config branch from 7d4dad8 to 3557abf Compare July 10, 2023 18:03
@electrum
Member

Can you check the test failures?

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-config branch from 3557abf to d5aea6b Compare July 10, 2023 20:46
@alexjo2144
Member Author

Yeah, I forgot to check my RST formatting. Should be good this time.

@alexjo2144 alexjo2144 requested a review from findepi July 14, 2023 19:02
@findepi
Member

findepi commented Jul 21, 2023

/test-with-secrets sha=c65d4ba9834ed5433ba23533f35d1b0eb35cd641

@github-actions

github-actions bot commented Jul 21, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/5622741842

S3 Select queries on CSV files have been shown to have correctness
problems. Pushdown for JSON files can still be enabled/disabled using the
existing config and session properties.
@alexjo2144 alexjo2144 force-pushed the hive/s3-select-config branch from c65d4ba to eeda594 Compare July 21, 2023 19:22
@alexjo2144
Member Author

Test failure was because I forgot to update one of the tests when I changed the config property name...

@alexjo2144
Member Author

@electrum or @ebyhr can I get another test with secrets kicked off?

@electrum
Member

/test-with-secrets sha=eeda5941b6e05f7ec28f4d07adbf7c3b146eca7c

@github-actions

github-actions bot commented Jul 24, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/5650768931

@alexjo2144
Member Author

The failure is unrelated; it had to do with #18227

@electrum electrum merged commit 406861a into trinodb:master Jul 26, 2023
@github-actions github-actions bot added this to the 423 milestone Jul 26, 2023
@@ -1799,14 +1800,14 @@ public void testS3SelectPushdown(String tableProperties)
.setCatalogSessionProperty("hive", "insert_existing_partitions_behavior", "APPEND")
Member

Now that this class has grown beyond its initial purpose, this should be the class default, and the insert-overwrite tests should pick OVERWRITE.
