Fix correctness issues in S3 Select pushdown #17563

Merged: 3 commits, Jul 5, 2023

Conversation

alexjo2144
Member

@alexjo2144 alexjo2144 commented May 18, 2023

Description

The IonSqlQueryBuilder would produce select queries which were not proper transformations of the given TupleDomain, leading to incorrect results when S3 Select was enabled.

When reading JSON files, predicates like x IS NULL or x IS NOT NULL were evaluated as x = '' or x <> ''.

When reading TextFile data, the query builder ignored the table's null_format field, instead assuming that null fields are encoded as the empty string.
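
To make the two cases above concrete, here is a minimal sketch of how a null check could be rendered differently per file format. It is not the PR's actual IonSqlQueryBuilder code: the class NullPredicateSketch, its Format enum, and the nullPredicate and literal helpers are hypothetical names, and the sketch assumes that JSON nulls can be tested with IS NULL directly while text fields must be compared against the table's configured null encoding.

```java
// Hypothetical sketch, not Trino's IonSqlQueryBuilder.
final class NullPredicateSketch
{
    enum Format { JSON, TEXTFILE }

    private NullPredicateSketch() {}

    // Quote a value as a SQL string literal, doubling embedded single quotes.
    // Assumption: no other escaping is needed for the illustration.
    static String literal(String value)
    {
        return "'" + value.replace("'", "''") + "'";
    }

    // Render "column IS NULL" (isNull = true) or "column IS NOT NULL" (isNull = false).
    // nullEncoding stands in for the table's null_format and is only used for TEXTFILE.
    static String nullPredicate(Format format, String column, boolean isNull, String nullEncoding)
    {
        if (format == Format.JSON) {
            // JSON has a real null value, so the predicate is kept as a null check
            // instead of being rewritten into a comparison with ''.
            return column + (isNull ? " IS NULL" : " IS NOT NULL");
        }
        // Text fields are plain strings, so compare against the configured
        // null encoding rather than assuming the empty string.
        return column + (isNull ? " = " : " <> ") + literal(nullEncoding);
    }
}
```

For example, nullPredicate(Format.TEXTFILE, "s._1", true, "\\N") renders a comparison against the string \N, whereas the behavior described above would have compared against the empty string regardless of the table's setting.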

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix incorrect query results when S3 Select is used with `IS NULL` or `IS NOT NULL` predicates.
* Fix incorrect query results when using S3 Select and the table's null format is set.

@cla-bot cla-bot bot added the cla-signed label May 18, 2023
@github-actions github-actions bot added hive Hive connector tests:hive labels May 18, 2023
@alexjo2144 alexjo2144 requested review from electrum and findepi May 18, 2023 22:19
@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch from b91bdd1 to 7101e0d Compare May 18, 2023 22:23
@alexjo2144
Member Author

@findepi mind starting a run with secrets?

@alexjo2144
Member Author

Never mind, let me fix the IonSqlQueryBuilder tests first

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch 2 times, most recently from 69163b9 to f39b46e Compare May 25, 2023 22:38
@alexjo2144
Member Author

alexjo2144 commented May 25, 2023

Updated with more MinIO tests and some refactoring that makes the queries pushed into S3 a little cleaner.

There are still some failures when I run that test against real S3 manually, though:

  • Comparison predicates on decimal types like WHERE decimal_t = 2.2 or WHERE decimal_t <= 2.2 produce incorrect results.
  • The test fails entirely when using the TextFile serde against real S3. It can't even parse the files.

I think these changes are an improvement, but there are still bugs to squash.

@findepi
Member

findepi commented May 26, 2023

We can run these tests against MinIO for developers' convenience, but will CI run them against real S3?
S3 Select pushdown looks like complex functionality, so we shouldn't blindly assume that an S3-compatible store is equivalent to S3.

@alexjo2144
Member Author

I'd like to do both MinIO and S3 tests, but honestly I've been finding enough wrong with the feature that I'm tempted to say we should just take it out.

@findepi
Member

findepi commented May 26, 2023

Agreed. It's hard to take things out, so let's rename the config toggles and session properties to "experimental".

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch from f39b46e to fa43ed3 Compare May 26, 2023 15:52
@alexjo2144
Member Author

@findepi can I get a run with secrets when you get a chance?

@findepi
Member

findepi commented May 26, 2023

/test-with-secrets sha=fa43ed349ed9918bb0341250d97246e27d549043

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch from fa43ed3 to 56a0f20 Compare June 6, 2023 19:58
@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch from 56a0f20 to 9235a10 Compare June 6, 2023 20:11
Member

There is another test file with the same naming: https://github.com/trinodb/trino/blob/master/plugin/trino-hive/src/test/java/io/trino/plugin/hive/s3select/TestS3SelectPushdown.java

Would you please rename this to make sure we don't exclude the tests above?

Member Author

Good catch, thanks!

plugin/trino-hive/pom.xml (outdated, resolved)
@alexjo2144
Member Author

alexjo2144 commented Jun 7, 2023

@dnanuti I think the main issue behind some of these edge cases with whitespace or null encoding is that the S3 Select CSV encoding is mapped in Trino to textfile. This works okay if, as in the test I wrote, the table explicitly defines a field separator, escape character, and null encoding. If those are not explicitly defined, CSV and textfile have different defaults, so the data will not be read correctly.

Handling for CSVs with whitespace or padding could be an issue, but before these changes data doesn't round-trip properly, even when written by Trino and read back by Trino. This is a much bigger problem.
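
The round-trip problem described above can be illustrated with a minimal sketch in which the writer and reader disagree on the null encoding. The class NullEncodingSketch and its encode and decode helpers are hypothetical, escaping is deliberately ignored, and the encodings used in the example (\N for the writer, the empty string for the reader) are assumptions chosen only to show the mismatch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of a delimited-text round trip; not Trino code.
final class NullEncodingSketch
{
    private NullEncodingSketch() {}

    // Write a row as one delimited line, replacing nulls with nullEncoding.
    static String encode(List<String> row, String delimiter, String nullEncoding)
    {
        List<String> fields = new ArrayList<>();
        for (String value : row) {
            fields.add(value == null ? nullEncoding : value);
        }
        return String.join(delimiter, fields);
    }

    // Read a line back, treating any field equal to nullEncoding as null.
    static List<String> decode(String line, String delimiter, String nullEncoding)
    {
        List<String> row = new ArrayList<>();
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        for (String field : line.split(Pattern.quote(delimiter), -1)) {
            row.add(field.equals(nullEncoding) ? null : field);
        }
        return row;
    }
}
```

Encoding the row ("1", null, "") with \N as the null encoding and decoding it with the same encoding returns the original row. Decoding the same line while assuming the empty string means null turns the null into the literal string \N and the empty string into null, which is the kind of silent corruption that matters for IS NULL and IS NOT NULL predicates.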

assertS3SelectQuery("SELECT id FROM " + table.getName() + " WHERE string_t IS NOT NULL", "VALUES 1, 2, 4");

// TODO: Pushdown with equality predicates on decimal types produces incorrect results. https://github.com/trinodb/trino/issues/17775
// assertS3SelectQuery("SELECT id FROM " + table.getName() + " WHERE decimal_t = 2.2", "VALUES 2");
Member Author

@alexjo2144 alexjo2144 Jun 7, 2023

@dnanuti for this query, against a JSON table I've tried:
SELECT s.id, s.decimal_t FROM S3Object s WHERE s.decimal_t IS NOT NULL AND s.decimal_t = 2.20000
or
SELECT s.id, s.decimal_t FROM S3Object s WHERE s.decimal_t IS NOT NULL AND CAST(s.decimal_t AS DECIMAL(10, 5)) = 2.20000

But I get no rows back from either. Here are the table files:
json_table.tar.gz

I'd expect to get back the value 2

Member

Thanks! I'll check tomorrow!

Member

I didn't get the chance to look into this today, as there are other priorities right now. I'll open an issue in our backlog for further investigation. Thanks for calling this out! 👍

@dnanuti
Member

dnanuti commented Jun 7, 2023

There are some Docker tests defined that were running as part of CI after re-enabling Select pushdown, where you can find table creation statements that do not define all the separators you mentioned.

Table creation here:
https://github.com/trinodb/trino/blob/master/plugin/trino-hive-hadoop2/bin/run_hive_s3_tests.sh
Tests implementation here:
https://github.com/trinodb/trino/tree/master/plugin/trino-hive-hadoop2/src/test/java/io/trino/plugin/hive/s3select

Please enhance this if possible with the scenarios that were not covered before.

I'll double-check the encoding tomorrow, but it sounds reasonable.

@alexjo2144
Member Author

@findepi @electrum there are some lingering issues to fix, but they're all preexisting things tracked by #17775

When you get a chance, another round of tests with secrets and review would be great.

@alexjo2144
Member Author

Thanks @findepi , all set

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch from f8a58b8 to 5ca6c5b Compare June 13, 2023 18:31
@alexjo2144
Member Author

Rebased for conflicts.

Member

@findepi findepi left a comment

"fixup! Fix correctness issues in S3 Select pushdown" lgtm

@findepi
Member

findepi commented Jun 14, 2023

/test-with-secrets sha=5ca6c5b9fdc468f3e180138f3f1fb216256d759b

@findepi
Member

findepi commented Jun 14, 2023

I expect the above to fail per #17563 (comment)

@github-actions

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/5268695780

@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch from 5ca6c5b to 8f135cd Compare June 20, 2023 16:43
@findepi
Member

findepi commented Jun 21, 2023

Are the ci / pt (default, suite-delta-lake-oss, ) failures related?

The IonSqlQueryBuilder would produce select queries which were not
proper transformations of the given TupleDomain, leading to incorrect
results when S3 Select was enabled.

When reading JSON files predicates like `x IS NULL` or `x IS NOT NULL`
were evaluated as `x = ''` or `x <> ''`.

When reading TextFile data the query builder ignores the table's
`null_format` field, instead assuming that null fields are encoded as
the empty string.
@alexjo2144 alexjo2144 force-pushed the hive/s3-select-correctness branch from 8f135cd to d41a568 Compare June 26, 2023 16:45
@findepi findepi merged commit 5b19786 into trinodb:master Jul 5, 2023
@github-actions github-actions bot added this to the 421 milestone Jul 5, 2023