Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade kudu client to 1.15.0 #10940

Merged
merged 2 commits into from
Feb 16, 2022

Conversation

grantatspothero
Copy link
Contributor

@grantatspothero grantatspothero commented Feb 3, 2022

Upgrading the kudu client revealed a few problems:

  1. Timeouts to kudu tablets were sometimes occurring during deletes due to this change introduced in the kudu java client in version 1.13.0: apache/kudu@d23ee5d#diff-f1f50409d81052b8f8d7aea7da663c185c704c6206cb0ec901114f4d9ee8c28f
    (see here for the reason why that commit broke the client: https://gerrit.cloudera.org/#/c/18166/)

  2. Timeouts were not failing query execution because the kudu connector was configured to flush operations in the background.

  3. The two combined above meant tests that did deletes sometimes actually did not perform deletes and would fail.

The first commit upgrades the kudu client and additionally explicitly fails trino execution when kudu rpcs timeout. This resolves: #5687

The second commit fixes the problem of one source of kudu timeouts in the 1.15.0 client .

@cla-bot cla-bot bot added the cla-signed label Feb 3, 2022
@grantatspothero
Copy link
Contributor Author

Want to let the CI process run so there is a build showing the timeouts causing query failure.

After that, will push another commit to actually resolve the timeout problems by setting KuduScanToken.KuduScanTokenBuilder.includeTabletMetadata(false)

@grantatspothero
Copy link
Contributor Author

grantatspothero commented Feb 3, 2022

Also if we want to split the kudu client upgrade from the background flushing problem let me know.

There just is not a way to [easily] confirm the background flushing problem with kudu client 1.10.0 since the timeouts never/rarely happen.

@grantatspothero grantatspothero force-pushed the gn/updateKuduVersion1.15.0 branch from de3d4de to d32db3f Compare February 7, 2022 16:08
@@ -103,7 +106,7 @@ private KuduPageSink(

this.table = table;
this.session = clientSession.newSession();
this.session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
this.session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this introduced in the latest version ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this config also existed in 1.10.0.

See the PR description, the only way I could reproduce this bug (#5687) was by upgrading the kudu client to 1.15.0 which triggered timeouts to kudu.

That is why I fixed this bug in this upgrade client PR.

@@ -125,7 +128,10 @@ private KuduPageSink(
}

try {
session.apply(upsert);
OperationResponse response = session.apply(upsert);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract it from the version update ? It looks like OperationResponse is returned in old APIs too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, there is no way to easily verify the fix because you need a very specific sequence of timeouts [no timeouts when initially connecting to kudu, but timeouts during deletes/upserts/etc]. The kudu 1.15.0 client [accidentally] provides those timeouts due to a bug in the client.

See here for the failing kudu tests when upgrading to kudu 1.15.0:
https://github.com/trinodb/trino/runs/5057440792?check_suite_focus=true

And here for the workaround that makes tests pass:
https://github.com/trinodb/trino/runs/5060298096?check_suite_focus=true

If we do not mind having no tests around the change, I can extract this into a separate PR.

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe all needs to be squashed except the last commit.

LGTM

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % doc update.

Please update "Requirements" in Kudu docs to specify minimum supported version as 1.15 now. It might work with older versions but we don't test it now - if we intend to claim otherwise then add a test extending BaseConnectorSmokeTest with an older Kudu version.

@grantatspothero grantatspothero force-pushed the gn/updateKuduVersion1.15.0 branch from 5e2a90c to 6e51247 Compare February 11, 2022 17:11
@grantatspothero
Copy link
Contributor Author

LGTM % doc update.

Please update "Requirements" in Kudu docs to specify minimum supported version as 1.15 now. It might work with older versions but we don't test it now - if we intend to claim otherwise then add a test extending BaseConnectorSmokeTest with an older Kudu version.

Kudu 1.10.0 is pretty old (>2years, released on November 1, 2019) so I just updated the docs to say we only support 1.15.0 or higher

@hashhar
Copy link
Member

hashhar commented Feb 11, 2022

That's fair. 1.13 is oldest supported release anyway according to https://kudu.apache.org/releases.

@grantatspothero
Copy link
Contributor Author

That's fair. 1.13 is oldest supported release anyway according to https://kudu.apache.org/releases.

If we care about supporting 1.13 lmk, I can add an additional set of smoke tests in.

@grantatspothero grantatspothero force-pushed the gn/updateKuduVersion1.15.0 branch from 6e51247 to 07dff60 Compare February 11, 2022 22:30
@hashhar
Copy link
Member

hashhar commented Feb 12, 2022

That's fair. 1.13 is oldest supported release anyway according to https://kudu.apache.org/releases.

If we care about supporting 1.13 lmk, I can add an additional set of smoke tests in.

Would be nice to do if it's not too much work otherwise we can tackle that separately. Please remember to update docs accordingly.

@grantatspothero grantatspothero force-pushed the gn/updateKuduVersion1.15.0 branch from 07dff60 to 5d749ef Compare February 14, 2022 22:27
Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments.

I believe #10953 depends on this PR getting merged first?

Copy link
Contributor

@wendigo wendigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments

@grantatspothero grantatspothero force-pushed the gn/updateKuduVersion1.15.0 branch 3 times, most recently from c29de25 to 796587a Compare February 15, 2022 19:55
Upgrading the kudu client revealed a few problems:
1. Timeouts to kudu tablets were sometimes occurring during deletes due
to a bug in the kudu java client in version 1.13.0.

2. Timeouts were *not* failing query execution because the kudu
connector was configured to flush operations in the background.

3. The two combined above meant tests that did deletes sometimes
actually did not perform deletes and would fail.

This patch upgrades the kudu client, explicitly fails trino execution
when kudu rpcs timeout, and marks unsupported data types from kudu
1.15.0.
Does not do anything in kudu 1.15.0
@grantatspothero grantatspothero force-pushed the gn/updateKuduVersion1.15.0 branch from 796587a to 1e28ba5 Compare February 15, 2022 23:35
Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

{
private static final String KUDU_VERSION = "1.13.0";

public static class TestKuduSmokeTestWithDisabledInferSchema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is not defined as a top-level class?

Copy link
Member

@hashhar hashhar Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #10940 (comment). To keep all tests for same version together.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharing a constant doesn't require nesting classes.
i admire cleverness, but this also means unnecessary class hierarchy, which doesn't help browse the code.

if the paradigm was more frequent in the code base, i would probably get used to it and wouldn't complain.

Copy link
Contributor Author

@grantatspothero grantatspothero Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed depending on how you run the test:
mvn test -Dtest=io.trino.plugin.kudu.KuduLatestConnectorTests
mvn test
some tests will be skipped in the class hierarchy.

(EDIT: BaseConnectorTest has a test to ensure the class name ends in ConnectorTest, for some reason when this is not satisfied for a static inner test class no test failure happens and instead the test gets skipped)

Additionally, when trying to run tests through intellij's UI you can only run the static inner classes (not the top level class).

Seems like the tooling support for static inner test classes is just not good, I'm going to move these to top level classes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Kudu connector bug: kudu insert operation timeout,bug result is success
5 participants