Retry Databricks query for 503 retry after without Retry-After header #14392

Closed

ebyhr
Member

ebyhr commented Sep 30, 2022

Description

Fixes #14391

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

// Databricks JDBC driver retries operations that receive HTTP 503 responses
// if the server response is returned with Retry-After headers by default.
// Following policy retries only once when received 'HTTP retry after response' without Retry-After headers.

Member

remove this line so that the comment is close to the entity it documents

.handleIf(throwable -> throwable.getMessage().contains("HTTP Response code: 502"))
.withBackoff(1, 10, ChronoUnit.SECONDS)
.withMaxRetries(60)
.onRetry(event -> log.warn(event.getLastFailure(), "Query failed on attempt %d, will retry.", event.getAttemptCount()));

// Databricks JDBC driver retries operations that receive HTTP 503 responses
// if the server response is returned with Retry-After headers by default.
// Following policy retries only once when received 'HTTP retry after response' without Retry-After headers.
Member

So we want to retry "even more" than the driver does by itself, right?
I.e. we retry on top of driver's retries?

  • let's make it explicit in the comment
  • is there a way to configure the driver to retry for a longer time than it does by default?

Member Author

ebyhr commented Sep 30, 2022

The JDBC driver does not retry when the response lacks a Retry-After header. The added retry policy covers that case.

We can increase TemporarilyUnavailableRetryTimeout (900s by default), but that setting only applies to failures whose response includes a Retry-After header, so increasing it won't resolve the recent flakiness.
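
A minimal sketch of the behavior described in this reply, written against the Java standard library only rather than the Failsafe `RetryPolicy` used in the PR. The class and method names are hypothetical, and recognizing the failure by matching the message fragment from the driver error reported in this thread is an assumption, not a documented driver contract.

```java
import java.util.concurrent.Callable;

// Hypothetical helper; not part of Trino, tempto, or the Databricks driver.
final class Databricks503Retry {
    // Fragment copied from the driver error message reported in this thread.
    static final String NO_RETRY_AFTER =
            "HTTP retry after response received with no Retry-After header";

    private Databricks503Retry() {}

    // Returns true when any exception in the cause chain carries the fragment.
    static boolean isRetryAfterWithoutHeader(Throwable failure) {
        for (Throwable current = failure; current != null; current = current.getCause()) {
            String message = current.getMessage();
            if (message != null && message.contains(NO_RETRY_AFTER)) {
                return true;
            }
        }
        return false;
    }

    // Runs the action and retries exactly once if the first attempt fails
    // with the "no Retry-After header" error. Any other failure is rethrown.
    static <T> T callWithSingleRetry(Callable<T> action) throws Exception {
        try {
            return action.call();
        }
        catch (Exception e) {
            if (!isRetryAfterWithoutHeader(e)) {
                throw e;
            }
            return action.call();
        }
    }
}
```

A caller would wrap the query execution, for example `Databricks503Retry.callWithSingleRetry(() -> executor.executeQuery(sql))`, where `executor` and `sql` stand in for whatever the test harness provides.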

ebyhr force-pushed the ebi/delta-databricks-retry branch from 52a6757 to db06e4c Compare September 30, 2022 08:59
// Databricks JDBC driver retries operations that receive HTTP 503 responses
// if the server response is returned WITH Retry-After headers by default.
// Following policy retries only once when received 'HTTP retry after response' WITHOUT Retry-After headers.
RetryPolicy<QueryResult> databricks503RetryPolicy = new RetryPolicy<QueryResult>()
Contributor

Did you happen to spot this issue several times in the builds?

Member Author

ebyhr commented Sep 30, 2022

It seems retrying is not safe for write operations. When the INSERT statement was retried, the actual row count doubled. I will take another look.

tests               | 2022-09-30 23:51:22 WARNING: Query failed on attempt 1, will retry.
tests               | io.trino.tempto.query.QueryExecutionException: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500593) Communication link failure. Failed to connect to server. Reason: HTTP retry after response received with no Retry-After header, error: HTTP Response code: 503, Error message: Unknown.
tests               | 	at io.trino.tempto.query.JdbcQueryExecutor.execute(JdbcQueryExecutor.java:119)
tests               | 	at io.trino.tempto.query.JdbcQueryExecutor.executeQuery(JdbcQueryExecutor.java:84)
tests               | 	at io.trino.tests.product.utils.QueryExecutors$3.lambda$executeQuery$0(QueryExecutors.java:149)
tests               | 	at net.jodah.failsafe.Functions.lambda$get$0(Functions.java:48)
tests               | 	at net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:62)
tests               | 	at net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:62)
tests               | 	at net.jodah.failsafe.Execution.executeSync(Execution.java:129)
tests               | 	at net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376)
tests               | 	at net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:67)
tests               | 	at io.trino.tests.product.utils.QueryExecutors$3.executeQuery(QueryExecutors.java:149)
tests               | 	at io.trino.tests.product.deltalake.TestDeltaLakeWriteDatabricksCompatibility$CaseTestTable.<init>(TestDeltaLakeWriteDatabricksCompatibility.java:366)
tests               | 	at io.trino.tests.product.deltalake.TestDeltaLakeWriteDatabricksCompatibility.testCaseUpdateInPartition(TestDeltaLakeWriteDatabricksCompatibility.java:160)
tests               | 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
tests               | 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
tests               | 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
tests               | 	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
tests               | 	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
tests               | 	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
tests               | 	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
tests               | 	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
tests               | 	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
tests               | 	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
tests               | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
tests               | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
tests               | 	at java.base/java.lang.Thread.run(Thread.java:833)
tests               | Caused by: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500593) Communication link failure. Failed to connect to server. Reason: HTTP retry after response received with no Retry-After header, error: HTTP Response code: 503, Error message: Unknown.
tests               | 	at com.databricks.client.hivecommon.api.HS2Client.handleTTransportException(Unknown Source)
tests               | 	at com.databricks.client.spark.jdbc.DowloadableFetchClient.handleTTransportException(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.api.HS2Client.executeStatementInternal(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.api.HS2Client.executeStatement(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.dataengine.HiveJDBCNativeQueryExecutor.executeRowCountQueryHelper(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.dataengine.HiveJDBCNativeQueryExecutor.execute(Unknown Source)
tests               | 	at com.databricks.client.jdbc.common.SStatement.executeNoParams(Unknown Source)
tests               | 	at com.databricks.client.jdbc.common.BaseStatement.execute(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.jdbc42.Hive42Statement.execute(Unknown Source)
tests               | 	at io.trino.tempto.query.JdbcQueryExecutor.executeQueryNoParams(JdbcQueryExecutor.java:128)
tests               | 	at io.trino.tempto.query.JdbcQueryExecutor.execute(JdbcQueryExecutor.java:112)
tests               | 	... 24 more
tests               | 	Suppressed: java.lang.Exception: Query: INSERT INTO default.update_case_compat_zk3lu03mfzd5 VALUES (1, 1, 0), (2, 2, 0), (3, 3, 1)
tests               | 		at io.trino.tempto.query.JdbcQueryExecutor.executeQueryNoParams(JdbcQueryExecutor.java:136)
tests               | 		... 25 more
tests               | Caused by: com.databricks.client.support.exceptions.ErrorException: [Databricks][DatabricksJDBCDriver](500593) Communication link failure. Failed to connect to server. Reason: HTTP retry after response received with no Retry-After header, error: HTTP Response code: 503, Error message: Unknown.
tests               | 	... 35 more
tests               | Caused by: com.databricks.client.jdbc42.internal.apache.thrift.transport.TTransportException: HTTP retry after response received with no Retry-After header, error: HTTP Response code: 503, Error message: Unknown
tests               | 	at com.databricks.client.hivecommon.HttpRetrySettings.shouldRetry(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.api.HS2ClientWrapper.shouldReexecuteRequest(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.api.HS2ClientWrapper.executeWithRetry(Unknown Source)
tests               | 	at com.databricks.client.hivecommon.api.HS2ClientWrapper.ExecuteStatement(Unknown Source)
tests               | 	... 33 more
tests               | 
presto-master       | 2022-09-30T23:51:26.057+0545	INFO	dispatcher-query-95	io.trino.event.QueryMonitor	TIMELINE: Query 20220930_180624_00327_g8sm2 :: FINISHED :: elapsed 1804ms :: planning 71ms :: waiting 314ms :: scheduling 376ms :: running 628ms :: finishing 729ms :: begin 2022-09-30T23:51:24.252+05:45 :: end 2022-09-30T23:51:26.056+05:45
presto-master       | 2022-09-30T23:51:29.195+0545	INFO	dispatcher-query-59	io.trino.event.QueryMonitor	TIMELINE: Query 20220930_180628_00328_g8sm2 :: FINISHED :: elapsed 419ms :: planning 6ms :: waiting 176ms :: scheduling 256ms :: running 156ms :: finishing 1ms :: begin 2022-09-30T23:51:28.773+05:45 :: end 2022-09-30T23:51:29.192+05:45
tests               | 2022-09-30 23:51:31 INFO: not retrying; @Flaky annotation not present
tests               | 2022-09-30 23:51:31 INFO: FAILURE     /    io.trino.tests.product.deltalake.TestDeltaLakeWriteDatabricksCompatibility.testCaseUpdateInPartition [downpart] (Groups: profile_specific_tests, delta-lake-databricks) took 13.1 seconds
tests               | 2022-09-30 23:51:31 SEVERE: Failure cause:
tests               | org.assertj.core.api.SoftAssertionError: 
tests               | The following 2 assertions failed:
tests               | 1) [Data accessible via Databricks] Expected row count to be <3>, but was <6>; rows=[[2, 2, 0], [2, 2, 0], [3, 3, 1], [3, 3, 1], [1, 0, 0], [1, 0, 0]]
tests               | at TestDeltaLakeWriteDatabricksCompatibility.lambda$assertTable$12(TestDeltaLakeWriteDatabricksCompatibility.java:322)
tests               | 2) [Data accessible via Trino] Expected row count to be <3>, but was <6>; rows=[[1, 0, 0], [2, 2, 0], [2, 2, 0], [3, 3, 1], [1, 0, 0], [3, 3, 1]]
tests               | at TestDeltaLakeWriteDatabricksCompatibility.lambda$assertTable$13(TestDeltaLakeWriteDatabricksCompatibility.java:327)
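
The log above shows the INSERT being retried after the 503 and both assertions then finding 6 rows instead of 3, which suggests the first attempt had already been applied. One possible mitigation, sketched here purely as an illustration and not something proposed in this PR, is to retry only statements that cannot change data. The class name, method name, and keyword list are hypothetical, and the check is deliberately conservative: anything it does not recognize is treated as unsafe.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Trino, tempto, or the Databricks driver.
final class RetrySafety {
    // Leading keywords assumed to be read-only for the purpose of this sketch.
    private static final Set<String> READ_ONLY_KEYWORDS =
            Set.of("SELECT", "SHOW", "DESCRIBE", "EXPLAIN");

    private RetrySafety() {}

    // Returns true only when the statement starts with a known read-only keyword.
    // Writes such as INSERT, UPDATE, DELETE, or MERGE return false, so a retry
    // layer built on this check would not re-run them.
    static boolean isSafeToRetry(String sql) {
        String trimmed = sql.stripLeading();
        if (trimmed.isEmpty()) {
            return false;
        }
        String firstToken = trimmed.split("\\s+", 2)[0].toUpperCase(Locale.ROOT);
        return READ_ONLY_KEYWORDS.contains(firstToken);
    }
}
```

Combining such a check with the retry condition would leave writes to fail fast, at the cost of not fixing the flakiness for write statements.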

Member Author

ebyhr commented Oct 26, 2022

I will look for another solution, as retrying on every 503 causes a different issue.

ebyhr closed this Oct 26, 2022
ebyhr deleted the ebi/delta-databricks-retry branch October 26, 2022 01:28
Development

Successfully merging this pull request may close these issues.

Flaky "Communication link failure" in Databricks tests
3 participants