Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use aws mysql jdbc driver 1.0.0 HikariCP 3.4.5 only write failover time too long than HikariCP 2.7.5 #209

Closed
zixubingfeng opened this issue May 20, 2022 · 16 comments
Labels
bug Something isn't working

Comments

@zixubingfeng
Copy link

Describe the issue
use aws mysql jdbc driver HikariCP 3.4.5 only write failover time too long than HikariCP 2.7.5.

Driver Version?
1.0.0

Java Version?
java-1.8-openjdk

OS Version?
aws cloud 4.14.209-160.339.amzn2.x86_64

MySQL Version?
5.7.12-log

To Reproduce
use aws mysql jdbc driver 1.0.0 HikariCP 3.4.5,insert data and then aurora execute failover ,check recover time
use aws mysql jdbc driver 1.0.0 HikariCP 2.7.5,insert data and then aurora execute failover ,check recover time
in the end compare two result

Expected behaviour
expected use aws mysql jdbc driver 1.0.0 HikariCP 3.4.5 when happen failover , recover time can be in 30s

Logs
if not recover , when insert data , it will happen an read only exception

@karenc-bq
Copy link
Contributor

Hi @zixubingfeng, thank you for raising this issue. We're currently looking into it and will share more information once we know more about what is going on. Thank you for your patience.

@hsuamz hsuamz added the bug Something isn't working label May 30, 2022
@abedali
Copy link

abedali commented Jun 20, 2022

Hi, Any update on this? with aws jdbc mySQL driver 1.0.0 and Hikari 4.03, it took 3 mins to recover the connection after RDS AZ fail-over.

@sergiyvamz
Copy link
Contributor

Hi @abedali

We're still working on reproducing the issue and finding a solution for it. We'd appreciate if you can provide mode details. That would helps us to narrow down our investigation.

  • Exact description of RDS cluster that you use (cluster type, Single-AZ or Multi-AZ, etc.)
  • a detailed driver log
  • if possible, a code sample that reproduces the issue.

Thank you!

@sergiyvamz
Copy link
Contributor

Hello @zixubingfeng

There's a fix to the issue is available and a snapshot build is ready. Are you able to confirm if the issue is fixed?
#242
https://github.com/awslabs/aws-mysql-jdbc#using-a-snapshot-of-the-driver

@abedali
Copy link

abedali commented Jul 11, 2022

@sergiyvamz Thanks for the update. I'll let you know the feedback once I test using 1.1.1 driver.

@zixubingfeng
Copy link
Author

@abedali I test using aws-mysql-jdbc-1.1.1-20220708.155324-4.jar driver and hikaricp 3.4.5 ,The insertion request still takes 30 minutes to recover。

test code like this:
```
String userId = "08cf149b-64ac-4101-88eb-4227d243564e";
// 多线程调用
Integer threadCount = 50;
ExecutorService executor = Executors.newFixedThreadPool(threadCount);
CountDownLatch countDownLatch = new CountDownLatch(threadCount);
for (int i = 0; i < threadCount; i++) {
executor.submit(() -> {
while (true) {
try {
String userIdNew = "test_" + UUID.randomUUID().toString();
DeviceBindingEntity deviceBindingEntity = new DeviceBindingEntity();
deviceBindingEntity.setUserId(userIdNew);
deviceBindingEntity.setAppName("com.xiaomi.hm.health");
deviceBindingEntity.setDeviceId("123");
deviceBindingEntity.setDeviceType(2);
this.deviceBindingRepository.create(deviceBindingEntity);
logger.info("insert success = " + deviceBindingEntity.getUserId());
} catch (Exception ex) {
logger.error(ex.getCause().getMessage());
}
}
});
}
countDownLatch.await();


@abedali
Copy link

abedali commented Jul 13, 2022

@sergiyvamz Please refer to an update above.
Any idea when does 1.1.1 version of driver available in maven?

@sergiyvamz
Copy link
Contributor

Hello @zixubingfeng

Can you provide driver logs and give more details about your application and libraries/framework that you use?

@sergiyvamz
Copy link
Contributor

Hello @abedali

Unfortunately there's no particular release date available at the moment. We continue working on other issues and the driver improvements and they will be available as snapshot builds.

I'll notify you when a new release is available.

@zixubingfeng
Copy link
Author

zixubingfeng commented Jul 28, 2022

"2022-07-28 17:34:34,255 WARN pool-5-thread-45 o.h.e.j.s.SqlExceptionHelper:137 - SQL Error: 1290, SQLState: HY000
2022-07-28 17:34:34,255 ERROR pool-5-thread-45 o.h.e.j.s.SqlExceptionHelper:142 - The MySQL server is running with the --read-only option so it cannot execute this statement "

I use cn region, jdbc:mysql:aws://xxx.cluster-c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn:3306/xxx?autoReconnect=true&failOverReadOnly=false&useUnicode=true&characterEncoding=UTF-8&useSSL=false&clusterInstanceHostPattern=?.c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn

I am not sure , whether “HY000” of the errorcode is the problem or not.

@sergiyvamz @abedali

image

@sergiyvamz
Copy link
Contributor

Hello @zixubingfeng

Thank you for reaching out. It's hard to say whether HY000 is expected in this case or not. HY000 is quite generic error code and it could mean different causes. Would you mind to provide more details about the issue?

  • What is a scenario you refer? What does your application do?
  • Is the issue you're reporting connected to database cluster failover?
  • Could you provide driver logs to investigate the issue?
  • Do you use Hikari connection pool?
  • What DB cluster configuration do you use?
  • Is there any similarity in the issue mentioned above and the issue you're reporting?

Thank you!

@zixubingfeng
Copy link
Author

@sergiyvamz

  • My application only insert test, , for example use aws mysql jdbc driver 1.0.0 HikariCP 3.4.5 only write failover time too long than HikariCP 2.7.5 #209 (comment)
  • yes,when database cluster failover, I guess this failover scenes can not trigger connection switch ,due to throw the exception of HY000 . but Do not throw exceptions starting with "08".
  • ok, I cannot see any problems , please see awsfailoverdebug.txt
  • yes, use Hikari 3.4.5
  • I use cn region, jdbc:mysql:aws://huami-test.cluster-c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn:3306/basic_services?autoReconnect=true&failOverReadOnly=false&useUnicode=true&characterEncoding=UTF-8&useSSL=false&clusterInstanceHostPattern=?.c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn
  • yes, I am trying find problem , why failover time is so long.

@sergiyvamz
Copy link
Contributor

Hello @zixubingfeng

Thank you for provided details. Based on driver log provided, the failover process works fine. Here are some extracts from the log file.

Line 32899: Fri Jul 29 11:48:18 CST 2022 - [156] - TRACE: [ClusterAwareConnectionProxy] Detected an exception while executing a command.
Line 33474: Fri Jul 29 11:48:24 CST 2022 - [156] - DEBUG: [ClusterAwareConnectionProxy] Starting writer failover procedure.
Line 33727: Fri Jul 29 11:48:25 CST 2022 - [156] - DEBUG: [ClusterAwareWriterFailoverHandler] Successfully connected to the new writer instance: 'huami-test.c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn:3306'
Line 33728: Fri Jul 29 11:48:25 CST 2022 - [156] - DEBUG: [ClusterAwareConnectionProxy] Connected to: software.aws.rds.jdbc.mysql.shading.com.mysql.cj.conf.HostInfo@735b164a :: {host: "huami-test.c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn", port: 3306, hostProperties: {failOverReadOnly=false, logger=software.aws.rds.jdbc.mysql.shading.com.mysql.cj.log.StandardLogger, autoReconnect=true, sslMode=DISABLED, useSSL=false, TOPOLOGY_SERVICE_SERVER_ID=huami-test, dbname=basic_services, TOPOLOGY_SERVICE_SESSION_ID=MASTER_SESSION_ID, clusterInstanceHostPattern=?.c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn, connectTimeout=0, socketTimeout=0, TOPOLOGY_SERVICE_LAST_UPDATE_TIMESTAMP=2022-07-29 03:48:25.768902, characterEncoding=UTF-8, TOPOLOGY_SERVICE_REPLICA_LAG_IN_MILLISECONDS=1.8446744073709552E16}}
Line 33729: Fri Jul 29 11:48:25 CST 2022 - [156] - ERROR: Transaction resolution unknown. Please re-configure session state if required and try restarting transaction.

As you can see the driver successfully reconnected to a new writer node "huami-test" (line 33728) and raise an exception with sqlState 08007. This code means that the failover is successful but transaction state is unknown and a user application needs to handle it. Hikari connection pool catches this exception and mark the connection as broken. As result the connection is evicted and disposed.

2022-07-29 11:48:25,769 WARN pool-5-thread-1 c.z.h.p.ProxyConnection:182 - HikariPool-1 - Connection software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ConnectionImpl@665e54e7 marked as broken because of SQLSTATE(08007), ErrorCode(0) java.sql.SQLException: Transaction resolution unknown. Please re-configure session state if required and try restarting transaction.

When a new connection is needed, Hikari connection pool opens a new connection with a connection string provided by your application. I can assume that you use a cluster writer endpoint as a connection string. The tricky point is that a cluster writer endpoint isn't up-to-date with the current state of the cluster right after database cluster failover is completed. DNS resolution uses stale data and, most probably, points your application to a wrong node. As result your application connects to a "huami-test-reader" node that is a reader. Further attempts to execute any DML statements on this node causes an exception:

java.sql.SQLException
MESSAGE: The MySQL server is running with the --read-only option so it cannot execute this statement

This explanation is based on few assumptions that I would like to confirm.

@zixubingfeng
Copy link
Author

hello @sergiyvamz

-I connect cluster endpoint not writer endpoint, and I not configure override hikari sqlerrorcode, for example: jdbc:mysql:aws://huami-test.cluster-c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn:3306/basic_services?autoReconnect=true&failOverReadOnly=false&useUnicode=true&characterEncoding=UTF-8&useSSL=false&clusterInstanceHostPattern=?.c6w5wba58yqu.rds.cn-northwest-1.amazonaws.com.cn
-I configure override hikari sqlerrorcode, and try it ,but this result still cannot connect aurora and the result of SQL Error is null.

  • another question, I restart application , to try connect db, connection still cannot recover, that why
    restartcannotconnect.txt
    .

@sergiyvamz
Copy link
Contributor

Hello @zixubingfeng

The cluster endpoint you provided is a cluster writer endpoint and it points out to a writer node. The issue with a cluster endpoint is that it isn't accurate after failover at DB cluster gets completed. It takes some time till a cluster endpoints is updated with a new elected writer node. The log you've provided shows precisely this issue. Your application uses a cluster endpoints and it actually connects to a reader node (that was a writer node before DB failover). Attempts to execute DML states on a reader nodes throws an exception:

java.sql.SQLException
MESSAGE: The MySQL server is running with the --read-only option so it cannot execute this statement

@github-actions github-actions bot added the Stale label Sep 5, 2022
@github-actions github-actions bot removed the Stale label Sep 12, 2022
@congoamz
Copy link
Contributor

congoamz commented Sep 22, 2022

The fix for this issue was merged in with this commit and will be included in the next release. I'll close this issue for now since there hasn't been any more discussion. Please reopen if the issue still persists. Thanks you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants