ClusterAllFailedError on version 4.24.1 #1330

Closed
sundeqvist opened this issue Apr 8, 2021 · 11 comments

sundeqvist commented Apr 8, 2021

Hey! We're using ioredis with our AWS ElastiCache cluster running 3 shards on version 5.0.5. We're making the redis calls through a lambda with quite high traffic, meaning multiple concurrent lambdas running.

I've done some version bumping and I'm seeing the error ClusterAllFailedError: Failed to refresh slots cache intermittently. Through some debugging I've narrowed the culprit down to version 4.24.1 - any earlier version works fine. With DEBUG=ioredis:* set in the lambda env, the ClusterAllFailedError: Failed to refresh slots cache is in most cases followed by these logs:

ioredis:cluster:connectionPool Reset with []
ioredis:cluster:connectionPool Disconnect <ip>:<port> because the node does not hold any slot
ioredis:cluster:connectionPool Remove <ip>:<port> from the pool

When looking at the 4.24.1 commit 8524eea I can tell that code related to this error has been touched - could this fix have introduced unintended issues? Any pointers would be appreciated 👍

luin (Collaborator) commented Apr 8, 2021

Hi @sundeqvist, thanks for raising this issue!

Before 4.24.1, ioredis asked cluster nodes for cluster slot information when connecting, and then periodically after it had connected. If all cluster nodes failed to provide the information (e.g. all nodes were down), ioredis would raise the "Failed to refresh slots cache" error and reconnect to the cluster (printing the debug log Reset with []) if it hadn't connected yet; otherwise (when the refresh ran periodically), it would just ignore the failure.

However, since 4.24.1, ioredis raises the error and reconnects to the cluster even when the cluster has already connected. This change was introduced to make failover detection faster.

For your case, I'd suggest listening to the "node error" event (cluster.on('node error', err => console.error(err))) and seeing which errors cause the issue. This event is emitted every time a cluster node fails to provide slot information.
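
For example, a minimal sketch of that suggestion (the host and port are placeholders, not values from this thread):

const Redis = require("ioredis");

const cluster = new Redis.Cluster([{ host: "127.0.0.1", port: 6379 }]);

// Emitted every time a cluster node fails to provide slot information,
// so the underlying cause (timeout, auth, TLS, ...) becomes visible.
cluster.on("node error", (err) => {
  console.error("node error:", err);
});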

lvkmahesh commented Apr 14, 2021

Hello,
We are also facing the same issue after 4.24.1. We are using an AWS ElastiCache cluster. It happens intermittently; when it does, our clusterRetryStrategy function gets called only once, and we keep getting the "ClusterAllFailedError: Failed to refresh slots cache" error in the error handler. Also, this doesn't happen in all of the service's containers, only in some of them. For us, once it breaks, the retry doesn't work even if the cluster is healthy. Please advise.

this._client = new Redis.Cluster(
  [{ host: redisConfig.host, port: redisConfig.port }],
  {
    clusterRetryStrategy: function (retryCount) {
      console.log('retrying for redis connection')
      return Math.min(100 + retryCount * 2, 5000)
    },
    enableReadyCheck: true
  }
)

this._client.on('ready', () => {
  // able to connect
  console.log('successfully connected to redis, cache')
})

this._client.on('error', (error) => {
  // error while connecting to redis
  console.error('error received from redis, cache', { err: error.message, errStack: error.stack })
})

luin (Collaborator) commented Apr 14, 2021

Hi @lvkmahesh, I just had a talk with @leibale about the issue. Can you add a listener for the "node error" event (just like in the reply I posted above) and post the errors here? We'd like to better understand what's causing the issue.

@sundeqvist (Author)

Thanks for looking at this so quickly. The trace that is emitted in the node error event looks as follows:

Error: timeout
at Object.timeout (/var/task/node_modules/ioredis/built/utils/index.js:159:38)
at Cluster.getInfoFromNode (/var/task/node_modules/ioredis/built/cluster/index.js:660:55)
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:395:19)
at Cluster.refreshSlotsCache (/var/task/node_modules/ioredis/built/cluster/index.js:414:9)
at Timeout._onTimeout (/var/task/node_modules/ioredis/built/cluster/index.js:108:22)
at listOnTimeout (internal/timers.js:554:17)
at processTimers (internal/timers.js:497:7)

artur-ma commented May 2, 2021

Same here after upgrading 4.19.2 => 4.27.1.

We see Redis - error: ClusterAllFailedError: Failed to refresh slots cache. when intensive writes to the Redis cluster start, and then a lot of Error: Cluster isn't ready and enableOfflineQueue options is false.
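
For context, a minimal sketch of the option behind that second error (host and port are placeholders): with the offline queue disabled, commands issued while the cluster isn't ready are rejected immediately instead of being buffered.

const Redis = require("ioredis");

const cluster = new Redis.Cluster([{ host: "127.0.0.1", port: 6379 }], {
  // With the offline queue disabled, commands sent while the cluster is not
  // ready fail right away with "Cluster isn't ready and enableOfflineQueue
  // options is false" instead of waiting for the connection to recover.
  enableOfflineQueue: false,
});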

leibale added a commit to leibale/ioredis that referenced this issue May 3, 2021
luin closed this as completed in aa9c5b1 on May 3, 2021
@alexandrugheorghe

I believe the commit might have failed to build and deploy. @luin Could you have another look?

ioredis-robot pushed a commit that referenced this issue May 4, 2021
## [4.27.2](v4.27.1...v4.27.2) (2021-05-04)

### Bug Fixes

* **cluster:** avoid ClusterAllFailedError in certain cases ([aa9c5b1](aa9c5b1)), closes [#1330](#1330)
@ioredis-robot (Collaborator)

🎉 This issue has been resolved in version 4.27.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

@trademark18

Hi all, I'm on 4.27.6 and I'm having a similar issue where I see this error only when under heavy load:

ERROR [ioredis] Unhandled error event: ClusterAllFailedError: Failed to refresh slots cache.
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:396:31)
at /var/task/node_modules/ioredis/built/cluster/index.js:413:21
at Timeout.<anonymous> (/var/task/node_modules/ioredis/built/cluster/index.js:671:24)
at Timeout.run (/var/task/node_modules/ioredis/built/utils/index.js:156:22)
at listOnTimeout (internal/timers.js:557:17)
at processTimers (internal/timers.js:498:7)

And then the Lambda ends with this error message:

{
  "errorType": "Error",
  "errorMessage": "None of startup nodes is available",
  "stack": [
    "Error: None of startup nodes is available",
    " at Cluster.closeListener (/var/task/node_modules/ioredis/built/cluster/index.js:184:35)",
    " at Object.onceWrapper (events.js:482:28)",
    " at Cluster.emit (events.js:388:22)",
    " at /var/task/node_modules/ioredis/built/cluster/index.js:367:18",
    " at processTicksAndRejections (internal/process/task_queues.js:77:11)"
  ]
}

Here's my Redis.Cluster() code:

cluster = new Redis.Cluster(
[
	{ 'host': redisSecret.host }
], 
{
	dnsLookup: (address, callback) => callback(null, address),
	redisOptions: {
		tls: true,
	},
	clusterRetryStrategy: (times) => {
		const ms = Math.min(100 * times, 2000);
		console.log(`Cluster retry #${times}: Will wait ${ms} ms`);
		return ms;
	}
}
);

Questions:

  1. Can I avoid it entirely?
  2. Since the Lambda errors out, it's going to trigger alarms and such. I'd rather have it just silently attempt to reconnect to the cluster. Is there some way to catch this and then just proceed silently to the retry?

Sorry to comment on a closed issue but this is the only place I've found anyone talking about seeing this error only under heavy load as opposed to just incorrect network configuration or similar.

bkvaiude commented Jun 16, 2021

I need some help drilling down into the following issue; please see the information below for more context.

ClusterAllFailedError: Failed to refresh slots cache.
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:396:31)
at /var/task/node_modules/ioredis/built/cluster/index.js:413:21
at Timeout.<anonymous> (/var/task/node_modules/ioredis/built/cluster/index.js:671:24)
at Timeout.run (/var/task/node_modules/ioredis/built/utils/index.js:156:22)
at listOnTimeout (internal/timers.js:556:17)
at processTimers (internal/timers.js:497:7) {
lastNodeError: Error: timeout
at Object.timeout (/var/task/node_modules/ioredis/built/utils/index.js:159:38)
at Cluster.getInfoFromNode (/var/task/node_modules/ioredis/built/cluster/index.js:668:55)
at tryNode (/var/task/node_modules/ioredis/built/cluster/index.js:402:19)
at Cluster.refreshSlotsCache (/var/task/node_modules/ioredis/built/cluster/index.js:421:9)
at /var/task/node_modules/ioredis/built/cluster/index.js:192:22
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at runNextTicks (internal/process/task_queues.js:66:3)
at listOnTimeout (internal/timers.js:523:9)
at processTimers (internal/timers.js:497:7)
}

I'm facing the same issue: a new lambda version release started throwing a massive number of these errors.

ioredis version 4.27.6
AWS Lambda nodejs12.x
Redis 6.0.5

The AWS Redis cluster metrics look good and healthy, which has also been confirmed by AWS tech support.
I haven't been able to find the root cause of the issue.

Redis Cluster initialization:

this.client = new redis.Cluster([redisConfig], {
  dnsLookup: (address, callback) => callback(null, address),
  slotsRefreshTimeout: 5000,
  slotsRefreshInterval: 1 * 60 * 1000,
});

@vaughandroid

@bkvaiude I looked into this a bit. The "ClusterAllFailedError: Failed to refresh slots cache." error is thrown before ioredis attempts to reconnect to the cluster. In other words, it's a recoverable error. On my team we're tracking a metric for these errors but not treating them as something that needs to be addressed.

Ideally I would like to get a clear signal when it can't manage to reconnect. I thought of doing that using clusterRetryStrategy and failing after n retries, but it appears there are [some issues with that right now](#1062). For now, we're OK living with it as it is.
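
For reference, a rough sketch of that idea; maxRetries is an arbitrary value chosen for illustration, and returning a non-number from clusterRetryStrategy is what tells ioredis to stop reconnecting.

const maxRetries = 10;

const cluster = new Redis.Cluster([{ host: "127.0.0.1", port: 6379 }], {
  clusterRetryStrategy: (times) => {
    if (times > maxRetries) {
      // Stop reconnecting and surface a clear signal that the cluster
      // could not be reached after maxRetries attempts.
      console.error(`giving up on the Redis cluster after ${times} attempts`);
      return null;
    }
    return Math.min(100 * times, 2000);
  },
});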

@trademark18 Again, I think the "Failed to refresh slots cache" errors can be treated as warnings. It looks like something is closing the connection (you can see it originates from Cluster.closeListener) and that's what is terminating your lambda.

bkvaiude commented Jun 22, 2021

Thanks, @vaughandroid for sharing your experience, it is really insightful and helpful.

The problem I faced was pretty basic: because of a wrong tls configuration, we were running into connection issues with AWS Redis.

AWS ElastiCache, cluster mode, without TLS and AUTH.
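
For context, a minimal sketch of the corrected setup, assuming a placeholder endpoint: since this cluster runs without in-transit encryption and without AUTH, no tls or password options are passed; supplying them was what made the connection attempts fail with the generic "Failed to refresh slots cache" error.

const Redis = require("ioredis");

// Hypothetical endpoint for illustration only.
const cluster = new Redis.Cluster(
  [{ host: "my-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com", port: 6379 }],
  {
    dnsLookup: (address, callback) => callback(null, address),
    // No redisOptions.tls and no password here, matching the cluster's
    // configuration (no TLS, no AUTH).
  }
);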

The problematic situation for the developer:

ioredis doesn't provide the right information in the generated error.

When we tried to replicate the issue locally with a trial-and-error approach, we received the same error in each of the following cases:

  • Passing a wrong connection URL
  • Passing invalid credentials (username and password)
  • Passing a wrong tls configuration

As you can see, it is difficult to tell these cases apart from the error details and act on them.
If the right error details were provided, the developer would immediately see what went wrong and could take the necessary steps to fix the issue.

In our case, we suspected Redis itself and started looking into its performance metrics.

@trademark18 I hope ioredis can take this feedback into account and make the needed changes to the error-handling documentation.

Also, handling these errors in the client's error callback means they no longer impact the AWS Lambda function.

Thank you very much!
