
RedisClusterClient gives much higher load to first healthy node #404

Closed
Dieken opened this issue Sep 14, 2023 · 18 comments · Fixed by #405

@Dieken

Dieken commented Sep 14, 2023

For each command, RedisClusterClient.connect() always calls getSlots(), which sends a CLUSTER SLOTS command to the first healthy node, so that node ends up with very high CPU usage. In my benchmark test with Quarkus, the first healthy node consumes 80% CPU while the other nodes consume 30%. This leads to extra latency and uneven load on the Redis cluster.

The getSlots() result should be cached and refreshed periodically, or updated from MOVED responses sent by the Redis cluster.
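The proposed caching could be sketched roughly like this (a minimal standalone illustration; SlotCache and the fetcher are hypothetical names, not the actual vertx-redis-client code):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch of the proposed caching: fetch the slot map on
// first use, serve cached reads afterwards, and invalidate when the
// cluster reports a MOVED redirection.
class SlotCache<T> {
    private final Supplier<T> fetcher; // would wrap the CLUSTER SLOTS call
    private final AtomicReference<T> cached = new AtomicReference<>();

    SlotCache(Supplier<T> fetcher) {
        this.fetcher = fetcher;
    }

    T get() {
        T value = cached.get();
        if (value == null) {
            value = fetcher.get(); // only on a cache miss, not per command
            cached.compareAndSet(null, value);
            value = cached.get();
        }
        return value;
    }

    void invalidate() { // would be called on a MOVED redirection
        cached.set(null);
    }
}
```

With this shape, CLUSTER SLOTS is sent once per topology change instead of once per command.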

@Dieken Dieken added the bug label Sep 14, 2023
@tsegismont tsegismont added this to the 4.5.0 milestone Sep 14, 2023
@tsegismont
Contributor

Thanks for reporting this @Dieken

@Ladicek would you have some time to take care of this?

@Ladicek
Contributor

Ladicek commented Sep 14, 2023

So there are two ways the Redis client may be used:

  1. obtain the Redis client and use its send() and batch() methods;
  2. obtain the Redis client, from there obtain a RedisConnection using connect(), and use its send() and batch() methods.

The first approach has an obvious disadvantage: for each command, it obtains a new connection. Not a TCP connection, those are pooled, but a connection object that may hold important state. In the case of clustered Redis, that state includes the hash slot distribution. So with the first approach, every command indeed requires sending CLUSTER SLOTS first; with the second approach, that is not the case.

(Looking at the code, I can see that the clustered client handles ASK redirections, but it seems that a MOVED redirection just ends up as an error, and it is a responsibility of the caller to close the connection and reconnect. I might be reading the code wrong, though. I'd have to explore more to be confident.)

I don't know whether it is the Quarkus Redis client that never maintains a RedisConnection and always uses the first approach described above, or whether the Quarkus Redis client exposes both approaches and it is the user who needs to decide. @Dieken, could you please share your Quarkus code so I can check further? Thanks!

@Dieken
Author

Dieken commented Sep 14, 2023

The code is hard to extract, let me describe the call chain:

The @Inject-ed ReactiveRedisDataSource creates ReactiveSortedSetCommandsImpl and ReactiveKeyCommandsImpl instances; all methods in these two classes call AbstractRedisCommands.execute().

AbstractRedisCommands.execute()
   ReactiveRedisDataSourceImpl.execute()

      @Override
      public Uni<Response> execute(Request request) {
        if (connection != null) {
            return connection.send(request);
        }
        return redis.send(request);      // <-- reach here
      }


           ObservableRedis.send()
                RedisClusterClient.send()
                     BaseRedisClient.send()
                           RedisClusterClient.connect()   // reach second way above

So each command from ReactiveSortedSetCommandsImpl and ReactiveKeyCommandsImpl triggers a call to RedisClusterClient.connect(), which obtains a pooled RedisConnection and sends the CLUSTER SLOTS command before the actual command.

I use Quarkus 3.3.0.

@Ladicek
Contributor

Ladicek commented Sep 14, 2023

Yeah, I'm just looking at the Quarkus implementation too. I have never seen that code before, so I might be wrong, but it seems to me that if you use ReactiveRedisDataSource.withConnection(callback), the Redis data source first obtains a Vert.x RedisConnection and uses it to execute all the Redis commands in the callback. Redis commands issued directly on the injected ReactiveRedisDataSource are executed on the Vert.x Redis object, so a RedisConnection is created for each of them. This is unfortunately not documented in https://quarkus.io/guides/redis-reference

So in the short term, you can use either withConnection() if that works for you, or you can inject the Redis client and use that manually (somewhat lower-level).

Longer-term, I don't really know what the Vert.x Redis client's philosophy is. Are the Redis.send() / batch() methods a mere convenience, mostly for throwaway code, and real-world code should obtain a RedisConnection and use it for all operations (maybe reconnecting in case of transient errors)? If so, the Quarkus Redis extension is at odds with that. Or are the Redis.send() / batch() methods the primary way the Redis client should be used, with RedisConnection obtained mostly just for special cases (like Redis transactions)? If so, RedisConnection objects holding client-wide state is wrong, and the state should be held in Redis objects instead. Or are these 2 APIs equivalent? That's probably the same situation as the previous one: client-wide state held in connection objects is still wrong.

Looking at the various Redis*Connection implementations, it seems that the only one holding client-wide state is RedisClusterConnection and the only client-wide state it holds is the hash slot assignment. So maybe we should move that to RedisClusterClient, but I'd first like to hear some expert's opinion on the previous paragraph :-) Maybe @pmlopes or @cescoffier?
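For context on what that client-wide state is: in Redis cluster, every key maps to one of 16384 hash slots, computed as CRC16(key) mod 16384 (CRC16/XMODEM), with only the content of a non-empty {...} hash tag being hashed when one is present. A minimal standalone sketch of that mapping, per the Redis cluster specification (not code from this client):

```java
// Standalone sketch of Redis cluster's key-to-slot mapping: CRC16/XMODEM
// of the key (or of its {...} hash tag, if present and non-empty),
// taken modulo 16384. Illustrative only; not vertx-redis-client code.
class HashSlot {
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    static int slot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) { // non-empty hash tag: hash only its content
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(java.nio.charset.StandardCharsets.UTF_8)) % 16384;
    }
}
```

The hash tag rule is what lets keys like {user1000}.following and {user1000}.followers land in the same slot, so they can be used together in multi-key operations.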

@Dieken
Author

Dieken commented Sep 15, 2023

The high-level ReactiveXxxCommands API has no way to use withConnection() or batch(). I think it's OK to obtain a pooled connection for each command; the problem is that the slots data should belong to RedisClusterClient, as you described above, and it should be cached so that CLUSTER SLOTS is never sent before each Redis command.

@Ladicek
Contributor

Ladicek commented Sep 15, 2023

You said you're @Inject-ing a ReactiveRedisDataSource, which has the withConnection() method, so it certainly is available to you. It may not be useful to you, but that's hard to know for sure without seeing the code. Hence my other recommendation: drop down to the direct Vert.x API (perhaps in the Mutiny variant), which is also available and provides more control.

I overall agree that the hash slot assignment should probably be stored in the RedisClusterClient, and I'll see what it would take to do that, but I'd like to hear from the experts too.

@Dieken
Author

Dieken commented Sep 15, 2023

I don't directly use the low-level API ReactiveRedisDataSource; I use the high-level API ReactiveXxxCommands, which is created from ReactiveRedisDataSource. The high-level API doesn't have a withConnection() method.

@Ladicek
Contributor

Ladicek commented Sep 15, 2023

ReactiveRedisDataSource is not a low-level API, it is the Quarkus high-level API. You said that you use @Inject ReactiveRedisDataSource -- that's what I mean. There's a withConnection() method there.

When I'm referring to low-level APIs, I'm talking about the Vert.x APIs (Redis, RedisConnection).

@Dieken
Author

Dieken commented Sep 15, 2023

My bad, the Quarkus guide does say ReactiveRedisDataSource is a high-level API: https://quarkus.io/guides/redis-reference#apis

@Dieken
Author

Dieken commented Sep 16, 2023

I checked the ReactiveRedisDataSource.withConnection() API again. If I create the RedisXXXCommands instance inside withConnection(), instead of in the constructor of the ReactiveEasy resource class as the Quarkus Redis guide describes, the uneven load becomes proportional to the number of withConnection() calls rather than the number of Redis commands. That's better, but the uneven load still exists; the slots data should still be cached.

@Ladicek
Contributor

Ladicek commented Sep 18, 2023

Yeah, I suspected as much. I'll see what I can do.

@vietj
Contributor

vietj commented Sep 18, 2023

I am wondering whether such Redis connections should be pooled as well by the Vert.x Redis client. WDYT @pmlopes?

@Ladicek
Contributor

Ladicek commented Sep 18, 2023

TCP connections to the Redis server(s) are pooled. The RedisConnection object is more like a "client session" or something, the name is slightly misleading.

@Ladicek
Contributor

Ladicek commented Sep 19, 2023

I'm trying to reproduce this locally, but I'm hitting a connection pooling issue (basically connection pooling doesn't seem to work at all) on the 4.x branch (master works fine), need to figure that out first. Just to let everyone know I'm working on this, though it might take a few days :-)

@Ladicek
Contributor

Ladicek commented Sep 20, 2023

OK, so connection pooling not working at all is #365 (and the fix is #374). This is currently scheduled for 5.0, which in my opinion is wrong: we definitely need to fix this in 4.x, otherwise Redis cluster is basically unusable. I can't see any possible downside to backporting the fix; it's just about adding equals/hashCode to 2 classes that are part of the key used to look up the Endpoint (= connection pool) in the ConnectionManager.
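That failure mode is easy to demonstrate in isolation. The classes below are illustrative, not the actual Vert.x ones: a hash-map-based pool keyed on an object without equals/hashCode falls back to identity comparison, so every lookup with a logically equal key misses and a fresh pool entry is created each time.

```java
import java.util.Objects;

// Illustrative key classes (not the actual vertx-redis-client ones).
// Without equals/hashCode, HashMap compares keys by identity, so a
// logically equal key never hits the existing pool entry.
class KeyWithoutEquals {
    final String host;
    final int port;
    KeyWithoutEquals(String host, int port) { this.host = host; this.port = port; }
}

// With equals/hashCode defined over the key's fields, lookups with
// logically equal keys find the pooled entry as intended.
class KeyWithEquals {
    final String host;
    final int port;
    KeyWithEquals(String host, int port) { this.host = host; this.port = port; }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof KeyWithEquals)) return false;
        KeyWithEquals k = (KeyWithEquals) o;
        return host.equals(k.host) && port == k.port;
    }
    @Override
    public int hashCode() {
        return Objects.hash(host, port);
    }
}
```

This is why adding equals/hashCode to the two key classes is enough to make the Endpoint lookup, and hence pooling, work.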

@Ladicek
Contributor

Ladicek commented Sep 20, 2023

OK, so once I actually make the cluster client behave, this issue is fairly easy to reproduce with this simple Vert.x application:

import java.util.List;

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Promise;
import io.vertx.core.Vertx;
import io.vertx.redis.client.Redis;
import io.vertx.redis.client.RedisAPI;
import io.vertx.redis.client.RedisClientType;
import io.vertx.redis.client.RedisOptions;
import io.vertx.redis.client.RedisReplicas;

public class MainVerticle extends AbstractVerticle {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        vertx.deployVerticle(new MainVerticle());
    }

    @Override
    public void start(Promise<Void> startPromise) {
        Redis redis = Redis.createClient(vertx, new RedisOptions()
                .setType(RedisClientType.CLUSTER)
                .setUseReplicas(RedisReplicas.SHARE)
                .setConnectionString("redis://192.168.1.171"));
        RedisAPI client = RedisAPI.api(redis);

        call(client, "foo", 0);
        call(client, "bar", 0);
        call(client, "baz", 0);
        call(client, "qux", 0);
        call(client, "quux", 0);
        call(client, "corge", 0);
        call(client, "grault", 0);
        call(client, "garply", 0);
        call(client, "waldo", 0);
        call(client, "fred", 0);
        call(client, "plugh", 0);
        call(client, "xyzzy", 0);
        call(client, "thud", 0);

        startPromise.complete();
    }

    private void call(RedisAPI client, String prefix, int last) {
        if (last == 1_000_000) {
            return;
        }
        int x = last + 1;
        client.set(List.of(prefix + x, "" + x))
                .flatMap(response -> {
                    return client.get(prefix + x);
                }).flatMap(response -> {
                    System.out.println(prefix + " --> " + response);
                    return client.set(List.of("__last__" + prefix, "" + x));
                }).onSuccess(response -> {
                    vertx.runOnContext(ignored -> call(client, prefix, x));
                }).onFailure(error -> {
                    System.out.println(error);
                });
    }
}

I'm using a simple 3-node cluster with 3 masters and no replicas. I run an actual virtual machine with 1 CPU and 1024 MB of RAM for each cluster node, so I can actually measure load average and process CPU consumption.

I made an interesting observation: when I tried to even the load on the cluster nodes by issuing the CLUSTER SLOTS command to a random node (instead of always the first), I saw slightly increased load on the other nodes, but the first node had even higher load than before. This suggests that the other nodes forward the command to the first one (or something like that), but I'm no Redis cluster expert and didn't really look into the code. Anyway, this doesn't seem to be a good strategy.

Next, I'll look into storing the hash slot assignment in the RedisClusterClient as mentioned above.

@Ladicek
Contributor

Ladicek commented Sep 20, 2023

OK, storing hash slot assignment in the RedisClusterClient seems to work well. I'll submit a PR against master.

I'll also send 2 PRs to 4.x: backport of #374 and a backport of the fix for this issue.
