Migrate to github.com/go-redis/redis #235
Conversation
Codecov Report
@@ Coverage Diff @@
## master #235 +/- ##
==========================================
+ Coverage 83.06% 84.27% +1.20%
==========================================
Files 36 37 +1
Lines 8207 8017 -190
==========================================
- Hits 6817 6756 -61
+ Misses 1055 948 -107
+ Partials 335 313 -22
Oh, wow, that's hard work! And it seems great overall 👍 Removing dependencies and the code that manages various connection modes is awesome. But probably let's collaborate next time before introducing such big changes? Just to make sure we are on the same page about things. In general, the issue was about researching things - we need to prove the migration to go-redis is worth it. I have not looked at the code closely yet – will do at some point; for now, some top-level things:
Redis Cluster can be run with sth like this locally (though to compare performance I suppose it's better to eventually try excluding Docker from the setup):
Hi, thanks for your comment! I'm sorry for submitting an important change without making any notes. In fact, I really wanted to explain things, but my English is so poor that it was hard to express them clearly, so I chose to submit the code first. This PR is only a proof of concept and there is still a lot of work before it is finished. I wanted to send it first and hear your opinion before I go too far down the wrong path. Some to-dos before it's ready for review:
With the current implementation, I ran some benchmarks locally, but most of the results are a bit worse than redigo (~10% slower). I'm trying to find out why and do some optimization, and hopefully we can get a better result.
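For context, a minimal illustrative benchmark sketch of the kind of comparison discussed here (not the actual benchmark behind the numbers above; the address, channel name, and payload are placeholder assumptions) could look like this, measuring a single PUBLISH round trip through go-redis and redigo:

```go
package bench

import (
	"context"
	"testing"

	goredis "github.com/go-redis/redis/v8"
	redigo "github.com/gomodule/redigo/redis"
)

// BenchmarkPublishGoRedis measures one PUBLISH round trip via go-redis.
func BenchmarkPublishGoRedis(b *testing.B) {
	ctx := context.Background()
	rdb := goredis.NewClient(&goredis.Options{Addr: "127.0.0.1:6379"}) // placeholder address
	defer rdb.Close()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := rdb.Publish(ctx, "bench.channel", "payload").Err(); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkPublishRedigo measures the same round trip via redigo.
func BenchmarkPublishRedigo(b *testing.B) {
	conn, err := redigo.Dial("tcp", "127.0.0.1:6379")
	if err != nil {
		b.Fatal(err)
	}
	defer conn.Close()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := conn.Do("PUBLISH", "bench.channel", "payload"); err != nil {
			b.Fatal(err)
		}
	}
}
```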
Hmm, interesting – I expected better performance than with Redigo. Actually, I did some measurements before opening that issue and the results were quite promising; that was the biggest motivation to try the migration. So I'd say looking at benchmarks is the first step here to decide whether it's worth it or not, since the current implementation is otherwise pretty stable, I believe, to be swapping it for sth new.
Here are some benchmark results run on my local machine (goos: darwin). Most numbers are fine, except for
After fixing the
Hi, this is interesting work! I just started my own implementation for my project using https://github.com/rueian/rueidis, which supports the sharded pub/sub feature of Redis 7.
Think the bench results look promising. @j178 actually we are interested not only in latency improvements but also in the number of allocations, since that directly affects CPU usage. I suppose that in

As the next step I think comparing the performance in the Redis Cluster case would be useful.

@tony-pang I used
Sorry, I meant the performance comparison in the Redis Cluster case; updated the comment :)
Will do it 😄 But I want to figure out why the
Tried, works fine for both
That's right. Seems like there is something wrong with my setup.
@FZambia Hi, have you tried running the benchmark multiple times? Like
Yep - reproduced! Could you try #237 - this fixed the issue for me.
Force-pushed from ffcf673 to 3299619.
Latest benchstat results against the master branch (including cluster mode stats):
I think overall it looks good and worth moving forward 👍

As the next step, let's concentrate on timeouts? In general the goal is to avoid calls without a timeout - we need to make sure we never block indefinitely when sending commands or waiting for replies. In the Redigo case we achieve this by having both

Possibly

From time to time I am thinking that all our Engine methods should have context.Context as the first argument (and possibly the Node methods that call Engine), but that's sth we can consider later. Whether using context cancellation/timeouts from the outside is good or not is not obvious to me at this point. Most Go libraries tend to propagate Context in every call where possible, but since we can't really cancel an operation already sent to Redis, I still think global timeouts make sense and are pretty practical.

I will try to look at timeouts too and at the code here more closely during the next days - for now a bit busy with various home tasks :) Many thanks for pushing this forward @j178.
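For illustration only (not code from this PR), a minimal sketch of what global timeouts could look like with go-redis, assuming the v8 import path; the address and duration values are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	// Socket-level timeouts configured once on the client,
	// playing the role of "global" timeouts.
	rdb := redis.NewClient(&redis.Options{
		Addr:         "127.0.0.1:6379", // placeholder address
		DialTimeout:  time.Second,
		ReadTimeout:  time.Second,
		WriteTimeout: time.Second,
	})
	defer rdb.Close()

	// A per-call timeout can additionally be enforced with a context,
	// though a command already written to Redis is not rolled back.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	if err := rdb.Set(ctx, "key", "value", 0).Err(); err != nil {
		fmt.Println("set failed:", err)
	}
}
```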
Force-pushed from 5573178 to 603025d.
Some interesting links I just found: https://redis.uptrace.dev/guide/go-redis-debugging.html#timeouts and https://uptrace.dev/blog/posts/go-context-timeout.html
shard.subCh = make(chan subRequest)
shard.pubCh = make(chan pubRequest)
shard.dataCh = make(chan *dataRequest)
if !shard.useCluster { |
go-redis implements a way to run pipelines in cluster mode too; let's try it out and see whether it works.
What is this way btw? So currently in runDataPipeline we will receive requests with keys belonging to different key slots. So we will construct a pipeline object from them all. Does go-redis do some sort of remapping in this case?
Does go-redis do some sort of remapping in this case?
That's right, go-redis maps commands in a pipeline to different nodes in ClusterClient._processPipeline: https://github.com/go-redis/redis/blob/c561f3ca7e5cf44ce1f1d3ef30f4a10a9c674c8a/cluster.go#L1062
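As a rough illustration of that behavior (a sketch, not code from this PR; the cluster addresses and keys are placeholder assumptions), a single pipeline built against a ClusterClient can carry keys from different slots, and go-redis routes each command to the owning node internally:

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// Placeholder cluster addresses.
	cc := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{":7000", ":7001", ":7002"},
	})
	defer cc.Close()

	// Keys "a" and "b" generally hash to different slots; go-redis splits the
	// pipeline per node under the hood (see _processPipeline) and merges results.
	pipe := cc.Pipeline()
	incrA := pipe.Incr(ctx, "a")
	incrB := pipe.Incr(ctx, "b")

	if _, err := pipe.Exec(ctx); err != nil {
		fmt.Println("pipeline error:", err)
	}
	fmt.Println(incrA.Val(), incrB.Val())
}
```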
Hmm 🤔 While this leads to better performance, it makes me think that the go-redis approach of splitting commands by node and attempting to retry the entire pipeline in case of an error on one node may increase the chance of duplicate command invocations. For example:
- We have 2 "publish with history" requests: first to node A, second to node B
- Assuming A is currently not available, like in the process of failover
- Part of the pipeline will be successfully executed on B, but we will get an error since it was failed on A
- All commands in the batch will get an error on the Centrifuge level and possibly will be re-issued by some application code - thus there is a good chance of executing the publish on B again.
I think the scenario above (and some other error scenarios) only has a negative effect for publishing with history at the moment - i.e. when double invocations as a result of a retry on the application level result in duplicate messages sent and duplicate messages saved into a stream.
It seems to me that a better way to manage this is retrying only the failed parts of the pipeline, not the entire one. I think it's properly done in the rueidis library, which has implicit pipelining and gives us granular error management for pieces of the pipeline out of the box. For go-redis, I don't think we can simply use pipelining in Cluster mode given the concerns above ^
attempting to retry the entire pipeline in case of an error on one node
This is not right, I think: _processPipeline records failed commands in failedCmds, and then attempts to rerun only those failed ones, IIUIC: https://github.com/go-redis/redis/blob/c561f3ca7e5cf44ce1f1d3ef30f4a10a9c674c8a/cluster.go#L1104
possibly will be re-issued by some application code - thus there is a good chance to execute publish on B again
This is possible, but I don't see how it would happen in Centrifuge. We collect commands and run them in a pipeline at the redis_shard level; if that fails, we by no means retry the whole batch at this level (since it is dangerous), and the upper level, which knows nothing about the pipelining, can only reissue the single failed command it issued before. So I don't think the issue will happen.
You are right, but I meant sth different. Imagine a Redis Cluster scenario: in this case we only have one Redis shard on the Centrifuge level that operates with the Cluster. Now, let's suppose we have the case I described above:
- We have 2 "publish with history" requests: first to node A, second to node B. They are combined into one pipeline and executed together inside runDataPipeline.
- Assuming A is currently not available, like in the process of failover, and B is healthy.
- go-redis will execute part of the pipeline successfully (on B) and part of the pipeline with an error (on A). But we will get an error from pipe.Exec (cmds, err := pipe.Exec(context.Background()) - line 348 in 400bdb9), since go-redis returns the first observed error: https://github.com/go-redis/redis/blob/c561f3ca7e5cf44ce1f1d3ef30f4a10a9c674c8a/cluster.go#L1107
- All commands in the batch will get an error on the Centrifuge level due to this one error on A - see if err != nil { at line 352 in 400bdb9.
Does it make sense? Or am I missing sth? Possibly we can change Centrifuge code a bit to iterate over the pipeline Exec result and manually extract which commands were successful and which failed, and extract cmd results from the successful ones only.
redis_shard.go
}
}
cmds, err := pipe.Exec(context.Background())
if err == redis.Nil { |
It seems we can miss an error here, as go-redis always returns the first non-nil error when executing a Pipeline, and I have not found that it checks for redis.Nil internally.
This err == redis.Nil is something I don't know how to handle correctly. For Redis scripts that return nothing, like addPresenceSource, Redis will reply with a nil string; go-redis takes it as an error and returns a redis.Nil error.
If we don't ignore the redis.Nil here, then the whole runDataPipeline will exit and rerun; that's a huge performance loss.
If we don't ignore the redis.Nil here, then the whole runDataPipeline will exit and rerun, that's a huge performance loss.
Yes, definitely we should not restart the pipeline. To me it seems like an issue of go-redis, because redis.Nil is not actually an error, so pipe.Exec should not return it as an error at all.
Possibly we should just ignore the error from pipe.Exec and iterate over the returned cmds to handle each error individually 🤔
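As a rough sketch of that idea (illustrative only, not the code eventually pushed in this PR; the keys, address, and error-reporting details are assumptions), ignoring the aggregate error from pipe.Exec and inspecting each command could look like this with go-redis:

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// runPipeline inspects each command in the pipeline instead of relying on the
// aggregate error from Exec, so a redis.Nil (empty reply) or a failure of one
// command does not invalidate the whole batch.
func runPipeline(ctx context.Context, rdb redis.UniversalClient) {
	pipe := rdb.Pipeline()
	pipe.Get(ctx, "maybe-missing-key") // may legitimately return redis.Nil
	pipe.Incr(ctx, "counter")

	cmds, _ := pipe.Exec(ctx) // ignore the aggregate (first observed) error

	for _, cmd := range cmds {
		if err := cmd.Err(); err != nil && err != redis.Nil {
			// Only this command failed; surface the error to its caller.
			fmt.Println("command failed:", cmd.Name(), err)
			continue
		}
		// Success or empty reply - extract the result from the typed cmd here.
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"}) // placeholder address
	defer rdb.Close()
	runPipeline(context.Background(), rdb)
}
```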
Pushed a commit that ignores errors from pipe.Exec and defers the redis.Nil judgment to the upper level (PresenceManager.addPresence and removePresence), where it is known whether redis.Nil is expected or not.
Cool, so this should also effectively solve the concerns in this comment, right?
I think so, the current behavior is in line with this statement:
we can change Centrifuge code a bit to iterate over pipeline Exec result and manually extract which commands were successful and which failed. And extracting cmd results from successful only.
Force-pushed from 73d6364 to 1f62391.
Closes #210