JetStream scaler should query the stream consumer leader when clustered #3564
Conversation
@rayjanoka Thank you so much for giving it a test drive! Really appreciate all your efforts :) Now I'm wondering whether we really need the single-server JetStream tests, or whether we should instead add tests for clustered JetStream. What do you think?
@goku321 nice! I did notice we have separate tests like that for redis, and I think it is probably a good idea here as well. I don't have much experience writing tests, and it might take me a while to circle back to this to give those a shot. If you or anyone else would like to take a crack at it, that would be fine.
@rayjanoka Sure. I can start with the tests in a week or so. I hope that's okay :)
I'm back on this; I'm going to work on putting this in a function and writing some tests.
Force-pushed from 3c75054 to 7db998a
ok, this is looking good! I did some serious surgery and we're at 96% test coverage! 😆 I'm going to give this a run in my lab and I'll let ya know how it goes.
Force-pushed from 0c39690 to 2e36647
Force-pushed from 2e36647 to 4d44456
Force-pushed from 4d44456 to 4516b9c
This change is looking good, there are a few leftovers:
- could you please create a new issue describing the problem and reference it in the changelog?
- the test coverage is great! I think @goku321 was talking about the e2e tests that we have; they should support the clustered setup: Provide scaler for NATS Jetstream #2391
/run-e2e nats*
Force-pushed from 4516b9c to b5bfecc
Force-pushed from 5d57c2c to ad7fd89
Force-pushed from af83fe0 to f8ea5dd
Updates
Am I able to trigger the e2e test here? Or, if one of the crew could start it, we can see if it works here.
Force-pushed from 7693425 to 1b46ff3
/run-e2e nats*
No, you can't. To trigger them, you need to be in a specific team in the org. I have already triggered them.
ok, it looks like it worked... I can't see the log, but I see a green check. I'll take one more look over everything on Monday, then we can do the review and get this finished. thanks!
Perfect! Thanks for your contribution!
ah hah! of course, the log looks good!
Force-pushed from 1b46ff3 to 122e4e5
ok, I just rebased, should be good for review and delivery! @goku321 @JorTurFer @zroubalik
/run-e2e nats*
LGTM! The clustered test is long, but I helped a bit overall by cutting 2.5 minutes off the standalone run by pushing fewer messages through in the test, 1000 -> 300. Standalone e2e test: 59 secs.
/run-e2e nats*
@JorTurFer @zroubalik Can you re-review this PR please to see if we need to change things before our release on Thursday?
Force-pushed from 122e4e5 to 7ed83a5
fixed! thanks @tomkerkhove!
…when clustered Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>
Force-pushed from 7ed83a5 to 61a844d
/run-e2e nats*
LGTM
thanks a lot for the contribution!
…when clustered (kedacore#3564) Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>
I went to test drive the new JetStream scaler on my project and ran into an issue. Eventually I figured out that the NATS monitoring endpoint isn't reporting accurate stream consumer metrics from all pods in a cluster.
I found that when running a cluster of NATS pods, only the stream consumer's leader pod reports the accurate number of messages in the queue. The JetStream scaler would work fine for me at first as long as KEDA was connected to the NATS consumer leader pod, but as soon as I bounced KEDA's connection to a NATS consumer replica pod, the `num_pending` value there counted up but never down, so my deployment just scaled up and up and up.

```
➜ nats consumer info test-stream durable
Cluster Information:
  Name: nats
  Leader: nats-1   <---- Leader Pod
  Replica: nats-0, current, seen 0.33s ago
  Replica: nats-2, current, seen 0.33s ago
```
We are able to discover the JetStream consumer's leader via the existing monitoring endpoint call.
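To illustrate, here is a minimal Go sketch of that discovery step, not the scaler's actual code. The port, endpoint path, and JSON field names are assumptions inferred from the `nats consumer info` output above (the real `jsz` response nests cluster info more deeply).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Simplified shapes for illustration; the real jsz response nests
// cluster info under account and stream details. Field names here
// are assumptions.
type clusterInfo struct {
	Name   string `json:"name"`
	Leader string `json:"leader"`
}

type consumerInfo struct {
	Cluster clusterInfo `json:"cluster"`
}

// consumerLeader queries a NATS monitoring endpoint and returns the
// server name of the consumer's leader, e.g. "nats-1".
func consumerLeader(monitoringURL string) (string, error) {
	resp, err := http.Get(monitoringURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var info consumerInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return "", err
	}
	return info.Cluster.Leader, nil
}

func main() {
	// Hypothetical in-cluster monitoring URL (8222 is the default NATS
	// monitoring port).
	leader, err := consumerLeader("http://nats.nats.svc.cluster.local:8222/jsz?consumers=true")
	if err != nil {
		panic(err)
	}
	fmt.Println("consumer leader:", leader)
}
```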
To get the accurate count to the scaler, I wrote a change that makes a second request directly to that consumer leader pod via the headless service, e.g. `consumer-leader-pod.nats.nats.svc.cluster.local` (see the sketch below). NOTE: This fix will only work for clusters that have the same number of pods as stream replicas, e.g. 3 pods and 3 stream replicas, or 5 pods and 5 stream replicas.
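A minimal sketch of how that second request could be addressed, assuming a headless service named `nats` in the `nats` namespace and the default monitoring port 8222; all names here are illustrative, not the scaler's actual code.

```go
package main

import "fmt"

// leaderMonitoringURL builds the consumer leader pod's monitoring URL
// through the headless service, following the Kubernetes pattern
// <pod>.<service>.<namespace>.svc.cluster.local. The service name,
// namespace, and port 8222 are illustrative assumptions.
func leaderMonitoringURL(leaderPod, service, namespace string) string {
	return fmt.Sprintf("http://%s.%s.%s.svc.cluster.local:8222/jsz?consumers=true",
		leaderPod, service, namespace)
}

func main() {
	// e.g. "nats-1", as discovered from the cluster info in the first response.
	fmt.Println(leaderMonitoringURL("nats-1", "nats", "nats"))
	// -> http://nats-1.nats.nats.svc.cluster.local:8222/jsz?consumers=true
}
```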
The limitation above is resolved below: #3564 (comment)
If there are fewer stream replicas than pods, e.g. 3 pods and only 1 stream replica, NATS will only display metrics for that consumer on the single pod the stream is assigned to. The other pods have no record of the consumer at all, so the scaler will only find the metric if the k8s svc happens to route you to that particular pod. Without any other clues to figure out which pod has the metric, I think we'd have to rotate through each NATS pod blindly until we found it, which isn't great. (I can document this limitation in the keda-docs for now.)
@goku321 for visibility - thanks for contributing this!
I'll send a ticket over to the NATS side as well and see if they can work to provide accurate metrics across all NATS servers in a cluster, so that no matter which server we connect to, we see everything.
Checklist
Fixes #3860
Relates to #