JetStream scaler should query the stream consumer leader when clustered #3564

rayjanoka · 2022-08-18T18:42:00Z

I went to test drive the new JetStream scaler on my project and it started to have an issue. Eventually I figured out that the NATS monitoring endpoint isn't reporting accurate metrics on stream consumers from all pods in a cluster.

I found that when running a cluster of nats pods, only the stream consumer's leader pod reports the accurate number of messages in the queue. The jetstream scaler would work fine for me at first as long as keda was connected to the nats consumer leader pod, but as soon as I bounced keda's connection to a nats consumer replica pod the num_pending value there counts up but never down, so my deployment just scales up and up and up.

➜ nats consumer info test-stream durable

Cluster Information:
                Name: nats
              Leader: nats-1   <---- Leader Pod
             Replica: nats-0, current, seen 0.33s ago
             Replica: nats-2, current, seen 0.33s ago

# Consumer Leader (accurate)
➜ curl -s "http://nats-1.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 0,

# Replicas (counting up only)
➜ curl -s "http://nats-0.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 6,
➜ curl -s "http://nats-2.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 6,

We are able to discover the jetstream consumer's leader via the existing monitoring endpoint call.

$ curl -s "http://nats.nats.svc.cluster.local:8222/jsz?consumers=true" | jq '.account_details[0].stream_detail[0].consumer_detail[0].cluster.leader'
"nats-0"

To get the accurate count to the scaler I wrote a change to make a 2nd request directly to that consumer leader pod via the headless svc, ex consumer-leader-pod.nats.nats.svc.cluster.local.

NOTE: This fix will only work for clusters who have the same number of pods as stream replicas, ex. 3 pods and 3 stream replicas or 5 pods and 5 stream replicas.

➜ nats stream info test-stream
Information for Stream test-stream

Configuration:

             Subjects: sub
             Replicas: 3   <---- stream replica setting here

this is resolved below #3564 (comment)
If there are less stream replicas than pods, ex. 3 pods and only 1 stream replica, NATS will only display metrics for that consumer on the single pod that the stream is assigned to, the other pods have no record of the consumer at all so it will only find the metric if the k8s svc happens to terminate you to that particular pod. Without any other clues to figure out which pod has the metric I think we'd have to rotate through each NATS pod blindly until we found the pod, not great. (I can document this limitation in the keda-docs for now)

@goku321 for visibility - thanks for contributing this!

I'll send a ticket over to the NATS side as well and see if they can work to provide accurate metrics across all NATS servers in a cluster so no matter what server we connect to we see everything.

Checklist

Commits are signed with Developer Certificate of Origin (DCO - learn more)
Tests have been added
[NA] A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
[NA] A PR is opened to update the documentation on (repo) (if applicable)
Changelog has been updated and is aligned with our changelog requirements

Fixes #3860

Relates to #

goku321 · 2022-08-18T19:26:47Z

@rayjanoka Thank you so much for giving it a test drive! Really appreciate all your efforts :)

Now, I'm thinking if we really need the single server JetStream tests and instead add tests for clustered JetStream. What do you think?

rayjanoka · 2022-08-18T20:04:56Z

@rayjanoka Thank you so much for giving it a test drive! Really appreciate all your efforts :)

Now, I'm thinking if we really need the single server JetStream tests and instead add tests for clustered JetStream. What do you think?

@goku321 nice! I did notice we had separate tests like that for redis and I think it is probably a good idea here as well. I don't have much experience writing tests, and it might take me a while to circle back to this to give those a shot. If you or anyone else would like to take a crack at it that would be fine.

goku321 · 2022-08-21T18:25:30Z

@rayjanoka Sure. I can start with the tests in a week or so. I hope that's okay :)

pkg/scalers/nats_jetstream_scaler.go

rayjanoka · 2022-11-11T16:12:52Z

I'm back on this, I'm going to work on putting this in a function and writing some tests.

rayjanoka · 2022-11-14T20:26:02Z

ok, this is looking good! I did some serious surgery and we're at 96% test coverage! 😆

I'm going to give this a run in my lab and I'll let ya know how it goes.

rayjanoka · 2022-11-15T00:01:05Z

I think this is ready for review, it tested ok in my environment.

De-duplicated code
Lots of new tests
All tests passed!
Coverage:

Let me know if I need to do anything else!

zroubalik

This change is looking good, there are a few leftovers:

could you please create a new issue describing the problem and reference it in the changelog?
the test coverage is great! I think that @goku321 was talking about e2e test that we have, to support clustered setup: Provide scaler for NATS Jetstream #2391

CHANGELOG.md

zroubalik · 2022-11-15T10:00:24Z

/run-e2e nats*
Update: You can check the progress here

rayjanoka · 2022-11-19T08:35:21Z

Updates

new clustered e2e test with 3 nats nodes - takes about 6.5 mins to run
- tests all combinations of stream replicas (1, 2, & 3) in the cluster
lowered the number of messages to process in the standalone test so it is faster
adding all combinations of streams replicas to the e2e test required supporting all the stream replicas. This required a bit more logic back in the main code to fall back to scanning each nats node to blindly discover the leader node when sometimes no stream/consumer context is returned in the initial request to nats (when the response is from a node where the stream is not running).
we are caching the leader once we find it so we only need to make a single request, there are multiple calls to discover in the first run, or if the stream leader moves to another node

Am I able to trigger the e2e test here? or if one of the crew could start it we can see if it works here.

JorTurFer · 2022-11-21T17:39:49Z

/run-e2e nats*
Update: You can check the progress here

JorTurFer · 2022-11-21T17:41:48Z

Am I able to trigger the e2e test here? or if one of the crew could start it we can see if it works here.

No you can't. To trigger them, it's required to be in a specific team in the org. I have already triggered

rayjanoka · 2022-11-23T19:40:52Z

ok, it looks like it worked...I can't see the log, but I see a green check.

I'll take one more look over everything on Monday, then we can do the review and get this finished. thanks!

JorTurFer · 2022-11-23T21:52:06Z

You can find the specific workflow logs following the link in the comment

There, you have to go to the Execute e2e tests job:

And there you only need to check the step:

JorTurFer · 2022-11-23T21:52:33Z

I'll take one more look over everything on Monday, then we can do the review and get this finished. thanks!

Perfect! Thanks for your contribution!

rayjanoka · 2022-11-29T03:08:30Z

You can find the specific workflow logs following the link in the comment

ah hah! of course, the log looks good good!

rayjanoka · 2022-11-29T03:12:28Z

ok, I just rebased, should be good for review and delivery! @goku321 @JorTurFer @zroubalik

JorTurFer · 2022-11-29T07:56:35Z

/run-e2e nats*
Update: You can check the progress here

rayjanoka · 2022-11-29T18:43:03Z

/run-e2e nats* Update: You can check the progress here

LGTM!

the clustered test is long, but I helped a bit overall by cutting 2.5 mins off the standalone run by pushing less messages through in the test, 1000 -> 300.

standalone e2e test: 59 secs
clustered e2e test: 7.3 mins

tomkerkhove · 2022-12-02T08:39:23Z

/run-e2e nats*
Update: You can check the progress here

tomkerkhove · 2022-12-02T08:40:48Z

@JorTurFer @zroubalik Can you re-review this PR please to see if we need to change things before our release on Thursday?
@rayjanoka can you fix the merge conflict please?

rayjanoka · 2022-12-02T22:19:40Z

@JorTurFer @zroubalik Can you re-review this PR please to see if we need to change things before our release on Thursday? @rayjanoka can you fix the merge conflict please?

fixed! thanks @tomkerkhove!

…when clustered Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>

zroubalik · 2022-12-03T17:49:12Z

/run-e2e nats*
Update: You can check the progress here

zroubalik

LGTM

thanks a lot for the contribution!

…when clustered (kedacore#3564) Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>

rayjanoka requested a review from a team as a code owner August 18, 2022 18:42

JorTurFer reviewed Aug 23, 2022

View reviewed changes

pkg/scalers/nats_jetstream_scaler.go Outdated Show resolved Hide resolved

rayjanoka requested a review from JorTurFer September 16, 2022 21:09

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch 5 times, most recently from 3c75054 to 7db998a Compare November 14, 2022 20:23

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch 5 times, most recently from 0c39690 to 2e36647 Compare November 14, 2022 22:22

rayjanoka changed the title ~~JetStream scaler query the stream consumer's leader pod when clustered~~ JetStream scaler use stream consumer leader pod metric when clustered Nov 15, 2022

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 2e36647 to 4d44456 Compare November 15, 2022 00:18

rayjanoka changed the title ~~JetStream scaler use stream consumer leader pod metric when clustered~~ JetStream scaler use stream consumer leader metric when clustered Nov 15, 2022

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 4d44456 to 4516b9c Compare November 15, 2022 03:34

zroubalik reviewed Nov 15, 2022

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

rayjanoka mentioned this pull request Nov 15, 2022

JetStream scaler should query the stream consumer leader when clustered #3860

Closed

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 4516b9c to b5bfecc Compare November 15, 2022 19:38

rayjanoka changed the title ~~JetStream scaler use stream consumer leader metric when clustered~~ JetStream scaler should query the stream consumer leader when clustered Nov 15, 2022

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch 2 times, most recently from 5d57c2c to ad7fd89 Compare November 15, 2022 20:12

rayjanoka closed this Nov 19, 2022

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from af83fe0 to f8ea5dd Compare November 19, 2022 08:14

rayjanoka reopened this Nov 19, 2022

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 7693425 to 1b46ff3 Compare November 19, 2022 08:43

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 1b46ff3 to 122e4e5 Compare November 29, 2022 03:10

rayjanoka requested review from goku321 and zroubalik and removed request for JorTurFer and goku321 November 29, 2022 03:11

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 122e4e5 to 7ed83a5 Compare December 2, 2022 22:18

Add support for JetStream scaler to query the stream consumer leader …

61a844d

…when clustered Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>

rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 7ed83a5 to 61a844d Compare December 2, 2022 23:00

zroubalik approved these changes Dec 5, 2022

View reviewed changes

zroubalik merged commit eb6dd2f into kedacore:main Dec 5, 2022

josephangbc pushed a commit to josephangbc/keda that referenced this pull request Dec 6, 2022

Add support for JetStream scaler to query the stream consumer leader …

249682d

…when clustered (kedacore#3564) Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>

rayjanoka mentioned this pull request May 2, 2023

Add support when JetStream cluster not on kubernetes #4102

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JetStream scaler should query the stream consumer leader when clustered #3564

JetStream scaler should query the stream consumer leader when clustered #3564

rayjanoka commented Aug 18, 2022 •

edited by zroubalik

Loading

goku321 commented Aug 18, 2022

rayjanoka commented Aug 18, 2022

goku321 commented Aug 21, 2022

rayjanoka commented Nov 11, 2022

rayjanoka commented Nov 14, 2022 •

edited

Loading

rayjanoka commented Nov 15, 2022 •

edited

Loading

zroubalik left a comment

zroubalik commented Nov 15, 2022 •

edited by github-actions bot

Loading

rayjanoka commented Nov 19, 2022 •

edited

Loading

JorTurFer commented Nov 21, 2022 •

edited by github-actions bot

Loading

JorTurFer commented Nov 21, 2022

rayjanoka commented Nov 23, 2022 •

edited

Loading

JorTurFer commented Nov 23, 2022

JorTurFer commented Nov 23, 2022

rayjanoka commented Nov 29, 2022 •

edited

Loading

rayjanoka commented Nov 29, 2022

JorTurFer commented Nov 29, 2022 •

edited by github-actions bot

Loading

rayjanoka commented Nov 29, 2022 •

edited

Loading

tomkerkhove commented Dec 2, 2022 •

edited by github-actions bot

Loading

tomkerkhove commented Dec 2, 2022

rayjanoka commented Dec 2, 2022

zroubalik commented Dec 3, 2022 •

edited by github-actions bot

Loading

zroubalik left a comment

JetStream scaler should query the stream consumer leader when clustered #3564

JetStream scaler should query the stream consumer leader when clustered #3564

Conversation

rayjanoka commented Aug 18, 2022 • edited by zroubalik Loading

Checklist

goku321 commented Aug 18, 2022

rayjanoka commented Aug 18, 2022

goku321 commented Aug 21, 2022

rayjanoka commented Nov 11, 2022

rayjanoka commented Nov 14, 2022 • edited Loading

rayjanoka commented Nov 15, 2022 • edited Loading

zroubalik left a comment

Choose a reason for hiding this comment

zroubalik commented Nov 15, 2022 • edited by github-actions bot Loading

rayjanoka commented Nov 19, 2022 • edited Loading

JorTurFer commented Nov 21, 2022 • edited by github-actions bot Loading

JorTurFer commented Nov 21, 2022

rayjanoka commented Nov 23, 2022 • edited Loading

JorTurFer commented Nov 23, 2022

JorTurFer commented Nov 23, 2022

rayjanoka commented Nov 29, 2022 • edited Loading

rayjanoka commented Nov 29, 2022

JorTurFer commented Nov 29, 2022 • edited by github-actions bot Loading

rayjanoka commented Nov 29, 2022 • edited Loading

tomkerkhove commented Dec 2, 2022 • edited by github-actions bot Loading

tomkerkhove commented Dec 2, 2022

rayjanoka commented Dec 2, 2022

zroubalik commented Dec 3, 2022 • edited by github-actions bot Loading

zroubalik left a comment

Choose a reason for hiding this comment

rayjanoka commented Aug 18, 2022 •

edited by zroubalik

Loading

rayjanoka commented Nov 14, 2022 •

edited

Loading

rayjanoka commented Nov 15, 2022 •

edited

Loading

zroubalik commented Nov 15, 2022 •

edited by github-actions bot

Loading

rayjanoka commented Nov 19, 2022 •

edited

Loading

JorTurFer commented Nov 21, 2022 •

edited by github-actions bot

Loading

rayjanoka commented Nov 23, 2022 •

edited

Loading

rayjanoka commented Nov 29, 2022 •

edited

Loading

JorTurFer commented Nov 29, 2022 •

edited by github-actions bot

Loading

rayjanoka commented Nov 29, 2022 •

edited

Loading

tomkerkhove commented Dec 2, 2022 •

edited by github-actions bot

Loading

zroubalik commented Dec 3, 2022 •

edited by github-actions bot

Loading