Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JetStream scaler should query the stream consumer leader when clustered #3564

Merged

Conversation

rayjanoka
Copy link
Contributor

@rayjanoka rayjanoka commented Aug 18, 2022

I went to test drive the new JetStream scaler on my project and it started to have an issue. Eventually I figured out that the NATS monitoring endpoint isn't reporting accurate metrics on stream consumers from all pods in a cluster.

I found that when running a cluster of nats pods, only the stream consumer's leader pod reports the accurate number of messages in the queue. The jetstream scaler would work fine for me at first as long as keda was connected to the nats consumer leader pod, but as soon as I bounced keda's connection to a nats consumer replica pod the num_pending value there counts up but never down, so my deployment just scales up and up and up.

➜ nats consumer info test-stream durable

Cluster Information:
                Name: nats
              Leader: nats-1   <---- Leader Pod
             Replica: nats-0, current, seen 0.33s ago
             Replica: nats-2, current, seen 0.33s ago
# Consumer Leader (accurate)
➜ curl -s "http://nats-1.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 0,

# Replicas (counting up only)
➜ curl -s "http://nats-0.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 6,
➜ curl -s "http://nats-2.nats.nats.svc.cluster.local:8222/jsz?consumers=true" | grep num_pending
              "num_pending": 6,

We are able to discover the jetstream consumer's leader via the existing monitoring endpoint call.

$ curl -s "http://nats.nats.svc.cluster.local:8222/jsz?consumers=true" | jq '.account_details[0].stream_detail[0].consumer_detail[0].cluster.leader'
"nats-0"

To get the accurate count to the scaler I wrote a change to make a 2nd request directly to that consumer leader pod via the headless svc, ex consumer-leader-pod.nats.nats.svc.cluster.local.

NOTE: This fix will only work for clusters who have the same number of pods as stream replicas, ex. 3 pods and 3 stream replicas or 5 pods and 5 stream replicas.

➜ nats stream info test-stream
Information for Stream test-stream

Configuration:

             Subjects: sub
             Replicas: 3   <---- stream replica setting here

this is resolved below #3564 (comment)
If there are less stream replicas than pods, ex. 3 pods and only 1 stream replica, NATS will only display metrics for that consumer on the single pod that the stream is assigned to, the other pods have no record of the consumer at all so it will only find the metric if the k8s svc happens to terminate you to that particular pod. Without any other clues to figure out which pod has the metric I think we'd have to rotate through each NATS pod blindly until we found the pod, not great. (I can document this limitation in the keda-docs for now)

@goku321 for visibility - thanks for contributing this!

I'll send a ticket over to the NATS side as well and see if they can work to provide accurate metrics across all NATS servers in a cluster so no matter what server we connect to we see everything.

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)
  • Tests have been added
  • [NA] A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
  • [NA] A PR is opened to update the documentation on (repo) (if applicable)
  • Changelog has been updated and is aligned with our changelog requirements

Fixes #3860

Relates to #

@rayjanoka rayjanoka requested a review from a team as a code owner August 18, 2022 18:42
@goku321
Copy link
Contributor

goku321 commented Aug 18, 2022

@rayjanoka Thank you so much for giving it a test drive! Really appreciate all your efforts :)

Now, I'm thinking if we really need the single server JetStream tests and instead add tests for clustered JetStream. What do you think?

@rayjanoka
Copy link
Contributor Author

@rayjanoka Thank you so much for giving it a test drive! Really appreciate all your efforts :)

Now, I'm thinking if we really need the single server JetStream tests and instead add tests for clustered JetStream. What do you think?

@goku321 nice! I did notice we had separate tests like that for redis and I think it is probably a good idea here as well. I don't have much experience writing tests, and it might take me a while to circle back to this to give those a shot. If you or anyone else would like to take a crack at it that would be fine.

@goku321
Copy link
Contributor

goku321 commented Aug 21, 2022

@rayjanoka Sure. I can start with the tests in a week or so. I hope that's okay :)

@rayjanoka rayjanoka requested a review from JorTurFer September 16, 2022 21:09
@rayjanoka
Copy link
Contributor Author

I'm back on this, I'm going to work on putting this in a function and writing some tests.

@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch 5 times, most recently from 3c75054 to 7db998a Compare November 14, 2022 20:23
@rayjanoka
Copy link
Contributor Author

rayjanoka commented Nov 14, 2022

ok, this is looking good! I did some serious surgery and we're at 96% test coverage! 😆

I'm going to give this a run in my lab and I'll let ya know how it goes.

@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch 5 times, most recently from 0c39690 to 2e36647 Compare November 14, 2022 22:22
@rayjanoka
Copy link
Contributor Author

rayjanoka commented Nov 15, 2022

I think this is ready for review, it tested ok in my environment.

  • De-duplicated code
  • Lots of new tests
  • All tests passed!
  • Coverage: image

Let me know if I need to do anything else!

@rayjanoka rayjanoka changed the title JetStream scaler query the stream consumer's leader pod when clustered JetStream scaler use stream consumer leader pod metric when clustered Nov 15, 2022
@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 2e36647 to 4d44456 Compare November 15, 2022 00:18
@rayjanoka rayjanoka changed the title JetStream scaler use stream consumer leader pod metric when clustered JetStream scaler use stream consumer leader metric when clustered Nov 15, 2022
@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 4d44456 to 4516b9c Compare November 15, 2022 03:34
Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is looking good, there are a few leftovers:

  • could you please create a new issue describing the problem and reference it in the changelog?
  • the test coverage is great! I think that @goku321 was talking about e2e test that we have, to support clustered setup: Provide scaler for NATS Jetstream #2391

CHANGELOG.md Outdated Show resolved Hide resolved
@zroubalik
Copy link
Member

zroubalik commented Nov 15, 2022

/run-e2e nats*
Update: You can check the progress here

@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 4516b9c to b5bfecc Compare November 15, 2022 19:38
@rayjanoka rayjanoka changed the title JetStream scaler use stream consumer leader metric when clustered JetStream scaler should query the stream consumer leader when clustered Nov 15, 2022
@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch 2 times, most recently from 5d57c2c to ad7fd89 Compare November 15, 2022 20:12
@rayjanoka rayjanoka closed this Nov 19, 2022
@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from af83fe0 to f8ea5dd Compare November 19, 2022 08:14
@rayjanoka rayjanoka reopened this Nov 19, 2022
@rayjanoka
Copy link
Contributor Author

rayjanoka commented Nov 19, 2022

Updates

  • new clustered e2e test with 3 nats nodes - takes about 6.5 mins to run
    • tests all combinations of stream replicas (1, 2, & 3) in the cluster
  • lowered the number of messages to process in the standalone test so it is faster
  • adding all combinations of streams replicas to the e2e test required supporting all the stream replicas. This required a bit more logic back in the main code to fall back to scanning each nats node to blindly discover the leader node when sometimes no stream/consumer context is returned in the initial request to nats (when the response is from a node where the stream is not running).
  • we are caching the leader once we find it so we only need to make a single request, there are multiple calls to discover in the first run, or if the stream leader moves to another node

Am I able to trigger the e2e test here? or if one of the crew could start it we can see if it works here.

@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 7693425 to 1b46ff3 Compare November 19, 2022 08:43
@JorTurFer
Copy link
Member

JorTurFer commented Nov 21, 2022

/run-e2e nats*
Update: You can check the progress here

@JorTurFer
Copy link
Member

Am I able to trigger the e2e test here? or if one of the crew could start it we can see if it works here.

No you can't. To trigger them, it's required to be in a specific team in the org. I have already triggered

@rayjanoka
Copy link
Contributor Author

rayjanoka commented Nov 23, 2022

ok, it looks like it worked...I can't see the log, but I see a green check.

I'll take one more look over everything on Monday, then we can do the review and get this finished. thanks!

@JorTurFer
Copy link
Member

You can find the specific workflow logs following the link in the comment
image

There, you have to go to the Execute e2e tests job:
image

And there you only need to check the step:
image

@JorTurFer
Copy link
Member

I'll take one more look over everything on Monday, then we can do the review and get this finished. thanks!

Perfect! Thanks for your contribution!

@rayjanoka
Copy link
Contributor Author

rayjanoka commented Nov 29, 2022

You can find the specific workflow logs following the link in the comment

ah hah! of course, the log looks good good!

@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 1b46ff3 to 122e4e5 Compare November 29, 2022 03:10
@rayjanoka rayjanoka requested review from goku321 and zroubalik and removed request for JorTurFer and goku321 November 29, 2022 03:11
@rayjanoka
Copy link
Contributor Author

ok, I just rebased, should be good for review and delivery! @goku321 @JorTurFer @zroubalik

@JorTurFer
Copy link
Member

JorTurFer commented Nov 29, 2022

/run-e2e nats*
Update: You can check the progress here

@rayjanoka
Copy link
Contributor Author

rayjanoka commented Nov 29, 2022

/run-e2e nats* Update: You can check the progress here

LGTM!

the clustered test is long, but I helped a bit overall by cutting 2.5 mins off the standalone run by pushing less messages through in the test, 1000 -> 300.

standalone e2e test: 59 secs
clustered e2e test: 7.3 mins

@tomkerkhove
Copy link
Member

tomkerkhove commented Dec 2, 2022

/run-e2e nats*
Update: You can check the progress here

@tomkerkhove
Copy link
Member

@JorTurFer @zroubalik Can you re-review this PR please to see if we need to change things before our release on Thursday?
@rayjanoka can you fix the merge conflict please?

@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 122e4e5 to 7ed83a5 Compare December 2, 2022 22:18
@rayjanoka
Copy link
Contributor Author

@JorTurFer @zroubalik Can you re-review this PR please to see if we need to change things before our release on Thursday? @rayjanoka can you fix the merge conflict please?

fixed! thanks @tomkerkhove!

…when clustered

Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>
@rayjanoka rayjanoka force-pushed the jetstream_cluster_consumer_leader branch from 7ed83a5 to 61a844d Compare December 2, 2022 23:00
@zroubalik
Copy link
Member

zroubalik commented Dec 3, 2022

/run-e2e nats*
Update: You can check the progress here

Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

thanks a lot for the contribution!

@zroubalik zroubalik merged commit eb6dd2f into kedacore:main Dec 5, 2022
josephangbc pushed a commit to josephangbc/keda that referenced this pull request Dec 6, 2022
…when clustered (kedacore#3564)

Signed-off-by: Ray <18078751+rayjanoka@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JetStream scaler should query the stream consumer leader when clustered
5 participants