Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ScalersCache to reuse scalers unless they need changing #2187

Merged
merged 2 commits into from
Nov 9, 2021

Conversation

ahmelsayed
Copy link
Contributor

@ahmelsayed ahmelsayed commented Oct 12, 2021

Closes #1121

Provide a description of what has been changed

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)
  • Tests have been added
  • [N/A] A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
  • [N/A] A PR is opened to update the documentation on (repo) (if applicable)
  • Changelog has been updated

Fixes #1121

@ahmelsayed ahmelsayed requested a review from zroubalik as a code owner October 12, 2021 12:46
Copy link
Member

@zroubalik zroubalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton for this!

I have a few questions and minor nits on imports formating :) I have done a very quick review, I will go through the code more properly later.

adapter/main.go Outdated Show resolved Hide resolved
adapter/main.go Outdated Show resolved Hide resolved
adapter/main.go Outdated Show resolved Hide resolved
pkg/scaling/scale_handler_test.go Outdated Show resolved Hide resolved
Comment on lines 169 to 181
h.lock.RLock()
if cache, ok := h.scalerCaches[key]; ok {
h.lock.RUnlock()
return cache, nil
}
h.lock.RUnlock()

h.lock.Lock()
defer h.lock.Unlock()
if cache, ok := h.scalerCaches[key]; ok {
return cache, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate on this part? I am not sure I get it. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 169-174 look to be the internal cache check, but @ahmelsayed are lines 178-180 duplicates?

Copy link
Contributor Author

@ahmelsayed ahmelsayed Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recheck state after acquiring the W lock on line 176. If a previous thread goroutine has already created a cache, use it. I can change it to a sync.Map with LoadOrStore but I was reading https://github.com/golang/go/blob/master/src/sync/map.go#L12-L26 and wasn't sure if using sync.Map is the best option here, but I wasn't sure.

This can be changed if I take a W lock at line 169, but I thought an R lock there will reduce contention. It's not a high throughput scenario, but that's the idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it! thanks, I missed the RUnlock call there. since this is pretty subtle, I think at a minimum it should have a comment explaining that there's a load or store operation going on here. better would be to use an abstraction like (sync.Map).LoadOrStore. I'll leave it up to you, though, since the code will change in a non-trivial way.

at a higher level, though, is there likely to be a lot of contention with this code? I'm asking to get a feel for whether it's worth using that read lock for the initial check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect much contention here. The code paths I could identify that might cause contention are:

  1. Multiple ScaledObject firing their pollingInterval at exactly the same time.
  2. If concurrent reconciliation is enabled.
  3. If the metric adapter gets multiple requests for the same metric value at the same time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahmelsayed would it make sense to just acquire a write lock, do the check, and then handle a cache miss then, to simplify this code?

pkg/scaling/scale_handler.go Outdated Show resolved Hide resolved
pkg/scaling/cache/scalers_cache_test.go Outdated Show resolved Hide resolved
adapter/main.go Outdated Show resolved Hide resolved
@zroubalik zroubalik added this to the v2.5.0 milestone Oct 12, 2021
Copy link
Contributor

@arschles arschles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @ahmelsayed - I left a few comments, mostly nits/ideas. One general comment: it would be good to add some Godoc comments, particularly on the ScalersCache

adapter/main.go Outdated Show resolved Hide resolved
adapter/main.go Outdated Show resolved Hide resolved
adapter/main.go Outdated Show resolved Hide resolved
adapter/main.go Outdated Show resolved Hide resolved
controllers/keda/suite_test.go Outdated Show resolved Hide resolved
Comment on lines 41 to 44
func NewScalerCache(scalers []scalers.Scaler, factories []func() (scalers.Scaler, error), logger logr.Logger, recorder record.EventRecorder) (*ScalersCache, error) {
if len(scalers) != len(factories) {
return nil, fmt.Errorf("scalers and factories must match")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an idea - make the below scalerBuilder public and take in a []ScalerBuilder parameter instead of the scalers and factories ones. you wouldn't need to do this error check and callers wouldn't need to know to ensure len(scalers) == len(factories). WDYT?

pkg/scaling/cache/scalers_cache.go Show resolved Hide resolved
pkg/scaling/cache/scalers_cache.go Outdated Show resolved Hide resolved
Comment on lines 169 to 181
h.lock.RLock()
if cache, ok := h.scalerCaches[key]; ok {
h.lock.RUnlock()
return cache, nil
}
h.lock.RUnlock()

h.lock.Lock()
defer h.lock.Unlock()
if cache, ok := h.scalerCaches[key]; ok {
return cache, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 169-174 look to be the internal cache check, but @ahmelsayed are lines 178-180 duplicates?

pkg/scaling/scale_handler.go Outdated Show resolved Hide resolved
@JorTurFer
Copy link
Member

JorTurFer commented Oct 12, 2021

This PR modifies the behavior related with the life cycle of the scalers, basically now we are going to keep them alive until a change requires an update, right?
Should these changes related with the lifecycle be documented? (for example in https://github.com/kedacore/keda/blob/main/CREATE-NEW-SCALER.md#lifecycle-of-a-scaler)

@ahmelsayed
Copy link
Contributor Author

ahmelsayed commented Oct 12, 2021

This PR modifies the behavior related with the life cycle of the scalers, basically now we are going to keep them alive until a change requires an update, right? Should these changes related with the lifecycle be documented? (for example in main/CREATE-NEW-SCALER.md#lifecycle-of-a-scaler)

That's correct. The behavior/contract shouldn't change at all though, so I wasn't sure if anything in particular needs to be documented.

Behavior before:

  1. Create Scaler (with latest secrets/trigger auth values)
  2. Check Scaler
  3. Close Scaler
  4. Sleep for pollingInterval
  5. goto 1

if ScaledObject has changed, cancel loop above

Behavior after:

  1. Create or get existing Scaler
  2. Check Scaler
  3. Refresh Scaler on error (in case a secret or authentication value changed in the mean time, since we don't/can't watch all secret sources.
  4. Check Scaler
  5. Sleep for pollingInterval
  6. goto 1

if ScaledObject has changed, invalidate the cache.

So from a user/scaler author prospective, there is no change. If a scaler today is having a long lived connection and implements Close() correctly, it should work.


@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.

Before this will cause once an hour to create a SQL connection that's short-lived.

After there will always be 100 open connections to the MySQL server.

@arschles
Copy link
Contributor

@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

what behavior specifically? whether a scaler applied to a Scaled{Object, Job} should be cached?

@JorTurFer
Copy link
Member

JorTurFer commented Oct 13, 2021

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.

In base of this use case, maybe we should support configure it by trigger, I mean, inside the same ScaledObject we can have several triggers, some of them for long time and another for short time 🤔
I think that reach this behavior could be a bit tricky, so for me it's enough if we can define it at ScaledObject level

So from a user/scaler author prospective, there is no change. If a scaler today is having a long lived connection and implements Close() correctly, it should work.

That's truth, and if we support enabling and disabling the cache, the behavior is exactly the same, you are right (in my previous comment I was thinking in the internal objects lifecycle and the requirement of thinking in them as ephemeral objects, but without cache they are ephemeral, so the behavior is exactly the same)

@zroubalik
Copy link
Member

zroubalik commented Oct 13, 2021

@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

Probably? Don't have a strong opinion on this.

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.

Before this will cause once an hour to create a SQL connection that's short-lived.

After there will always be 100 open connections to the MySQL server.

If I am not mistaken, HPA is asking Metrics Server for particular metrics periodically, therefore the MySQL server metrics will be scraped anyway. We would have to cache the metric value in as well in order to do the actual check once an hour 🤔

Btw I had been thinking about this feature some time before, current pollingInterval applies only to Operator, not Metrics Server. If we cache the metric value in Metrics Server the pollingInterval could be then applicaple there as well. Though not sure if this is somehow useful, but it should limit the number of requests to the external service, if that's users concern.

@zroubalik
Copy link
Member

zroubalik commented Oct 13, 2021

This PR modifies the behavior related with the life cycle of the scalers, basically now we are going to keep them alive until a change requires an update, right? Should these changes related with the lifecycle be documented? (for example in main/CREATE-NEW-SCALER.md#lifecycle-of-a-scaler)

That's correct. The behavior/contract shouldn't change at all though, so I wasn't sure if anything in particular needs to be documented.

Behavior before:

1. Create Scaler (with latest secrets/trigger auth values)

2. Check Scaler

3. Close Scaler

4. Sleep for `pollingInterval`

5. goto `1`

if ScaledObject has changed, cancel loop above

Behavior after:

1. Create or get existing Scaler

2. Check Scaler

3. Refresh Scaler on error (in case a secret or authentication value changed in the mean time, since we don't/can't watch all secret sources.

4. Check Scaler

5. Sleep for `pollingInterval`

6. goto `1`

if ScaledObject has changed, invalidate the cache.

I agree we should document this for scalers developers. I think that similar description to the one above is enough.

@zroubalik zroubalik changed the title Add ScalersCache to reuse scales unless they need changing Add ScalersCache to reuse scalers unless they need changing Oct 13, 2021
@ahmelsayed
Copy link
Contributor Author

@arschles @JorTurFer @tomkerkhove @zroubalik do you think this behavior should be globally configurable (or configurable per ScaledObject)?

Probably? Don't have a strong opinion on this.

The main scenario I can think of is someone having a large number of ScaledObject that don't need to be checked frequently. For example, 100 MySQL scalers that need to be checked once an hour.
Before this will cause once an hour to create a SQL connection that's short-lived.
After there will always be 100 open connections to the MySQL server.

If I am not mistaken, HPA is asking Metrics Server for particular metrics periodically, therefore the MySQL server metrics will be scraped anyway. We would have to cache the metric value in as well in order to do the actual check once an hour 🤔

Btw I had been thinking about this feature some time before, current pollingInterval applies only to Operator, not Metrics Server. If we cache the metric value in Metrics Server the pollingInterval could be then applicaple there as well. Though not sure if this is somehow useful, but it should limit the number of requests to the external service, if that's users concern.

That's a good point @zroubalik. I'll just add a feature env var that will invalidate the cache right after every check for now (effectively closing the scaler to have the same old behavior) in case the new behavior impacts someone. How does that sound?

@zroubalik
Copy link
Member

zroubalik commented Oct 15, 2021

That's a good point @zroubalik. I'll just add a feature env var that will invalidate the cache right after every check for now (effectively closing the scaler to have the same old behavior) in case the new behavior impacts someone. How does that sound?

So do you propose to introduce global env var that will affect all ScaledObject or do per ScaledObjects setting? I am fine with both approaches, or we can introduce the per ScaledObject option later, if there's a demand.

adapter/main.go Outdated

go func() {
if err := mgr.Start(context.Background()); err != nil {
panic(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an idea here: rather than panic, you could return an errgroup or <-chan error from runScaledObjectController so that the caller can choose what to do with the error in this goroutine?

doubtful this is necessary for this PR, and I'm unsure whether that would be an improvement, holistically, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point. The caller takes <-chan struct{} as a stopCh. I changed it to return that in

2381bfa (#2187)

pkg/scaling/cache/scalers_cache.go Outdated Show resolved Hide resolved
Comment on lines 279 to 266
if err != nil {
scalerLogger.V(1).Info("Error getting scaler.IsActive, but continue", "Error", err)
c.recorder.Event(scaledJob, corev1.EventTypeWarning, eventreason.KEDAScalerFailed, err.Error())
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be part of the previous if err != nil block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a change to refresh the scaler and call IsActive again. so the err checked here is either from L270 or L273

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean L260 or L263? Then the err on L263 is the one defined in the block on L262.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. Sorry I merged a suggestion without thinking about it too much. This has to update the outer err in case the scaler is still in an error state even after refreshing all the secrets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/scaling/cache/scalers_cache.go Show resolved Hide resolved
pkg/scaling/cache/scalers_cache_test.go Outdated Show resolved Hide resolved
pkg/scaling/cache/scalers_cache_test.go Outdated Show resolved Hide resolved
Comment on lines 169 to 181
h.lock.RLock()
if cache, ok := h.scalerCaches[key]; ok {
h.lock.RUnlock()
return cache, nil
}
h.lock.RUnlock()

h.lock.Lock()
defer h.lock.Unlock()
if cache, ok := h.scalerCaches[key]; ok {
return cache, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it! thanks, I missed the RUnlock call there. since this is pretty subtle, I think at a minimum it should have a comment explaining that there's a load or store operation going on here. better would be to use an abstraction like (sync.Map).LoadOrStore. I'll leave it up to you, though, since the code will change in a non-trivial way.

at a higher level, though, is there likely to be a lot of contention with this code? I'm asking to get a feel for whether it's worth using that read lock for the initial check.

pkg/scaling/scale_handler.go Outdated Show resolved Hide resolved
Complete(r)
}

type MetricsScaledJobReconciler struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Job related Metrics Server Reconciler is not needed ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep :)

Comment on lines 279 to 266
if err != nil {
scalerLogger.V(1).Info("Error getting scaler.IsActive, but continue", "Error", err)
c.recorder.Event(scaledJob, corev1.EventTypeWarning, eventreason.KEDAScalerFailed, err.Error())
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean L260 or L263? Then the err on L263 is the one defined in the block on L262.

Copy link
Contributor

@arschles arschles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks great @ahmelsayed. I left a few comments, but nothing seems blocking to me.

pkg/scaling/cache/scalers_cache.go Outdated Show resolved Hide resolved
Comment on lines 279 to 266
if err != nil {
scalerLogger.V(1).Info("Error getting scaler.IsActive, but continue", "Error", err)
c.recorder.Event(scaledJob, corev1.EventTypeWarning, eventreason.KEDAScalerFailed, err.Error())
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/scaling/cache/scalers_cache.go Show resolved Hide resolved
@zroubalik
Copy link
Member

zroubalik commented Oct 26, 2021

@ahmelsayed FYI, we just merged this #2202 for context propagation so you'll need a rebase.

Would be nice if you can in this PR at least partially tackle the other (2.) item from the context propagation related issue #2190

@arschles
Copy link
Contributor

arschles commented Oct 26, 2021

@ahmelsayed FYI, we just merged this #2202 for context propagation so you'll need a rebase.

@ahmelsayed this was my doing, sorry about all the conflicts. DM me if you'd like and I can help with the resolution.

Would be nice if you can in this PR at least partially tackle the other (2.) item from the context propagation related issue #2190

@zroubalik to avoid expanding the scope of this PR, I am happy to tackle (2) from that issue in a follow-up PR after this is merged.

@zroubalik
Copy link
Member

@ahmelsayed any update on this please? I'd like to start working on #2156 to have it for the upcoming release. The best thing would be to base it on your code (the controller part in Metric Server in particular, etc), so don't want start on this until this is merged, to avoid complex rebases 😄 Thx :)

@ahmelsayed
Copy link
Contributor Author

Sorry for the delay, I just rebased the PR, and I believe all the feedback has been addressed? (the conversation isn't easy to follow on github ui)

@ahmelsayed ahmelsayed force-pushed the ahmels/1121 branch 3 times, most recently from fbd69ca to 4fdf72a Compare November 9, 2021 09:38
Closes #1121

Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>
@zroubalik
Copy link
Member

/run-e2e

@zroubalik zroubalik requested a review from JorTurFer November 9, 2021 12:21
@zroubalik
Copy link
Member

/run-e2e

@zroubalik
Copy link
Member

/run-e2e

There is an intermittent failure in scalers/azure-queue-trigger-auth.test.ts, the rest seem to be ok 🎉

Copy link
Member

@JorTurFer JorTurFer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
Thanks a ton!

@zroubalik
Copy link
Member

@ahmelsayed could you please update the changelog with this contribution?

Copy link
Contributor

@arschles arschles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahmelsayed sorry I haven't been back to this PR in some time. I left one comment regarding a log line (a nit-pick, really). I'm happy to do it in a follow-up if you think it'd be a good idea. Let me know.

Regardless, LGTM

pkg/scaling/cache/scalers_cache.go Show resolved Hide resolved
Signed-off-by: Ahmed ElSayed <ahmels@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should scalers reuse opened connection/clients?
6 participants