
ActionSet handling perf tweaks #1650

Merged 9 commits into master on Oct 5, 2022

Conversation

@ed-shilo (Contributor) commented Sep 29, 2022

Change Overview

This set of changes improves the performance of ActionSet handling in the Kanister controller by fixing the following issues:

  1. Currently the controller's onAdd event handler starts the ActionSet execution goroutine only after a prevalidation stage that makes several API calls to the Kubernetes API server. These calls take several seconds before the goroutine is spawned, and because the client-go cache invokes onAdd sequentially, this adds a significant delay before ActionSets start when many are submitted simultaneously.
  2. When multiple ActionSets are handled in parallel, the load on the Kubernetes API server makes it less responsive: with 10 simultaneous ActionSet executions, each API call was measured to take up to 1 second. To reduce server load, the number of API calls was cut where possible. For example, all per-ActionSet calls to fetch a Blueprint were replaced with a single call to the Blueprint List method, followed by lookups in the returned set of Blueprints.

As proof of the improvement, I measured the execution time of several simultaneously submitted ActionSets, varying their count and the number of phases in the corresponding Blueprint. The test ActionSet performed a single KubeExec call to minimize the impact of phase execution.

Results:

| Phases per Blueprint | Parallel ActionSets | Old execution time | New execution time |
|---|---|---|---|
| 1 | 10 | 35 sec | 24 sec |
| 1 | 20 | 1 min 12 sec | 1 min 1 sec |
| 1 | 30 | 1 min 50 sec | 1 min 16 sec |
| 3 | 10 | 56 sec | 42 sec |
| 3 | 20 | 1 min 53 sec | 1 min 22 sec |
| 3 | 30 | 2 min 51 sec | 2 min 4 sec |
| 5 | 10 | 1 min 12 sec | 1 min 2 sec |
| 5 | 20 | 2 min 26 sec | 2 min 4 sec |
| 5 | 30 | 3 min 26 sec | 3 min 0 sec |

Pull request type

Please check the type of change your PR introduces:

  • 🚧 Work in Progress
  • 🌈 Refactoring (no functional changes, no api changes)
  • 🐹 Trivial/Minor
  • 🐛 Bugfix
  • 🌻 Feature
  • 🗺️ Documentation
  • 🤖 Test

Issues

Test Plan

  • 💪 Manual
  • ⚡ Unit test
  • 💚 E2E

@github-actions bot commented:

Thanks for submitting this pull request 🎉. The team will review it soon and get back to you.

If you haven't already, please take a moment to review our project's contributing guidelines and Code of Conduct document.

@infraq infraq added this to In Progress in Kanister Sep 29, 2022
@ed-shilo ed-shilo force-pushed the keu/ed-shilo/K10-12812-as-perf branch from 3123cad to 4d54af2 Compare September 30, 2022 15:03
@viveksinghggits (Contributor) left a comment:

LGTM. I will let someone else take a look as well. Thank you, good find 💯
cc @ihcsim

@ed-shilo ed-shilo force-pushed the keu/ed-shilo/K10-12812-as-perf branch from 4d54af2 to 4dea2e0 Compare October 3, 2022 06:38
@ed-shilo (Author) commented Oct 3, 2022

@ihcsim Could you please take a look at these changes too?

pkg/param/param.go Outdated Show resolved Hide resolved
pkg/controller/controller.go Outdated Show resolved Hide resolved
pkg/controller/controller.go Outdated Show resolved Hide resolved
@ed-shilo ed-shilo force-pushed the keu/ed-shilo/K10-12812-as-perf branch from 093a47c to 6773589 Compare October 4, 2022 09:38
@denisvmedia (Contributor) left a comment:

LGTM

Kanister automation moved this from In Progress to Reviewer approved Oct 4, 2022
@pavannd1 (Contributor) left a comment:

Thank you @ed-shilo 💯

@ed-shilo ed-shilo added the kueue label Oct 5, 2022
@mergify mergify bot merged commit c2a8e32 into master Oct 5, 2022
@mergify mergify bot deleted the keu/ed-shilo/K10-12812-as-perf branch October 5, 2022 07:37
Kanister automation moved this from Reviewer approved to Done Oct 5, 2022
@ihcsim (Contributor) left a comment:

Thanks for working on this. I left some questions and comments. If needed, we can address them in a follow-up PR.

```go
if err := c.onAddActionSet(v); err != nil {
	log.Error().WithError(err).Print("Callback onAddActionSet() failed")
}
t, ctx := c.LoadOrStoreTomb(context.Background(), v.Name)
```
Contributor:

I know this is legacy code, but I'm not sure what we gain by storing the tombs. There aren't many savings from re-use, since the ActionSets will have different names. FWIW, it also doesn't look expensive to create a tomb anyway.

@ed-shilo (Author):

As I understand it, the tomb was originally used for graceful ActionSet deletion. In fact, the spawned goroutines don't listen for tomb signals, but I suppose it's better to leave them wrapped in the tomb.

```go
}

func (c *Controller) onAddActionSet(ctx context.Context, t *tomb.Tomb, as *crv1alpha1.ActionSet) error {
	as, err := c.crClient.CrV1alpha1().ActionSets(as.GetNamespace()).Get(ctx, as.GetName(), v1.GetOptions{})
```
Contributor:

Any idea why this call is necessary, since we are already passing in `as`?

@ed-shilo (Author):

Seems like it's unnecessary; I'll remove it.

Comment on lines +205 to +208
```go
bpMap := make(map[string]*crv1alpha1.Blueprint)
for _, bp := range bps.Items {
	bpMap[bp.Name] = bp
}
```
@ihcsim (Contributor) commented Oct 5, 2022:

What is the purpose of re-listing and storing all blueprints for a single actionset? An actionset is mapped to just one blueprint.

Also, how about using the K8s client-go informers and their Lister() method to retrieve the needed blueprint and maintain the cache, instead of doing it ourselves?

@ed-shilo (Author):

I don't see where this is validated, but if I'm not mistaken, according to the code the actions in an ActionSet may be associated with different Blueprints.
Using an informer would require waiting between adding a Blueprint and adding an ActionSet, due to cache resync. Do we want that behavior?

Contributor:

Not really, @ed-shilo. An ActionSet can point to just one Blueprint.

Contributor:

Created Blueprints not being available immediately might be a concern.

Contributor:

The Blueprint informer watches for new Blueprints and updates its indexer accordingly. Try adding a new Blueprint; the informer should respond immediately. I am not sure why we are using the low-level cache directly here, but it looks like we are already using client-go's caching mechanism.

Resync applies to existing resources, when client-go reconciles any differences between the cache and the data in its store.

@ed-shilo You are right about the 1:n mapping between ActionSet and Blueprint. It looks like an ActionSet can have multiple actions, each of which can point to whichever Blueprint it wants, even though downstream we normally associate one ActionSet with one Blueprint. Not sure why the API is designed that way.
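Because each action may name its own blueprint, the PR's list-once-then-lookup approach keeps a whole map rather than a single blueprint. A minimal sketch of that pattern (with a stand-in `blueprint` struct instead of `crv1alpha1.Blueprint`; this is illustrative, not the merged code):

```go
package main

import "fmt"

// blueprint stands in for crv1alpha1.Blueprint (illustrative only).
type blueprint struct{ Name string }

// buildBpMap indexes the result of a single List call by name, replacing
// one Get round trip per action with an in-memory lookup. Since each action
// in an ActionSet may reference its own blueprint, the full map is kept for
// the duration of the ActionSet.
func buildBpMap(items []blueprint) map[string]*blueprint {
	m := make(map[string]*blueprint, len(items))
	for i := range items {
		m[items[i].Name] = &items[i] // index into the slice to avoid aliasing the loop variable
	}
	return m
}

func main() {
	bps := buildBpMap([]blueprint{{Name: "backup-bp"}, {Name: "restore-bp"}})
	for _, actionBp := range []string{"backup-bp", "restore-bp"} {
		if bp, ok := bps[actionBp]; ok {
			fmt.Println("found blueprint:", bp.Name)
		}
	}
}
```

The trade-off raised later in the thread applies: one List per ActionSet is cheaper than N Gets under parallel load, but it fetches every blueprint in the namespace even when only one is needed.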

@ed-shilo (Author) commented Oct 6, 2022:

@ihcsim I tried using a lister to fetch Blueprints, but as far as I can see, a newly created Blueprint reaches the cache only after a short delay. In tests this shows up as failed attempts to add an ActionSet linked to an already added Blueprint, due to a cache miss. Are we OK with this?
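The cache-miss window described here can be modeled without a cluster. The sketch below (pure stdlib, not client-go; types and names are invented for illustration) simulates an informer-backed lister whose read view lags the source of truth until a sync step runs, which is exactly the race between creating a Blueprint and submitting an ActionSet:

```go
package main

import (
	"fmt"
	"sync"
)

// cache simulates an informer-backed lister: writes land in the backing
// store first (the API server) and reach the read cache only after a sync
// step (the informer processing its watch events).
type cache struct {
	mu     sync.Mutex
	store  map[string]bool // source of truth
	synced map[string]bool // what the lister sees
}

func newCache() *cache {
	return &cache{store: map[string]bool{}, synced: map[string]bool{}}
}

func (c *cache) Create(name string) { c.mu.Lock(); c.store[name] = true; c.mu.Unlock() }

func (c *cache) Get(name string) bool { c.mu.Lock(); defer c.mu.Unlock(); return c.synced[name] }

// Sync propagates the store into the read cache.
func (c *cache) Sync() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.store {
		c.synced[k] = true
	}
}

func main() {
	c := newCache()
	c.Create("backup-bp")
	fmt.Println("before sync:", c.Get("backup-bp")) // cache miss
	c.Sync()
	fmt.Println("after sync:", c.Get("backup-bp"))
}
```

In real client-go the window is usually milliseconds, but as the test failures show, an ActionSet submitted immediately after its Blueprint can land inside it.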

Contributor:

You are right: even though we don't use different Blueprints for separate actions in an ActionSet, the code does appear to allow it.

Contributor:

My guess is that caching isn't a major contributor to the bottleneck. If anything, I would be suspicious of code like this, where every phase hits a recursive reflection call. That can get expensive with sophisticated Go template statements.

@ihcsim (Contributor) commented Oct 6, 2022:

> Is it formally assumed that it should have the same value in case of multiple actions?

No, we can't assume that anymore. If the API lets users use multiple blueprints, we have to support it.

Re: the missing cache, did you use the informer's lister or the cache lister? It looks like our code is using the cache one.

@ed-shilo (Author):

@ihcsim I used the lister from the link you proposed previously. As far as I can see, it uses caching.

```go
	c.logAndErrorEvent(ctx, "Could not get blueprint:", "Blueprint not specified", err, as)
	break
}
bp, ok := bps[a.Blueprint]
```
Contributor:

I recommend using the client-go informers to retrieve the blueprint from its cache, instead of managing the cache ourselves.

@ed-shilo (Author) commented Oct 6, 2022:

Same question about the need to wait between Blueprint and ActionSet creation.

```go
if err != nil {
	return errors.WithStack(err)
}
if err := validate.ActionSet(as); err != nil {
	return err
}
bps, err := c.createBpCache(ctx, as.GetNamespace())
```
Contributor:

This will get expensive during a controller restart or if the ActionSet gets re-queued. See #1267. IIRC, on restart the controller calls onAddActionSet() on all existing ActionSet resources; with this change, the controller will build a map of all Blueprints for every known ActionSet.

ed-shilo added a commit that referenced this pull request Oct 10, 2022
@ed-shilo ed-shilo mentioned this pull request Oct 10, 2022
@ed-shilo (Author) commented Oct 10, 2022

@ihcsim @viveksinghggits @pavannd1 I've opened the follow-up PR #1668, which addresses the remaining comments left after the merge. I decided to go back to direct retrieval of Blueprints from the API server, since using caches seems to cause more problems than its contribution to overall workflow runtime is worth.

ed-shilo added a commit that referenced this pull request Oct 10, 2022
ed-shilo added a commit that referenced this pull request Oct 11, 2022
mergify bot pushed a commit that referenced this pull request Oct 11, 2022

Successfully merging this pull request may close these issues.

[BUG] Poor performance of ActionSet processing
5 participants