Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kanister prometheus metrics: Controller Integration + ActionSet resolution counter exports #2247

Merged
merged 18 commits into from
Aug 25, 2023

Conversation

mellon-collie
Copy link
Contributor

@mellon-collie mellon-collie commented Aug 8, 2023

Change Overview

Following changes made under the kanister prometheus metrics helm feature flag:
-> This PR exports ActionSet resolution counter called kanister_action_set_resolutions_total, with "success" and "failure" label values - based on the result of the ActionSet.
-> This PR also integrates the Kanister-prometheus framework to the main Kanister controller interface. The integration involves the controller object taking in a prometheus registry and instantiating a metrics object that encapsulates all the Prometheus metric objects for the controller interface.

Pull request type

Please check the type of change your PR introduces:

  • 🚧 Work in Progress
  • 🌈 Refactoring (no functional changes, no api changes)
  • 🐹 Trivial/Minor
  • 🐛 Bugfix
  • 🌻 Feature
  • 🗺️ Documentation
  • 🤖 Test

Issues

  • fixes #issue-number

Test Plan

  1. TextExecActionSet unit-test has been modified to ensure that "success" and "failure" counter increments are made successfully on the resolution of an ActionSet.

  2. New unit-test called TestNilPrometheusRegistry has been added to test the behavior of the controller when a nil prometheus registry is passed and ensure that it does not panic.

  3. Manual tests: Multiple action sets were created from the mysql blueprint example and the /metrics endpoint has been tested to see if the metrics are correctly exported.

Details:
When the a new Kanister operator pod is deployed, the kanister_action_set_resolutions_total metric is set to 0 in the /metrics endpoint:
Screenshot 2023-08-09 at 11 41 19 AM

Following the mysql-blueprints example from the examples directory, 2 ActionSets were created which perform mysql backups, both of which succeeded:

kanctl create actionset --action backup --namespace kanister --blueprint mysql-blueprint --statefulset mysql-test/mysql-release --profile mysql-test/<profile-name> --secrets mysql=mysql-test/mysql-release

Following this, the /metrics displays the following counter value:

Screenshot 2023-08-09 at 11 43 07 AM
  • 💪 Manual
  • ⚡ Unit test
  • 💚 E2E

@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2023

Thanks for submitting this pull request 🎉. The team will review it soon and get back to you.

If you haven't already, please take a moment to review our project contributing guideline and Code of Conduct document.

@infraq infraq added this to In Progress in Kanister Aug 8, 2023
Copy link
Contributor

@viveksinghggits viveksinghggits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
left some minor comments.

pkg/controller/controller.go Outdated Show resolved Hide resolved
pkg/controller/metrics.go Show resolved Hide resolved
pkg/controller/metrics.go Outdated Show resolved Hide resolved
pkg/controller/controller_test.go Outdated Show resolved Hide resolved
pkg/kancontroller/kancontroller.go Show resolved Hide resolved
@mellon-collie mellon-collie force-pushed the kanister-prometheus-metrics-resolution branch 5 times, most recently from 9b3a423 to 2ac8dd8 Compare August 9, 2023 17:50
Copy link
Contributor

@viveksinghggits viveksinghggits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Base automatically changed from kanister-prometheus-skeleton-framework to master August 10, 2023 16:47
Kanister automation moved this from In Progress to Reviewer approved Aug 10, 2023
@viveksinghggits
Copy link
Contributor

I have approved, we can wait for @pavannd1 to have a look as well.

@mellon-collie mellon-collie force-pushed the kanister-prometheus-metrics-resolution branch from 0d23528 to 743d6ab Compare August 14, 2023 16:54
go.mod Outdated Show resolved Hide resolved
@@ -435,17 +448,20 @@ func (c *Controller) runAction(ctx context.Context, t *tomb.Tomb, as *crv1alpha1
c.logAndSuccessEvent(ctx, fmt.Sprintf("Executing action %s", action.Name), "Started Action", as)
tp, err := param.New(ctx, c.clientset, c.dynClient, c.crClient, c.osClient, action)
if err != nil {
c.incrementActionSetResolutionCounterVec(ACTION_SET_COUNTER_VEC_LABEL_RES_FAILURE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remind me what the issue was with incrementing this in the caller of runAction i.e., handleActionSet inside the if block on line 421?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavannd1 There is no issue per se with respect to this PR. Considering that we do move the Inc() calls to line 421 now, and we decide to pair the resolution label dimension with another dimension, say action_type in future PRs, we would have to move back this Inc() from line 421 to the callee, because at line 421, we don't have the action_type information. Hence, I thought that it would be better to maintain the Inc() calls in the callee to avoid make an additional change when we add action_type dimension.

I think this would apply to any other label dimension we choose to use with the resolution dimension.

pkg/controller/controller_test.go Outdated Show resolved Hide resolved
@@ -630,6 +666,7 @@ func (s *ControllerSuite) TestExecActionSet(c *C) {
if !cancel {
err = s.waitOnActionSetState(c, as, final)
c.Assert(err, IsNil, Commentf("Failed case: %s", tc.name))
c.Assert(getCounterVecValue(s.ctrl.metrics.actionSetResolutionCounterVec, []string{tc.metricResolution}), Equals, oldValue+1, Commentf("Failed case: %s", tc.name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the cancel case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're currently maintaining only success and failure resolutions as per our discussions. Hence cancel has essentially been ignored, and we will not be incrementing any values for the cancel case.

If we do decide to add cancel in the future, then we will have to add a new Inc() at the point where we figure out that the ActionSet has been cancelled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit confusing because TestExecActionSet has been set up to "expect" a failure in the cancel case, but we aren't asserting anything for now. Let me know if it's better to pass in blank resolution status during the unit test.

pkg/controller/controller_test.go Outdated Show resolved Hide resolved
pkg/controller/metrics.go Outdated Show resolved Hide resolved
@mellon-collie mellon-collie force-pushed the kanister-prometheus-metrics-resolution branch from 0ee8f78 to f365d37 Compare August 18, 2023 23:16
Copy link
Contributor

@pavannd1 pavannd1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mergify mergify bot merged commit 536ac9a into master Aug 25, 2023
14 checks passed
Kanister automation moved this from Reviewer approved to Done Aug 25, 2023
@mergify mergify bot deleted the kanister-prometheus-metrics-resolution branch August 25, 2023 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants