Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HCO health metric #2204

Merged
merged 1 commit into from
Feb 8, 2023
Merged

Add HCO health metric #2204

merged 1 commit into from
Feb 8, 2023

Conversation

assafad
Copy link
Contributor

@assafad assafad commented Jan 11, 2023

Add a metric which indicates the health of HCO and its secondary resources, based on the aggregated conditions. The proposed metric is exposed both as a Prometheus metric (named kubevirt_hco_system_health_status), and as a field in HCO status (named systemHealthStatus).
Its value is set according to the following logic:

  • If at least one resource, out of the resources maintained by HCO, is degraded or not available:
    kubevirt_hco_system_health_status = 2 and systemHealthStatus = error

  • If all resources maintained by HCO are available and not degraded, but either one of the resources is progressing, or the reconciliation wasn't completed:
    kubevirt_hco_system_health_status = 1 and systemHealthStatus = warning

  • If all resources maintained by HCO are available, not degraded, not progressing and the reconciliation was completed:
    kubevirt_hco_system_health_status = 0 and systemHealthStatus = healthy

Signed-off-by: assafad aadmi@redhat.com

Reviewer Checklist

Reviewers are supposed to review the PR for every aspect below one by one. To check an item means the PR is either "OK" or "Not Applicable" in terms of that item. All items are supposed to be checked before merging a PR.

  • PR Message
  • Commit Messages
  • How to test
  • Unit Tests
  • Functional Tests
  • User Documentation
  • Developer Documentation
  • Upgrade Scenario
  • Uninstallation Scenario
  • Backward Compatibility
  • Troubleshooting Friendly

Release note:

Add HCO health metric

Jira-Ticket: https://issues.redhat.com/browse/CNV-24648

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jan 11, 2023
@kubevirt-bot kubevirt-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 11, 2023
@kubevirt-bot
Copy link
Contributor

Hi @assafad. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@assafad
Copy link
Contributor Author

assafad commented Jan 11, 2023

@sradco, @nunnatsa Can you please have a look?

@openshift-ci
Copy link

openshift-ci bot commented Jan 11, 2023

Hi @assafad. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls
Copy link
Collaborator

coveralls commented Jan 11, 2023

Pull Request Test Coverage Report for Build 3949350243

  • 35 of 36 (97.22%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.08%) to 85.64%

Changes Missing Coverage Covered Lines Changed/Added Lines %
controllers/hyperconverged/hyperconverged_controller.go 31 32 96.88%
Files with Coverage Reduction New Missed Lines %
controllers/hyperconverged/hyperconverged_controller.go 1 81.27%
Totals Coverage Status
Change from base Build 3940385214: 0.08%
Covered Lines: 4783
Relevant Lines: 5585

💛 - Coveralls

@assafad assafad changed the title [WIP] Add HCO health metric WIP:Add HCO health metric Jan 11, 2023
@assafad assafad changed the title WIP:Add HCO health metric WIP: Add HCO health metric Jan 11, 2023
@assafad assafad changed the title WIP: Add HCO health metric Add HCO health metric Jan 12, 2023
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 12, 2023
Comment on lines +484 to +486
// SystemHealthStatus reflects the health of HCO and its secondary resources, based on the aggregated conditions.
// +optional
SystemHealthStatus string `json:"systemHealthStatus,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why do we need this field. K8s is getting away from a single status fields. This is why we have conditions.

Copy link
Contributor Author

@assafad assafad Jan 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation was to have a way for users that don’t use Prometheus to check the operator's health.

@sradco is this field necessary for these users? maybe checking the metrics endpoint (which will include the new metric), without accessing Prometheus UI would be enough?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why we have the conditions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabiand - what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a condition which is providing this information (healthy or not). If there is a condition, then this is enough imo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabiand There is no single condition for this

isConditionReconcileCompleteTrue := req.Conditions.IsStatusConditionTrue(hcov1beta1.ConditionReconcileComplete)

isSystemHealthStatusError := !isConditionAvailableTrue || isConditionDegradedTrue
if isSystemHealthStatusError {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this logic. The result will be that we'll maybe get false negative, mostly during setup, until the system is fully function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on what would be the reason for this false negative?
Is it related to the conditions' logic, or to the location in the reconciliation in which we update the system health?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HyperConverged won't be available during setup. According to this code, we'll get an error metric in this case. Is that what we want? @sradco

Copy link
Collaborator

@nunnatsa nunnatsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments and questions.

@nunnatsa
Copy link
Collaborator

@assafad - please notice that you have 1 code smell.

pkg/metrics/metrics.go Outdated Show resolved Hide resolved
Comment on lines 23 to 30

SystemHealthStatusHealthy = float64(0)
SystemHealthStatusWarning = float64(1)
SystemHealthStatusError = float64(2)
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be more robust, if you move to one setting function.

Suggested change
SystemHealthStatusHealthy = float64(0)
SystemHealthStatusWarning = float64(1)
SystemHealthStatusError = float64(2)
)
)
type SystemHealthStatus float64
const (
SystemHealthStatusHealthy SystemHealthStatus = iota
SystemHealthStatusWarning
SystemHealthStatusError
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done without the new type. Let me know what you think

@nunnatsa
Copy link
Collaborator

/ok-to-test

1 similar comment
@nunnatsa
Copy link
Collaborator

/ok-to-test

@kubevirt-bot kubevirt-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 16, 2023
@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-azure

In response to this:

hco-e2e-upgrade-index-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 17, 2023

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure
okd-hco-e2e-upgrade-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-upgrade-index-aws
okd-hco-e2e-image-index-aws lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-gcp

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-sno-azure, ci/prow/okd-hco-e2e-image-index-gcp, ci/prow/okd-hco-e2e-upgrade-index-aws

In response to this:

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure
okd-hco-e2e-upgrade-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-upgrade-index-aws
okd-hco-e2e-image-index-aws lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-gcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 17, 2023

hco-e2e-upgrade-prev-index-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-azure

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-azure

In response to this:

hco-e2e-upgrade-prev-index-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 17, 2023

hco-e2e-image-index-sno-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-sno-aws

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-sno-aws

In response to this:

hco-e2e-image-index-sno-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-sno-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pkg/metrics/metrics.go Show resolved Hide resolved
@@ -80,6 +87,20 @@ var HcoMetrics = func() hcoMetrics {
)
},
},
HCOMetricSystemHealthStatus: {
fqName: "kubevirt_hco_system_health_status",
help: "Indicates whether the system health status is healthy (0), warning (1), or error (2)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "HCO system health status" same as "system health status"? If not, should we add that bit in the help?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric aggregates the health of the system - HCO and its secondary resources. We use hco_ prefix for all HCO metrics, in order to identify and group metrics that are generated by it.
But sure, it is possible to add more information to the help. @sradco WDYT about this description?

Copy link
Collaborator

@sradco sradco Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@assafad Please update the help text with this information. Like, "The metric aggregates the health of the system - HCO and its secondary resources, based on the aggregated conditions."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

pkg/metrics/metrics.go Outdated Show resolved Hide resolved
@hco-bot
Copy link
Collaborator

hco-bot commented Jan 18, 2023

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-sno-azure

In response to this:

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 18, 2023

hco-e2e-upgrade-index-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-aws

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-aws

In response to this:

hco-e2e-upgrade-index-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 18, 2023

hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-gcp
hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-aws, ci/prow/hco-e2e-image-index-gcp

In response to this:

hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-gcp
hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Add HCO health metric, which indicates the health of HCO and its secondary resources, based on the aggregated conditions
Signed-off-by: assafad <aadmi@redhat.com>
@sonarcloud
Copy link

sonarcloud bot commented Jan 18, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@openshift-ci
Copy link

openshift-ci bot commented Jan 18, 2023

@assafad: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/hco-e2e-image-index-aws cdb68e6 link true /test hco-e2e-image-index-aws
ci/prow/okd-hco-e2e-image-index-aws cdb68e6 link true /test okd-hco-e2e-image-index-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 18, 2023

hco-e2e-upgrade-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-sno-azure

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-sno-azure

In response to this:

hco-e2e-upgrade-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-sno-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 18, 2023

okd-hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-aws

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/okd-hco-e2e-image-index-aws

In response to this:

okd-hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hco-bot
Copy link
Collaborator

hco-bot commented Jan 18, 2023

hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-aws

In response to this:

hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nunnatsa
Copy link
Collaborator

nunnatsa commented Feb 8, 2023

/lgtm
/approve

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2023
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nunnatsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2023
@kubevirt-bot kubevirt-bot merged commit bab7b17 into kubevirt:main Feb 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants