OCPBUGS-34227: fix: move checks into readiness and introduce grpc health #658

jakobmoellerdev · 2024-07-11T08:26:41Z

This moves the checks into readiness so that the pod is no longer shut down on health-check failures, e.g. when the device class setup takes an extremely long time. As a side effect, vgmanager may hang if one of its servers hangs, but it will report a not-Ready state if that happens. In addition, the node server now reports healthy only once the setup has completed successfully.

Concretely, this fixes a situation in which the lvmd setup takes an extremely long time due to thin-pool provisioning (e.g. 5 minutes to zero a huge array of disks) and the kubelet kills the pod because of the failing health check.
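To illustrate the idea only, here is a minimal Go sketch of gating readiness (but not liveness) on a slow setup step. It is not the code from this PR: `SetupGate`, `handler`, and `probe` are hypothetical names, and plain HTTP probes stand in for the gRPC health service so that the example needs nothing beyond the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// SetupGate is a hypothetical helper (not taken from the PR) that tracks
// whether a long-running setup step has finished.
type SetupGate struct {
	done atomic.Bool
}

// MarkReady records that setup completed successfully.
func (g *SetupGate) MarkReady() { g.done.Store(true) }

// Liveness succeeds whenever the process is able to answer at all, so a slow
// setup never causes the kubelet to restart the container.
func (g *SetupGate) Liveness() error { return nil }

// Readiness fails until MarkReady has been called.
func (g *SetupGate) Readiness() error {
	if !g.done.Load() {
		return errors.New("setup still in progress")
	}
	return nil
}

// handler turns a check into a probe endpoint: 200 on success, 503 otherwise.
func handler(check func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if err := check(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe calls a handler in-process and returns the HTTP status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	gate := &SetupGate{}
	live, ready := handler(gate.Liveness), handler(gate.Readiness)

	fmt.Println("during setup:", probe(live), probe(ready))
	gate.MarkReady() // e.g. once thin-pool provisioning has finished
	fmt.Println("after setup:", probe(live), probe(ready))
}
```

With this split, a pod that is still provisioning is reported as not Ready instead of being killed, which matches the trade-off described above: a hung server shows up as a not-Ready pod rather than triggering a restart.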

openshift-ci-robot · 2024-07-11T08:26:47Z

@jakobmoellerdev: This pull request references Jira Issue OCPBUGS-34227, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @radeore

The bug has been updated to refer to the pull request using the external bug tracker.


openshift-ci · 2024-07-11T08:28:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jakobmoellerdev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jakobmoellerdev]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov-commenter · 2024-07-11T08:51:08Z

Codecov Report

Attention: Patch coverage is 42.39130% with 53 lines in your changes missing coverage. Please review.

Project coverage is 58.22%. Comparing base (fec2052) to head (b9f10bd).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #658      +/-   ##
==========================================
+ Coverage   57.91%   58.22%   +0.31%     
==========================================
  Files          53       55       +2     
  Lines        4215     4285      +70     
==========================================
+ Hits         2441     2495      +54     
- Misses       1532     1542      +10     
- Partials      242      248       +6

| Files | Coverage Δ | |
|---|---|---|
| internal/controllers/lvmcluster/controller.go | `68.83% <100.00%> (+6.98%)` | ⬆️ |
| internal/csi/grpc_runner.go | `0.00% <0.00%> (ø)` | |
| internal/csi/health.go | `0.00% <0.00%> (ø)` | |
| ...ternal/controllers/lvmcluster/resource/csi_node.go | `72.54% <72.54%> (ø)` | |
| cmd/vgmanager/vgmanager.go | `0.00% <0.00%> (ø)` | |

... and 3 files with indirect coverage changes

jakobmoellerdev · 2024-07-11T08:55:01Z

/override ci/prow/snyk-deps

openshift-ci · 2024-07-11T08:56:45Z

@jakobmoellerdev: Overrode contexts on behalf of jakobmoellerdev: ci/prow/snyk-deps


jakobmoellerdev · 2024-07-11T12:14:22Z

/retest

jakobmoellerdev · 2024-07-15T10:32:33Z

/test e2e-aws
/test e2e-aws-single-node

jakobmoellerdev · 2024-07-15T12:00:56Z

/test e2e-aws

jakobmoellerdev · 2024-07-15T13:34:16Z

/test e2e-aws
/test e2e-aws-single-node

This avoids the pod shutdown in case of healthiness errors, e.g. when the device class setup takes extremely long. This has the side effect of vgmanager possibly hanging up if there is a hangup in one of its servers, but it will show a not Ready State if that happens. Also now the node server only reports healthiness once this has been successfully completed.

Signed-off-by: Jakob Möller <jmoller@redhat.com>

jakobmoellerdev · 2024-07-22T11:09:49Z

/override ci/prow/snyk-deps

openshift-ci · 2024-07-22T11:10:06Z

@jakobmoellerdev: Overrode contexts on behalf of jakobmoellerdev: ci/prow/snyk-deps


jakobmoellerdev · 2024-07-22T12:13:26Z

/test e2e-aws

suleymanakbas91 · 2024-07-22T13:33:03Z

/lgtm

openshift-ci · 2024-07-22T14:51:09Z

@jakobmoellerdev: all tests passed!


openshift-ci-robot · 2024-07-22T14:55:33Z

@jakobmoellerdev: Jira Issue OCPBUGS-34227: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-34227 has been moved to the MODIFIED state.


openshift-ci bot requested a review from radeore July 11, 2024 08:26

openshift-ci bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) Jul 11, 2024

openshift-ci bot requested review from jerpeter1 and suleymanakbas91 July 11, 2024 08:28

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jul 11, 2024

openshift-ci bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) Jul 11, 2024

jakobmoellerdev mentioned this pull request Jul 16, 2024

OCPBUGS-34227: fix: increase vgmanager startup timeout to 10 mins to cover long running volume group initialization #668

Merged

jakobmoellerdev force-pushed the OCPBUGS-34227 branch 2 times, most recently from d17a209 to 4b87fac Compare July 22, 2024 09:50

jakobmoellerdev force-pushed the OCPBUGS-34227 branch from 4b87fac to b9f10bd Compare July 22, 2024 10:36

openshift-ci bot assigned suleymanakbas91 Jul 22, 2024

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Jul 22, 2024

openshift-merge-bot bot merged commit 4788d22 into openshift:main Jul 22, 2024
9 checks passed
