Node status updater now deletes the node entry in attach updates... #45923

Merged

verult
Contributor

@verult verult commented May 17, 2017

… when node is missing in NodeInformer cache.

  • Added RemoveNodeFromAttachUpdates as part of node status updater operations.

What this PR does / why we need it: Fixes issue of unnecessary node status updates when node is deleted.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #42438

Special notes for your reviewer: Unit test added, but a more comprehensive test involving the attach/detach controller requires testing functionality that is currently absent and would take a larger effort. It will be added at a later time.

There is an edge case caused by the following steps:

  1. A node is deleted and restarted. The node exists, but is not yet recognized by Kubernetes.
  2. A pod requiring a volume attach is created with nodeName specifically set to this node.

This would leave the pod stuck in the ContainerCreating state. This is low priority, since it is a specific edge case that can be avoided.

Release note:

Fix log spam due to unnecessary status update when node is deleted.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 17, 2017
@k8s-ci-robot
Contributor

Hi @verult. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 17, 2017
@verult
Contributor Author

verult commented May 17, 2017

/assign @saad-ali @jingxu97

nodeName,
err)
nsu.actualStateOfWorld.SetNodeStatusUpdateNeeded(nodeName)
nsu.actualStateOfWorld.RemoveNodeFromAttachUpdates(nodeName)
Contributor

I think you need to take different actions for nodeObj == nil and err != nil.
When nodeObj == nil, the API server no longer has this object, so it should be safe to remove the node. But err != nil indicates that something went wrong when retrieving the node object, and the node status updater should try again.

Contributor Author

nodeLister.Get() returns a nil object only when there's an error, and returns an error only when the object is nil. However, I can add it in as a safeguard against future changes.

Contributor

The error is different if the node does not exist: errors.NewNotFound(v1.Resource("node"), name).
You can check the error type to determine whether the failure is because the node does not exist.

@verult verult force-pushed the cxing/NodeStatusUpdaterFix branch from 8bc9fee to f41c953 Compare May 17, 2017 02:42
@verult
Contributor Author

verult commented May 19, 2017

@k8s-bot ok to test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2017
@k8s-ci-robot
Contributor

@verult: you can't request testing unless you are a kubernetes member.

In response to this:

@k8s-bot ok to test

@msau42
Member

msau42 commented May 19, 2017

@k8s-bot ok to test

@msau42
Member

msau42 commented May 19, 2017

/assign @saad-ali @jingxu97

@jingxu97
Contributor

@k8s-bot ok to test

@jingxu97
Contributor

/approve

Member

@saad-ali saad-ali left a comment

A couple of comments.

@@ -64,14 +66,19 @@ func (nsu *nodeStatusUpdater) UpdateNodeStatuses() error {
nodesToUpdate := nsu.actualStateOfWorld.GetVolumesToReportAttached()
for nodeName, attachedVolumes := range nodesToUpdate {
nodeObj, err := nsu.nodeLister.Get(string(nodeName))
if nodeObj == nil || err != nil {
statusErr, isStatusError := err.(*errors.StatusError)
if isStatusError && statusErr.Status().Reason == metav1.StatusReasonNotFound {
// If node does not exist, its status cannot be updated, log error and
// reset flag statusUpdateNeeded back to true to indicate this node status
// needs to be updated again
Member

nit: Update this comment.


// Removes the given node from the record of attach updates. The node's entire
// volumesToReportAsAttached list is removed.
RemoveNodeFromAttachUpdates(nodeName types.NodeName) error
Member

Since this operation does not apply to both attachdetach and volumemanager actual_state_of_the_world, just add it to the ASW interface in controller/volume/attachdetach/cache/actual_state_of_world.go instead. Then you don't need to put a noop version of the method in volumemanager/cache/actual_state_of_world.go
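
As a rough illustration of what a method with this signature might do, the sketch below keeps a per-node list of volumes to report and deletes the whole entry for a node. It assumes a simplified cache: `attachUpdates`, `newAttachUpdates`, and `add` are hypothetical names, `NodeName` and `VolumeName` are local stand-ins for the real Kubernetes types, and returning an error for an unknown node is an assumption rather than the PR's confirmed behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeName and VolumeName are simplified stand-ins for the real Kubernetes types.
type NodeName string
type VolumeName string

// attachUpdates is a hypothetical, simplified record of the volumes that
// still need to be reported as attached, keyed by node.
type attachUpdates struct {
	mu                        sync.Mutex
	volumesToReportAsAttached map[NodeName][]VolumeName
}

func newAttachUpdates() *attachUpdates {
	return &attachUpdates{volumesToReportAsAttached: make(map[NodeName][]VolumeName)}
}

// add records one volume to report for the given node (helper for the demo).
func (a *attachUpdates) add(node NodeName, vol VolumeName) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.volumesToReportAsAttached[node] = append(a.volumesToReportAsAttached[node], vol)
}

// RemoveNodeFromAttachUpdates drops the node's entire list of volumes to
// report. Returning an error for an unknown node is an assumption of this sketch.
func (a *attachUpdates) RemoveNodeFromAttachUpdates(nodeName NodeName) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if _, ok := a.volumesToReportAsAttached[nodeName]; !ok {
		return fmt.Errorf("node %q has no attach updates recorded", nodeName)
	}
	delete(a.volumesToReportAsAttached, nodeName)
	return nil
}

func main() {
	c := newAttachUpdates()
	c.add("node-a", "vol-1")
	c.add("node-a", "vol-2")
	// The first call removes the entry; the second finds nothing to remove.
	fmt.Println(c.RemoveNodeFromAttachUpdates("node-a"))
	fmt.Println(c.RemoveNodeFromAttachUpdates("node-a"))
}
```

Once the entry is gone, a loop over the pending updates no longer visits the deleted node, which is what stops the repeated status-update attempts.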

@saad-ali
Member

Make sure to keep the 1.5 version in sync with any changes in this PR as well

@@ -64,14 +66,19 @@ func (nsu *nodeStatusUpdater) UpdateNodeStatuses() error {
nodesToUpdate := nsu.actualStateOfWorld.GetVolumesToReportAttached()
for nodeName, attachedVolumes := range nodesToUpdate {
nodeObj, err := nsu.nodeLister.Get(string(nodeName))
if nodeObj == nil || err != nil {
statusErr, isStatusError := err.(*errors.StatusError)
if isStatusError && statusErr.Status().Reason == metav1.StatusReasonNotFound {
Contributor

I think you can use errors.IsNotFound() to check this.

@verult verult force-pushed the cxing/NodeStatusUpdaterFix branch from 5c28946 to 455dba2 Compare May 24, 2017 21:08
@saad-ali
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2017
@saad-ali saad-ali added this to the v1.7 milestone May 24, 2017
@saad-ali
Member

Make sure to fix the release note in your first comment on this page or change the release note label to no release note.

@verult verult force-pushed the cxing/NodeStatusUpdaterFix branch from 455dba2 to 0a0d758 Compare May 24, 2017 22:32
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 24, 2017
@verult
Contributor Author

verult commented May 24, 2017

/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 24, 2017
@verult
Contributor Author

verult commented May 25, 2017

@k8s-bot pull-kubernetes-kubemark-e2e-gce test this

… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
@verult verult force-pushed the cxing/NodeStatusUpdaterFix branch from 0a0d758 to f9dc2d5 Compare May 25, 2017 01:32
@k8s-ci-robot
Contributor

k8s-ci-robot commented May 25, 2017

@verult: The following test(s) failed:

Test name Commit Details Rerun command
pull-kubernetes-federation-e2e-gce f9dc2d5 link @k8s-bot pull-kubernetes-federation-e2e-gce test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@jingxu97
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 26, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingxu97, saad-ali, verult

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 46383, 45645, 45923, 44884, 46294)

@k8s-github-robot k8s-github-robot merged commit c34b359 into kubernetes:master May 26, 2017
@saad-ali saad-ali modified the milestones: v1.6, v1.7 May 26, 2017
@enisoc enisoc added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 30, 2017
k8s-github-robot pushed a commit that referenced this pull request Jun 12, 2017
…-upstream-release-1.6

Automatic merge from submit-queue

Automated cherry pick of #45923

Cherry pick of #45923 on release-1.6.

#45923: Node status updater now deletes the node entry in attach
@k8s-cherrypick-bot

Commit found in the "release-1.6" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

k8s-github-robot pushed a commit that referenced this pull request Jun 21, 2017
Automatic merge from submit-queue

Node status updater now deletes the node entry in attach updates when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.



**What this PR does / why we need it**: Fixes issue of unnecessary node status updates when node is deleted.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #42438

**Special notes for your reviewer**: v1.5 version of the fix addressed by PR #45923. This is necessary because NodeLister did not exist prior to 1.6, thus node status updater requires a slightly different node existence check.

**Release note**:

```release-note
NONE
```
@verult verult deleted the cxing/NodeStatusUpdaterFix branch March 24, 2018 01:13
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

kube-controller-manager spamming errors in log
10 participants