-
Notifications
You must be signed in to change notification settings - Fork 39.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dangling attach errors #55491
Fix dangling attach errors #55491
Conversation
/sig aws |
cc @kubernetes/sig-storage-pr-reviews |
pkg/volume/aws_ebs/attacher.go
Outdated
// Volume is already detached from node. | ||
glog.Infof("detach operation was successful. volume %q is already detached from node %q.", volumeID, nodeName) | ||
return nil | ||
// it means either node has been deleted or volume is not attached to node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DiskIsAttached will return not attached without error only if cloudprovider.InstanceNotFound returns true. If the node is deleted from cloudprovider, doesn't mean volume should be automatically detached?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no. in AWS even nodes that are shutdown/stopped are removed from node list and volumes are not detached from such nodes.
83ddecc
to
e9944f8
Compare
/test pull-kubernetes-bazel-test |
e9944f8
to
f9a876a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have couple of nits, however the PR looks good to me in general.
@@ -1718,6 +1718,10 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName, | |||
if !alreadyAttached { | |||
available, err := c.checkIfAvailable(disk, "attaching", awsInstance.awsID) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why empty line?
pkg/volume/util/error.go
Outdated
@@ -0,0 +1,41 @@ | |||
/* | |||
Copyright 2016 The Kubernetes Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wrong year
f9a876a
to
a65c488
Compare
@jingxu97 ping again on this. let me know what you think? |
@jsafrane can you ptal, I addressed your nits. |
/assign |
/lgtm |
/lgtm |
@deads2k, one tiny approval please, we had to adopt an unit test in pkg/admission to a new interface. |
/approve |
I understand the main changes are for AWS. Will this change affect other volume plugins? |
@jingxu97 not until they also start implementing |
@@ -1766,12 +1769,41 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName, | |||
} | |||
|
|||
// DetachDisk implements Volumes.DetachDisk | |||
func (c *Cloud) DetachDisk(diskName KubernetesVolumeID, nodeName types.NodeName) (string, error) { | |||
func (c *Cloud) DetachDisk(diskName KubernetesVolumeID, nodeName types.NodeName, nodeExists bool) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not very clear for me to have nodeExists here. The function is to Detach disk from a node, I feel the logic of checking node exist or not should just inside of this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it can be certainly changed. will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the main reason this additional boolean field has been introduced is because - DiskIsAttached
check returns false
(as in disk is detached) incorrectly when node is shutdown and volume is still attached to the node. I am trying to fix that behaviour here.
The reason DiskIsAsAttached
returns false
even though Disk is still attached to a stopped node is because, in AWS a stopped node is not returned when fetching instance list, so AWS cloudprovider throws "Node Not Found error", whereas node is still there and volumes are still attached to it.
I am trying to fix this, without changing too much of internal of AWS cloudprovider's code. DiskIsAttached
appears to be a standard function that is being used by all volume types. I am not 100% sure, if it makes sense to roll this into DetachDisk
just for aws.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main purpose of this fix is that when reconciler tries to attach a disk to a node, but somehow the disk is already attached to another node (e.g., attaching operation timeout but eventually attached).
Here it seems like you are also trying to fix another issue for DiskIsAttached. Could you make a separate PR for that so we can discuss more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that is correct. But I think both are related. We couldn't have fixed detaching volumes from stopped nodes if We didn't had Dangling volume detection code.
If you can explain what is bothering you with the fix, may be I will try to workaround or address that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like DiskIsAsAttached() is not implemented correctly for AWS when node could not be found by cloud provider, so I feel instead of adding the bool value, we should fix that function instead.
Dangling volume detection is for trigger detach if volume is attached to a different node. But detaching volume from a stopped node does not rely on this detection? When a node is stopped, and removed from the cloud provider, reconciler will trigger detach, but detach will do nothing since it though the volume is already detached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct that DiskIsAsAttached
does not do return correct value. But fixing it alone will not fix Detach from stopped nodes. For example, even if we called DetachDisk
then again https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1775 code will make DetachDisk
return again without doing actual detach.
So at minimum I think we need to fix both DiskIsAttached
as well as DetachDisk
to detach disks correctly from stopped nodes.
But detaching volume from a stopped node does not rely on this detection? When a node is stopped, and removed from the cloud provider, reconciler will trigger detach, but detach will do nothing since it though the volume is already detached.
That is correct, but I was thinking also about the case of node getting stopped and c-m restarted or active master switched in HA environment. Both of those cases will cause node information to be lost and hence detach may not be attempted. So there are like 2 bugs present here. :(
I am thinking lets fix both DetachDisk
and DiskIsAttached
functions, so as they can correctly handle stopped instances. This PR sort of fixes DetachDisk
but leaves DiskIsAttached
broken... I will fix that too and update the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer, I can fix DetachDisk
and DiskIsAttached
in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having two PRs is better so it is easier to explain what we are trying to fix.
a65c488
to
884e917
Compare
884e917
to
0aaf189
Compare
Detach volumes from shutdown nodes and ensure that dangling volumes are handled correctly in AWS
0aaf189
to
5297c14
Compare
/lgtm |
/approve Some nits that are more suggestions |
@@ -1717,6 +1717,9 @@ func (c *Cloud) AttachDisk(diskName KubernetesVolumeID, nodeName types.NodeName, | |||
|
|||
if !alreadyAttached { | |||
available, err := c.checkIfAvailable(disk, "attaching", awsInstance.awsID) | |||
if err != nil { | |||
glog.Error(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe glog.Errorf("error checking if volume available: %v", err)
|
||
// This error on attach indicates volume is attached to a different node | ||
// than we expected. | ||
type DanglingAttachError struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming nit: AlreadyAttachedError (?)
@@ -267,6 +267,18 @@ func (og *operationGenerator) GenerateAttachVolumeFunc( | |||
volumeToAttach.VolumeSpec, volumeToAttach.NodeName) | |||
|
|||
if attachErr != nil { | |||
if derr, ok := attachErr.(*util.DanglingAttachError); ok { | |||
addErr := actualStateOfWorld.MarkVolumeAsAttached( | |||
v1.UniqueVolumeName(""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the empty volume name going to cause a problem... can we have a comment as to why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah - the if volume spec is specified in MarkVolumeAsAttached
then unique name is picked from volume spec rather than from first parameter. I am just filling out first parameter to satisfy function contract.
@@ -267,6 +267,18 @@ func (og *operationGenerator) GenerateAttachVolumeFunc( | |||
volumeToAttach.VolumeSpec, volumeToAttach.NodeName) | |||
|
|||
if attachErr != nil { | |||
if derr, ok := attachErr.(*util.DanglingAttachError); ok { | |||
addErr := actualStateOfWorld.MarkVolumeAsAttached( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we glog.Info
(or warning) when we're doing this ... it's unusual enough that we want to know in the logs I think
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, gnufied, jsafrane, justinsb Associated issue: 52573 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Automatic merge from submit-queue (batch tested with PRs 55392, 55491, 51914, 55831, 55836). If you want to cherry-pick this change to another branch, please follow the instructions here. |
It is merged already. But could you address @justinsb's comments in another PR? I think logging when this error happens is important. |
@jingxu97 yes I agree. I will do that in a follow up commit shortly. |
Detach volumes from shutdown nodes and ensure that
dangling volumes are handled correctly in AWS
Fixes #52573