Boskos server fails to claim PV after restart/reschedule from the node #4571

Closed
krzyzacy opened this issue Sep 14, 2017 · 22 comments

@krzyzacy
Member

krzyzacy commented Sep 14, 2017

/area boskos
/assign

Events:
  FirstSeen	LastSeen	Count	From						SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----						-------------	--------	------		-------
  26m		26m		1	default-scheduler						Normal		Scheduled	Successfully assigned boskos-3859550540-jlgcx to gke-prow-default-pool-42819f20-28z7
  24m		23s		12	kubelet, gke-prow-default-pool-42819f20-28z7			Warning		FailedMount	Unable to mount volumes for pod "boskos-3859550540-jlgcx_test-pods(a9392393-9983-11e7-ad2e-42010a8000c4)": timeout expired waiting for volumes to attach/mount for pod "test-pods"/"boskos-3859550540-jlgcx". list of unattached/unmounted volumes=[boskos-volume]
  24m		23s		12	kubelet, gke-prow-default-pool-42819f20-28z7			Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "test-pods"/"boskos-3859550540-jlgcx". list of unattached/unmounted volumes=[boskos-volume]

Redeploying fixes it, but we still want to fix the actual issue.

@k8s-ci-robot added the area/boskos label Sep 14, 2017
@0xmichalis
Contributor

0xmichalis commented Sep 14, 2017

The event is cryptic. @kubernetes/sig-storage-misc how can we debug this kind of issue?

@msau42
Member

msau42 commented Sep 14, 2017

You need to look for the volume in kubelet.log

@krzyzacy
Member Author

(hmm, seems it just happened again; cc @jingxu97, who was investigating offline to try to find some clues)

@krzyzacy
Member Author

@msau42 I think it's something else - this time it took 6h and it's still not ready - let's get the StatefulSet conversion done asap

@jingxu97
Copy link
Contributor

jingxu97 commented Sep 15, 2017

From the log, I see the following sequence of events:

  1. Replica_set created a new pod boskos-3859550540-j5cgh
I0914 21:10:07.400347       6 replica_set.go:455] Too few "test-pods"/"boskos-3859550540" replicas, need 1, creating 1
I0914 21:10:07.405480       6 deployment_controller.go:485] Error syncing deployment test-pods/boskos: Operation cannot be fulfilled on deployments.extensions "boskos": the object has been modified; please apply your changes to the latest version and try again
I0914 21:10:07.410757       6 event.go:217] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"test-pods", Name:"boskos-3859550540", UID:"15d8dd6f-9991-11e7-ad2e-42010a8000c4", APIVersion:"extensions", ResourceVersion:"56666044", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: boskos-3859550540-j5cgh
I0914 21:10:07.439993       6 deployment_controller.go:485] Error syncing deployment test-pods/boskos: Operation cannot be fulfilled on deployments.extensions "boskos": the object has been modified; please apply your changes to the latest version and try again
  2. Volume is attached to node gke-prow-default-pool-42819f20-pbpz
I0914 21:10:37.339888       6 reconciler.go:208] Started AttachVolume for volume "kubernetes.io/gce-pd/gke-prow-f9e60086-dyna-pvc-371f91ee-6754-11e7-9a57-42010a800166" to node "gke-prow-default-pool-42819f20-pbpz"
I0914 21:11:08.206682       6 operation_generator.go:290] AttachVolume.Attach succeeded for volume "kubernetes.io/gce-pd/gke-prow-f9e60086-dyna-pvc-371f91ee-6754-11e7-9a57-42010a800166" (spec.Name: "pvc-371f91ee-6754-11e7-9a57-42010a800166") from node "gke-prow-default-pool-42819f20-pbpz".
  3. Some time later, another pod for the replica_set is created, but the previous one never gets deleted.
I0914 22:11:06.827792       6 replica_set.go:455] Too few "test-pods"/"boskos-3859550540" replicas, need 1, creating 1
I0914 22:11:06.835592       6 event.go:217] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"test-pods", Name:"boskos-3859550540", UID:"15d8dd6f-9991-11e7-ad2e-42010a8000c4", APIVersion:"extensions", ResourceVersion:"56666631", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: boskos-3859550540-gkkvd
  4. Attaching the volume failed because it was still attached to the first pod's node
I0914 22:11:06.917126       6 reconciler.go:208] Started AttachVolume for volume "kubernetes.io/gce-pd/gke-prow-f9e60086-dyna-pvc-371f91ee-6754-11e7-9a57-42010a800166" to node "gke-prow-default-pool-42819f20-96n1"
E0914 22:11:10.793582       6 gce.go:505] GCE operation failed: googleapi: Error 400: The disk resource 'projects/k8s-prow-builds/zones/us-central1-f/disks/gke-prow-f9e60086-dyna-pvc-371f91ee-6754-11e7-9a57-42010a800166' is already being used by 'projects/k8s-prow-builds/zones/us-central1-f/instances/gke-prow-default-pool-42819f20-pbpz'

Between the first pod and the second pod there are lots of node-failure error messages; I'm not sure what happened to the tests and whether this is normal.
Also, it seems like the replica_set controller didn't realize it already had a pod running, and started a second one at some point.

@krzyzacy
Member Author

@foxish ^^ more findings from @jingxu97 - any thoughts? Could this be a ReplicaSet problem, or is that expected when a node is under disk pressure?

  • For this issue itself, we are now living in the wonderful world of StatefulSet, thanks @msau42!

@foxish
Contributor

foxish commented Sep 15, 2017

Hmm, my guess for why this happened is - since 1.5 we changed the node controller to not automatically clean up pods on unhealthy nodes. As part of that, cleanup will happen only when the nodes are deleted from the API. So, when the node goes unready/fails, the pods go into an "unknown" state and stay stuck with the mounted volume. The controller knows to create replacements but they cannot mount the volume. The fix would be - detecting an unhealthy node and simply deleting it from the API (which GKE should do automagically).
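
For illustration, a minimal client-go sketch of that remediation (assuming the pre-context, 2017-era client-go signatures; the kubeconfig path is a placeholder and the node name is taken from the events above):

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from a local kubeconfig (placeholder path).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Deleting the Node object from the API is what allows its pods to be
	// cleaned up, which in turn lets the attach/detach controller release
	// the GCE PD for the replacement pod.
	nodeName := "gke-prow-default-pool-42819f20-pbpz" // node from the events above
	if err := client.CoreV1().Nodes().Delete(nodeName, &metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Printf("deleted node %s\n", nodeName)
}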

@saad-ali
Member

@foxish that makes sense, the detach code does not kick in unless the pod is in a failed or succeeded state, or is deleted. So if it goes into an unknown state the volume will remain attached (the safe behavior).

The fix would be - detecting an unhealthy node and simply deleting it from the API (which GKE should do automagically).

Why didn't that happen in @krzyzacy's case?

@jingxu97
Contributor

For the replica_set controller, will it not delete the unknown-state pod before creating a new pod? How does the controller count the number of pods in the replica_set - does it not count the unknown-status pod?

@krzyzacy
Member Author

we have GKE node-auto-repair on, and that should auto-fix any unhealthy nodes?

@foxish
Contributor

foxish commented Sep 15, 2017

For the replica_set controller, will it not delete the unknown-state pod before creating a new pod? How does the controller count the number of pods in the replica_set - does it not count the unknown-status pod?

Yeah, the pod will get the deletionTimeStamp set and all the workload controllers know to ignore such "to-be-deleted" pods when accounting for running pods.
Does the attach-detach controller react to that?
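
As a rough sketch of that accounting (a paraphrase of the IsPodActive-style check in the workload controllers, not the exact source):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isPodActive sketches how workload controllers decide whether a pod still
// counts toward the desired replica count: terminal pods and pods already
// marked for deletion are ignored, so a replacement is created right away.
func isPodActive(p *v1.Pod) bool {
	return p.Status.Phase != v1.PodSucceeded &&
		p.Status.Phase != v1.PodFailed &&
		p.DeletionTimestamp == nil
}

func main() {
	now := metav1.Now()
	doomed := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "boskos-3859550540-jlgcx", DeletionTimestamp: &now},
		Status:     v1.PodStatus{Phase: v1.PodRunning},
	}
	// false: the ReplicaSet no longer counts this pod and creates a replacement,
	// even though its volume may still be attached to the old node.
	fmt.Println(isPodActive(doomed))
}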

@jingxu97
Contributor

We don't detach the volume from a "to-be-deleted" pod. Only after the pod is deleted from the API server will the controller detach.

@jingxu97
Contributor

@saad-ali Based on @foxish's description, I think there is a mismatch between the replica_set controller and the attach_detach controller.

If a pod is in an "unknown" state, the replica_set controller will not count it as a running pod and will start a new one. But the attach_detach controller will not detach until the "unknown" pod is deleted from the API server. If no garbage collection happens for this "unknown" pod for a long time, the new pod will never get the volume attached to the new node.

But I assume that after the deletionTimestamp is set, the pod should soon be deleted from the API server. Am I missing anything here?

cc @yujuhong

@saad-ali
Member

Attach/detach controller does react to pod deletion events: if the pod's deletionTimeStamp is set and containers are terminated (see https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/volumehelper/volumehelper.go#L103) the attach/detach controller will trigger the detach operation.

However, it does not trigger detach if the pod state is "Unknown", because we don't want to detach and potentially corrupt user data when the pod state is unknown. We depend on some outside entity (node repair tool?) stepping in and deleting these pods. The question is: if @krzyzacy has node-auto-repair on, why didn't it delete the pods?
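
As a sketch of the condition described above (a paraphrase of the check saad-ali links, not the exact source):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// detachAllowed paraphrases the condition described above: detach is only
// triggered once the pod is marked for deletion and none of its containers
// are still running. A pod whose phase is merely "Unknown" never satisfies
// this, so its volume stays attached (the safe behavior).
func detachAllowed(pod *v1.Pod) bool {
	if pod.DeletionTimestamp == nil {
		return false
	}
	for _, s := range pod.Status.ContainerStatuses {
		if s.State.Running != nil {
			return false
		}
	}
	return true
}

func main() {
	unknown := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodUnknown}}
	fmt.Println(detachAllowed(unknown)) // false: the volume remains attached
}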

@jingxu97
Contributor

I checked the log again. I think there are a number of cases where the node is not marked as NotReady, and there is no record of the pod status. But somehow the replica_set controller starts a new pod because the count of running pods is no longer 1. The new pod cannot attach the volume because it is still attached to a different node. Eventually the old pod is deleted, but that's because @krzyzacy manually deleted the deployment.

@0xmichalis
Contributor

Yeah, PVs and ReplicaSets/Deployments don't play well together. @krzyzacy can we close this issue now that boskos has moved to a StatefulSet?

@krzyzacy
Member Author

yeah I'll close the issue, but we can keep the discussion going :-)

@jingxu97
Contributor

@Kargakis But it seems like, for some reason, the replica_set controller tries to create a second pod even though the node status does not change to "NotReady". I am not sure about the pod status; might need to check the kubelet log.

@0xmichalis
Contributor

The ReplicaSet controller will create a replacement as soon as the previous pod is marked for deletion. RC/RS/D are optimized for availability.

@jingxu97
Contributor

I got more information from the kubelet log.

The events happened as follows:

  1. The replica_set creates pod1, and the volume is attached to node 1.
  2. Node 1 is under pressure and the eviction manager evicts pod1, but there is no pod-delete event from the API server, so the attach_detach controller does not know the pod has been evicted.
  3. The replica_set creates a second pod, pod2, but attaching the volume fails.

The following log is from the kubelet during eviction; note this log is in reverse order.

Sep 15 16:41:22 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: E0915 16:41:22.410891    1387 kubelet_pods.go:920] Failed killing the pod "boskos-3859550540-413kp": failed to "KillContainer" for "boskos" with KillContainerError: "rpc error: code = 2 desc = Error: No such container: 9a6fe5d52189226a97507b6a11cb36034c44d7ad59f5b16051ec3d034e9d292a"
Sep 15 16:41:22 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: W0915 16:41:22.398442    1387 docker_sandbox.go:263] NetworkPlugin kubenet failed on the status hook for pod "boskos-3859550540-413kp_test-pods": Cannot find the network namespace, skipping pod network status for container {"docker" "4fa0cac96a22723c21c5ea0cbf7aed43034d1a8e17c98848a6c247c11f401839"}
Sep 15 16:41:22 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: E0915 16:41:22.394223    1387 remote_runtime.go:272] ContainerStatus "9a6fe5d52189226a97507b6a11cb36034c44d7ad59f5b16051ec3d034e9d292a" from runtime service failed: rpc error: code = 2 desc = Error: No such container: 9a6fe5d52189226a97507b6a11cb36034c44d7ad59f5b16051ec3d034e9d292a
Sep 15 16:41:22 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: I0915 16:41:22.392711    1387 kubelet_pods.go:917] Killing unwanted pod "boskos-3859550540-413kp"
Sep 15 16:41:22 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: W0915 16:41:22.215676    1387 pod_container_deletor.go:77] Container "4fa0cac96a22723c21c5ea0cbf7aed43034d1a8e17c98848a6c247c11f401839" not found in pod's containers
Sep 15 16:41:22 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: I0915 16:41:22.215620    1387 kubelet.go:1842] SyncLoop (PLEG): "boskos-3859550540-413kp_test-pods(8f070f7d-9a30-11e7-ad2e-42010a8000c4)", event: &pleg.PodLifecycleEvent{ID:"8f070f7d-9a30-11e7-ad2e-42010a8000c4", Type:"ContainerDied", Data:"4fa0cac96a22723c21c5ea0cbf7aed43034d1a8e17c98848a6c247c11f401839"}
Sep 15 16:41:22 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: W0915 16:41:22.214899    1387 docker_sandbox.go:263] NetworkPlugin kubenet failed on the status hook for pod "boskos-3859550540-413kp_test-pods": Cannot find the network namespace, skipping pod network status for container {"docker" "4fa0cac96a22723c21c5ea0cbf7aed43034d1a8e17c98848a6c247c11f401839"}
Sep 15 16:41:21 gke-prow-default-pool-42819f20-z1v4 kubelet[1387]: I0915 16:41:21.258784    1387 eviction_manager.go:362] eviction manager: pod boskos-3859550540-413kp_test-pods(8f070f7d-9a30-11e7-ad2e-42010a8000c4) evicted successfully

So the problem is that after the pod is evicted, the pod is not deleted from the API server (I think the pod is in an evicted state). The attach_detach controller does not receive a pod-delete event from the API server, so it will not detach.
@dashpole, should the pod be deleted from the API server or not when it is evicted?

@dashpole
Contributor

It should not be deleted from the API server when it is evicted. This allows introspection of the pod after it is evicted.

I believe the correct behavior here is for the attach_detach_controller to identify that Pod.Status.Phase == Failed or Succeeded. The only part I am not sure of is whether there is a requirement for the volume to be unmounted before it is detached. In that case it may be trickier for the attach_detach_controller to identify that the volume has been unmounted.
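
For illustration, a sketch of the phase-based check being suggested, applied to what the eviction manager leaves behind (the status values shown are representative, not copied from this cluster):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// An evicted pod stays in the API server (so it can be inspected) in a
	// terminal phase rather than being deleted.
	evicted := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "boskos-3859550540-413kp", Namespace: "test-pods"},
		Status: v1.PodStatus{
			Phase:  v1.PodFailed,
			Reason: "Evicted",
		},
	}

	// Keying detach off the terminal phase, instead of waiting for a delete
	// event that never comes, is what would let the volume be released here.
	terminal := evicted.Status.Phase == v1.PodFailed || evicted.Status.Phase == v1.PodSucceeded
	fmt.Println(terminal) // true
}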

@jingxu97
Contributor

jingxu97 commented Sep 18, 2017

I just recalled that we have already changed the attach_detach controller to handle terminated pods; please see kubernetes/kubernetes#45286. The PR was merged on May 11 and is included in releases after 1.7.0.
This test infra runs Kubernetes 1.6.9, which does not include the fix. Because this is a breaking change, it was not backported, but new versions should not have this issue anymore.
