Provide different running modes for node-driver-registrar, add a run mode to detect if the kubelet plugin registration failed #152

Merged

Conversation

mauriciopoppe
Member

@mauriciopoppe mauriciopoppe commented Jun 23, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:

Adds running modes. There's a new kubelet-plugin-exec running mode that checks whether the kubelet plugin registration succeeded; please check README.md for an example of how it's used.

Which issue(s) this PR fixes:

Workaround for #143; unfortunately, the underlying issue isn't fixed yet.

Does this PR introduce a user-facing change?:

New running modes: the kubelet-registration-probe mode checks whether the node-driver-registrar kubelet plugin registration succeeded.

/cc @jingxu97 @msau42
cc @lizhuqi @andyzhangx

@k8s-ci-robot k8s-ci-robot requested review from jingxu97 and msau42 June 23, 2021 06:15
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 23, 2021
Member

@andyzhangx andyzhangx left a comment

/lgtm
thanks, that's really a nice and clean fix.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2021
// get a callback through GetInfo, as a workaround if we don't get a callback within
// the next 10 seconds we'll restart
go func() {
err := wait.PollImmediate(100*time.Millisecond, 10*time.Second, func() (bool, error) {
Member Author

I could adjust the timeout based on how soon we get a callback. The timing in my dev cluster was:

  • linux: around 2s
  • windows: around 1.7s

Collaborator

In general, I think it's preferable to use a readiness probe so that the behavior and thresholds can be configured by users.

However, HTTP probes don't work well if the pod needs to run on the host network (which most CSI node plugins have to do). An exec probe is better in that case.

Member Author

Good to know. Another idea that I posted on #143 was to use probes, but I didn't know what the probe should check. On successful registration we could touch a file whose existence an exec probe could check, but cleaning up that file could create more problems. I could also connect to the API server to ask whether this component is registered, but that would probably need RBAC rules and so on.

Another idea was to add a CLI flag for the timeout. Because the state being checked is only local, I believe that checking every 100ms would be fine. There are already two flags that control timeouts (one deprecated), and we could add another one for the kubelet registration timeout:

	connectionTimeout       = flag.Duration("connection-timeout", 0, "The --connection-timeout flag is deprecated")
	operationTimeout        = flag.Duration("timeout", time.Second, "Timeout for waiting for communication with driver")

	// new flag
	kubeletRegistrationTimeout        = flag.Duration("kubelet-registration-timeout", 5*time.Second, "Timeout for waiting kubelet registration")
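
As a rough illustration of how such a timeout could drive the poll loop, here is a minimal sketch that uses only the standard library instead of wait.PollImmediate. The names waitForRegistration, errRegistrationTimeout and gotCallback are hypothetical and not part of this PR; the interval and timeout would come from flags like the one proposed above.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// errRegistrationTimeout is a hypothetical sentinel error returned when no
// callback arrives before the timeout.
var errRegistrationTimeout = errors.New("timed out waiting for kubelet registration")

// waitForRegistration calls registered() immediately and then once per
// interval until it returns true or the timeout has elapsed.
func waitForRegistration(registered func() bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if registered() {
			return nil
		}
		if time.Now().After(deadline) {
			return errRegistrationTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	// gotCallback stands in for the state that a GetInfo callback would set.
	var gotCallback atomic.Bool
	go func() {
		time.Sleep(20 * time.Millisecond)
		gotCallback.Store(true)
	}()

	err := waitForRegistration(gotCallback.Load, 5*time.Millisecond, time.Second)
	fmt.Println("registration error:", err)
}
```

Because the checked state is local to the process, a short interval is cheap; the caller decides what to do on timeout (for example, exit so the container is restarted).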

Collaborator

Can the file be put into an emptyDir volume? That will get cleaned up when the pod gets deleted.

Contributor

Just want to understand: comparing the current implementation with the "touch file" approach, what are the pros and cons?
The current one seems simpler?

Member Author

I think that the current implementation is simpler and self-contained (no additional setup in the YAML). However, as Michelle said, it lacks a way to control the timeout, which we could also add with a CLI flag.

For the touch file approach: in an experiment I deployed this component (which timed out after 10s) together with the GCE PD CSI driver and, if I remember correctly, only this container was restarted. I'm going to deploy the version that crashed the container again and see whether the pods get restarted.

Collaborator

It's three things: the timeout, the polling frequency, and whether or not you even want to turn on the behavior.

Member Author

I checked that only the node-driver-registrar container is restarted:

[two screenshots]

Also the docs say:

Note: A container crashing does not remove a Pod from a node. The data in an emptyDir volume is safe across container crashes.

The touch file approach would be something like this:

func (e registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	klog.Infof("Received GetInfo call: %+v", req)
	// TODO: touch file /kubelet-registration-ack/ack
	return &registerapi.PluginInfo{
		Type: registerapi.CSIPlugin,
		Name: e.driverName,
	}, nil
}

func main() {
	// TODO: add trap to delete the temp file
}

        - name: csi-driver-registrar
          image: gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary
          imagePullPolicy: Always
          args:
            - --v=5
            - --csi-address=unix://C:\\csi\\csi.sock
            - --kubelet-registration-path=C:\\var\\lib\\kubelet\\plugins\\pd.csi.storage.gke.io\\csi.sock
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: kubelet-registration-ack-dir
              mountPath: /kubelet-registration-ack
          livenessProbe:
            exec:
              command:
              - cat
              - /kubelet-registration-ack/ack
            initialDelaySeconds: 1
            periodSeconds: 1

A problem that I see: if node-driver-registrar gets forcefully restarted without cleaning up the temp file, then the next time it comes up the probe could consider it alive because the file is there, even though it might not have completed the registration process correctly.

Collaborator

Can you delete the file on startup?

One more challenge is that the node-driver-registrar container doesn't have a shell. So it may need to be a command line option to do the readiness check.
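
A minimal sketch of that file lifecycle, assuming the ack file lives in a directory that only this container writes to. The helper names (removeStaleAck, touchAck, ackExists, lifecycle) are hypothetical; the sketch only shows the order of operations: delete any stale file on startup, create the file when the kubelet calls back, and have the check test for existence without needing a shell.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// removeStaleAck deletes a leftover ack file; a missing file is not an error.
func removeStaleAck(path string) error {
	err := os.Remove(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil
	}
	return err
}

// touchAck creates an empty ack file, truncating it if it already exists.
func touchAck(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	return f.Close()
}

// ackExists is what the exec check would evaluate.
func ackExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// lifecycle replays the sequence in a temp dir and reports what the check
// would see at each step.
func lifecycle() (stale, afterStartup, afterCallback bool, err error) {
	dir, err := os.MkdirTemp("", "registration-ack")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "ack")

	// A file left behind by a forcefully restarted container.
	if err = touchAck(path); err != nil {
		return
	}
	stale = ackExists(path)

	// Startup: drop the stale file before registering again.
	if err = removeStaleAck(path); err != nil {
		return
	}
	afterStartup = ackExists(path)

	// GetInfo callback received: registration is acknowledged.
	if err = touchAck(path); err != nil {
		return
	}
	afterCallback = ackExists(path)
	return
}

func main() {
	stale, afterStartup, afterCallback, err := lifecycle()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(stale, afterStartup, afterCallback)
}
```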

Member Author

That approach sounds good. I thought about creating the file with the name of the driver, e.g.:

          args:
            - --v=5
            - --csi-address=unix://C:\\csi\\csi.sock
            - --kubelet-registration-ack-path=C:\\registration\\pd.csi.storage.gke.io-kubelet-registration-ack
            - --kubelet-registration-path=C:\\var\\lib\\kubelet\\plugins\\pd.csi.storage.gke.io\\csi.sock
          livenessProbe:
            exec:
              command:
              - /node-driver-registrar
              - --kubelet-registration-check-ack=C:\\registration\\pd.csi.storage.gke.io-kubelet-registration-ack
            initialDelaySeconds: 1
            periodSeconds: 1

There would be 2 additional flags:

  • --kubelet-registration-ack-path, which tells this component where to write the lock file once it gets a callback from the kubelet
  • --kubelet-registration-check-ack, set in the probe command to the same value as --kubelet-registration-ack-path

The name of the driver could be fetched from the driver itself. However, that makes a forced restart difficult to handle: we have to delete the file on startup, but we don't know which file to delete until the gRPC request to the driver is made. That's why I opted for the two-flag approach.
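
A minimal sketch of how the two flags could be dispatched inside one binary, so that the probe needs no shell. The flag names are the ones proposed above; the run function, its exit codes and the placeholder for the normal path are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// run parses args and returns a process exit code.
func run(args []string) int {
	fs := flag.NewFlagSet("node-driver-registrar", flag.ContinueOnError)
	ackPath := fs.String("kubelet-registration-ack-path", "",
		"file created after the kubelet calls back")
	checkAck := fs.String("kubelet-registration-check-ack", "",
		"if set, only check that this file exists and exit")
	if err := fs.Parse(args); err != nil {
		return 2
	}

	if *checkAck != "" {
		// Probe invocation: succeed only if the ack file is present.
		if _, err := os.Stat(*checkAck); err != nil {
			fmt.Fprintf(os.Stderr, "path=%s doesn't exist, the kubelet plugin registration hasn't succeeded yet\n", *checkAck)
			return 1
		}
		return 0
	}

	// Normal invocation: the real component would start the registration
	// server here and write *ackPath when GetInfo is called.
	fmt.Printf("would register and write ack file %q\n", *ackPath)
	return 0
}

func main() {
	if code := run(os.Args[1:]); code != 0 {
		os.Exit(code)
	}
}
```

A non-zero exit code from the probe invocation is what the kubelet treats as a failed exec probe.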

@lizhuqi

lizhuqi commented Jun 23, 2021

/lgtm

@k8s-ci-robot
Contributor

@lizhuqi: changing LGTM is restricted to collaborators

@@ -63,6 +65,19 @@ func nodeRegister(csiDriverName, httpEndpoint string) {
// Registers kubelet plugin watcher api.
registerapi.RegisterRegistrationServer(grpcServer, registrar)

// Sometimes on windows after registration with the kubelet plugin we don't
Collaborator

Can you open up a separate bug to investigate why we don't get the callback sometimes?

Member Author

We have #143. I marked this PR as a workaround instead (it's no longer going to fix #143).

mauriciopoppe added a commit to mauriciopoppe/node-driver-registrar that referenced this pull request Jun 24, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 24, 2021
@mauriciopoppe
Member Author

Without any flags, i.e. as it is right now:

❯ k -n gce-pd-csi-driver get pods
NAME                                    READY   STATUS    RESTARTS   AGE
csi-gce-pd-controller-6c97cbcdf-5zcnv   5/5     Running   0          14m
csi-gce-pd-node-txc5t                   2/2     Running   0          14m
csi-gce-pd-node-win-n6gvj               2/2     Running   0          14m

With the liveness probe and without the file creation flag in the Linux container. In this case the file is never created, so the probe will always fail:

      containers:
        - name: csi-driver-registrar
          image: gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary
          imagePullPolicy: Always
          args:
            - "--v=5"
            - "--csi-address=/csi/csi.sock"
            - "--kubelet-registration-path=/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock"
          volumeMounts:
            - name: kubelet-registration-ack
              mountPath: /registration-ack
          livenessProbe:
            exec:
              command:
              - /csi-node-driver-registrar
              - --kubelet-registration-check-ack=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack
            initialDelaySeconds: 2
      volumes:
        - name: kubelet-registration-ack
          emptyDir: {}

❯ k -n gce-pd-csi-driver get pods
NAME                                    READY   STATUS    RESTARTS      AGE
csi-gce-pd-controller-6c97cbcdf-f4tm8   5/5     Running   0             52s
csi-gce-pd-node-gdbh5                   2/2     Running   1 (21s ago)   52s
csi-gce-pd-node-win-4nm7b               2/2     Running   0             51s

❯ k -n gce-pd-csi-driver describe pod csi-gce-pd-node-gdbh5
Name:                 csi-gce-pd-node-gdbh5
Namespace:            gce-pd-csi-driver
Priority:             900001000
Priority Class Name:  csi-gce-pd-node
Node:                 e2e-test-mauriciopoppe-minion-group-tpd3/10.40.0.11
Start Time:           Thu, 24 Jun 2021 06:41:49 +0000
Labels:               app=gcp-compute-persistent-disk-csi-driver
                      controller-revision-hash=79cff4bc5d
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   10.40.0.11
IPs:
  IP:           10.40.0.11
Controlled By:  DaemonSet/csi-gce-pd-node
Containers:
  csi-driver-registrar:
    Container ID:  containerd://73e3654e9c8fb8eed4dd4ee8288ff3340ebd68c98e83b56aef40888b8825bb89
    Image:         gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary
    Image ID:      gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar@sha256:9a8bc8670baad648c72c3275c5534d773b34569aa6bd9431ed9689269cbfd595
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=/csi/csi.sock
      --kubelet-registration-path=/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock
    State:          Running
      Started:      Thu, 24 Jun 2021 06:43:49 +0000
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 24 Jun 2021 06:43:19 +0000
      Finished:     Thu, 24 Jun 2021 06:43:49 +0000
    Ready:          True
    Restart Count:  4
    Liveness:       exec [/csi-node-driver-registrar --kubelet-registration-check-ack=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack] delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:
      KUBE_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
      /registration-ack from kubelet-registration-ack (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fthwb (ro)
  gce-pd-driver:
    Container ID:  containerd://ed2cd265b86786636019d8d76c8e98b28d8636ff90e805f3ca35978ebc0f10df
    Image:         gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver:latest
    Image ID:      gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver@sha256:b14b1056f048b96c8ed67efceb883a5bb53c8a7da278b700564bdbb3c0a310f3
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --endpoint=unix:/csi/csi.sock
      --run-controller-service=false
    State:          Running
      Started:      Thu, 24 Jun 2021 06:41:51 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /csi from plugin-dir (rw)
      /dev from device-dir (rw)
      /etc/udev from udev-rules-etc (rw)
      /lib/udev from udev-rules-lib (rw)
      /run/udev from udev-socket (rw)
      /sys from sys (rw)
      /var/lib/kubelet from kubelet-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fthwb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry/
    HostPathType:  Directory
  kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
  kubelet-registration-ack:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/pd.csi.storage.gke.io/
    HostPathType:  DirectoryOrCreate
  device-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  Directory
  udev-rules-etc:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/udev
    HostPathType:  Directory
  udev-rules-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/udev
    HostPathType:  Directory
  udev-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/udev
    HostPathType:  Directory
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  kube-api-access-fthwb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  2m5s                default-scheduler  Successfully assigned gce-pd-csi-driver/csi-gce-pd-node-gdbh5 to e2e-test-mauriciopoppe-minion-group-tpd3
  Normal   Pulled     2m4s                kubelet            Successfully pulled image "gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver:latest" in 238.922708ms
  Normal   Pulled     2m4s                kubelet            Successfully pulled image "gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary" in 295.083092ms
  Normal   Created    2m4s                kubelet            Created container gce-pd-driver
  Normal   Pulling    2m4s                kubelet            Pulling image "gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver:latest"
  Normal   Started    2m3s                kubelet            Started container gce-pd-driver
  Warning  Unhealthy  115s                kubelet            Liveness probe failed: E0624 06:41:59.068896      14 main.go:128] path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack doesn't exist, the kubelet plugin registration hasn't succeeded yet
  Warning  Unhealthy  105s                kubelet            Liveness probe failed: E0624 06:42:09.059418      23 main.go:128] path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack doesn't exist, the kubelet plugin registration hasn't succeeded yet
  Normal   Pulled     95s                 kubelet            Successfully pulled image "gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary" in 191.122411ms
  Warning  Unhealthy  95s                 kubelet            Liveness probe failed: E0624 06:42:19.089269      33 main.go:128] path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack doesn't exist, the kubelet plugin registration hasn't succeeded yet
  Warning  Unhealthy  85s                 kubelet            Liveness probe failed: E0624 06:42:29.058963      14 main.go:128] path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack doesn't exist, the kubelet plugin registration hasn't succeeded yet
  Warning  Unhealthy  75s                 kubelet            Liveness probe failed: E0624 06:42:39.070771      26 main.go:128] path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack doesn't exist, the kubelet plugin registration hasn't succeeded yet
  Normal   Killing    65s (x2 over 95s)   kubelet            Container csi-driver-registrar failed liveness probe, will be restarted
  Normal   Started    65s (x3 over 2m4s)  kubelet            Started container csi-driver-registrar
  Normal   Created    65s (x3 over 2m4s)  kubelet            Created container csi-driver-registrar
  Normal   Pulling    65s (x3 over 2m5s)  kubelet            Pulling image "gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary"
  Warning  Unhealthy  65s                 kubelet            Liveness probe failed: E0624 06:42:49.056886      36 main.go:128] path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack doesn't exist, the kubelet plugin registration hasn't succeeded yet
  Normal   Pulled     65s                 kubelet            Successfully pulled image "gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary" in 214.35227ms
  Warning  Unhealthy  55s                 kubelet            Liveness probe failed: E0624 06:42:59.058246      13 main.go:128] path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack doesn't exist, the kubelet plugin registration hasn't succeeded yet


With both flags set:

      containers:
        - name: csi-driver-registrar
          image: gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary
          imagePullPolicy: Always
          args:
            - "--v=5"
            - "--csi-address=/csi/csi.sock"
            - "--kubelet-registration-ack-path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack"
            - "--kubelet-registration-path=/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock"
          volumeMounts:
            - name: kubelet-registration-ack
              mountPath: /registration-ack
          livenessProbe:
            exec:
              command:
              - /csi-node-driver-registrar
              - --kubelet-registration-check-ack=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack
            initialDelaySeconds: 2
      volumes:
        - name: kubelet-registration-ack
          emptyDir: {}

❯ k -n gce-pd-csi-driver get pods
NAME                                    READY   STATUS    RESTARTS   AGE
csi-gce-pd-controller-6c97cbcdf-cxpqd   5/5     Running   0          94s
csi-gce-pd-node-pchkp                   2/2     Running   0          94s
csi-gce-pd-node-win-tcxxz               2/2     Running   0          94s

❯ k -n gce-pd-csi-driver describe pod csi-gce-pd-node-779hn
Name:                 csi-gce-pd-node-779hn
Namespace:            gce-pd-csi-driver
Priority:             900001000
Priority Class Name:  csi-gce-pd-node
Node:                 e2e-test-mauriciopoppe-master/10.40.0.7
Start Time:           Thu, 24 Jun 2021 06:46:46 +0000
Labels:               app=gcp-compute-persistent-disk-csi-driver
                      controller-revision-hash=8f4f6c76
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   10.40.0.7
IPs:
  IP:           10.40.0.7
Controlled By:  DaemonSet/csi-gce-pd-node
Containers:
  csi-driver-registrar:
    Container ID:  containerd://d9b304d677b2e7d072381a190b3b21fc6db4c0ecc81a331814b2de48af66a52d
    Image:         gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary
    Image ID:      gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar@sha256:9a8bc8670baad648c72c3275c5534d773b34569aa6bd9431ed9689269cbfd595
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=/csi/csi.sock
      --kubelet-registration-ack-path=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack
      --kubelet-registration-path=/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock
    State:          Running
      Started:      Thu, 24 Jun 2021 06:46:47 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       exec [/csi-node-driver-registrar --kubelet-registration-check-ack=/registration-ack/pd.csi.storage.gke.io-kubelet-registration-ack] delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:
      KUBE_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
      /registration-ack from kubelet-registration-ack (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9dkhd (ro)
  gce-pd-driver:
    Container ID:  containerd://e5d48f5f87d67576447b2a54b3fbb2d470712360ba499d3d12d3ac6a4843a315
    Image:         gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver:latest
    Image ID:      gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver@sha256:b14b1056f048b96c8ed67efceb883a5bb53c8a7da278b700564bdbb3c0a310f3
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --endpoint=unix:/csi/csi.sock
      --run-controller-service=false
    State:          Running
      Started:      Thu, 24 Jun 2021 06:46:48 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /csi from plugin-dir (rw)
      /dev from device-dir (rw)
      /etc/udev from udev-rules-etc (rw)
      /lib/udev from udev-rules-lib (rw)
      /run/udev from udev-socket (rw)
      /sys from sys (rw)
      /var/lib/kubelet from kubelet-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9dkhd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry/
    HostPathType:  Directory
  kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
  kubelet-registration-ack:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/pd.csi.storage.gke.io/
    HostPathType:  DirectoryOrCreate
  device-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  Directory
  udev-rules-etc:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/udev
    HostPathType:  Directory
  udev-rules-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/udev
    HostPathType:  Directory
  udev-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /run/udev
    HostPathType:  Directory
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  kube-api-access-9dkhd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  69s   default-scheduler  Successfully assigned gce-pd-csi-driver/csi-gce-pd-node-779hn to e2e-test-mauriciopoppe-master
  Normal  Pulling    68s   kubelet            Pulling image "gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary"
  Normal  Pulled     68s   kubelet            Successfully pulled image "gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary" in 232.122353ms
  Normal  Created    68s   kubelet            Created container csi-driver-registrar
  Normal  Started    68s   kubelet            Started container csi-driver-registrar
  Normal  Pulling    68s   kubelet            Pulling image "gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver:latest"
  Normal  Pulled     67s   kubelet            Successfully pulled image "gcr.io/mauriciopoppe-gke-dev/gcp-compute-persistent-disk-csi-driver:latest" in 211.920881ms
  Normal  Created    67s   kubelet            Created container gce-pd-driver
  Normal  Started    67s   kubelet            Started container gce-pd-driver

mauriciopoppe added a commit to mauriciopoppe/node-driver-registrar that referenced this pull request Jun 24, 2021
@mauriciopoppe mauriciopoppe force-pushed the restart-if-not-connected branch from dda7dcd to 79f0aac Compare June 24, 2021 06:54
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 24, 2021

// kubelet registration succeed flags
kubeletRegistrationAckPath = flag.String("kubelet-registration-ack-path", "", "If set, a temp file with this name will be created after the kubelet registration process succeeds.")
kubeletRegistrationCheckAck = flag.String("kubelet-registration-check-ack", "", "Checks that the kubelet plugin registration ack file exists, if set it must be the same value as kubelet-registration-ack-path.")
Member

We could use only one value, kubeletRegistrationAckPath: if kubeletRegistrationAckPath is set, then always check whether the kubelet plugin registration succeeded. That would be simpler to configure.

Member Author

@mauriciopoppe mauriciopoppe Jun 24, 2021

Yes, that was my initial take, but I found some problems with deciding which action to perform. I'm checking whether a lock file exists (per driver), and to do that the flag needs a value, so the probe could be:

          livenessProbe:
            exec:
              command:
              - /csi-node-driver-registrar
              - "--kubelet-registration-ack-path=my-driver-lock-file"
            initialDelaySeconds: 2

The container needs to create this file on startup:

      containers:
        - name: csi-driver-registrar
          image: gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary
          args:
            - "--kubelet-registration-ack-path=my-driver-lock-file"
            - "--kubelet-registration-path=/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock"

The logic to use it would be:

  • if it's the only flag set then we'll use it as a probe
  • if it's together with --kubelet-registration-path then it's not a probe and we should use it to create the file

The fact that there are two behaviors for the same flag felt kinda weird but I could try this approach too
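
A minimal Go sketch of the dual behavior described above may make the trade-off easier to see. The helper `decideMode` and the mode names are hypothetical and not taken from the PR; only the two flag names come from this thread. Validation of a missing registration path is left out.

```go
package main

import "fmt"

// runMode names the behavior selected by a flag combination (hypothetical type).
type runMode string

const (
	modeRegisterOnly   runMode = "register-only"
	modeRegisterAndAck runMode = "register-and-create-file"
	modeProbe          runMode = "probe"
)

// decideMode is a hypothetical helper. ackPath and registrationPath stand for
// the values of --kubelet-registration-ack-path and --kubelet-registration-path.
func decideMode(ackPath, registrationPath string) runMode {
	switch {
	case ackPath != "" && registrationPath == "":
		// The ack flag is the only one set: act as a probe.
		return modeProbe
	case ackPath != "" && registrationPath != "":
		// Both flags set: register and create the ack file on success.
		return modeRegisterAndAck
	default:
		// No ack flag: plain registration without an ack file.
		return modeRegisterOnly
	}
}

func main() {
	sock := "/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock"
	fmt.Println(decideMode("my-driver-lock-file", ""))
	fmt.Println(decideMode("my-driver-lock-file", sock))
	fmt.Println(decideMode("", sock))
}
```

The same flag selecting two different behaviors is visible in the first two cases of the switch, which is the part that felt odd.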

cc @msau42 @jingxu97

Member Author

I've updated the flag usage. There's now one flag to set the lock file path and another to set the mode:

      containers:
        - name: csi-driver-registrar
          image: gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary
          args:
            - "--kubelet-registration-succeeded-lock-file=my-driver-lock-file"
            - "--kubelet-registration-path=/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock"
          livenessProbe:
            exec:
              command:
              - /csi-node-driver-registrar
              - "--kubelet-registration-succeeded-lock-file=my-driver-lock-file"
              - "--kubelet-registration-succeeded-mode=probe"
            initialDelaySeconds: 2

The logic to use them is:

  • to enable checking the registration process, set --kubelet-registration-succeeded-lock-file to the path of a file that will be created in an emptyDir volume; the same value must be passed in both the container args and the liveness probe exec command
  • in addition set --kubelet-registration-succeeded-mode=probe in the liveness probe
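
As a rough illustration of the two-flag scheme, the sketch below parses both argument lists from the manifest above with the standard `flag` package. The `config` struct, `parseArgs`, and the validation rules are assumptions made for this example; only the flag names and the `probe` value come from this thread.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// config holds the parsed flag values (hypothetical struct for this sketch).
type config struct {
	lockFile string
	mode     string
}

// parseArgs parses the flags discussed in this thread and applies the
// validation rules assumed here: "probe" is the only mode value, and probe
// mode requires a lock file path.
func parseArgs(args []string) (config, error) {
	flags := flag.NewFlagSet("csi-node-driver-registrar", flag.ContinueOnError)
	flags.SetOutput(io.Discard) // keep the sketch quiet on parse errors

	var c config
	flags.StringVar(&c.lockFile, "kubelet-registration-succeeded-lock-file", "", "path of the lock file")
	flags.StringVar(&c.mode, "kubelet-registration-succeeded-mode", "", "set to \"probe\" in the liveness probe")
	flags.String("kubelet-registration-path", "", "path of the CSI driver socket")

	if err := flags.Parse(args); err != nil {
		return config{}, err
	}
	if c.mode != "" && c.mode != "probe" {
		return config{}, errors.New("unknown mode: " + c.mode)
	}
	if c.mode == "probe" && c.lockFile == "" {
		return config{}, errors.New("probe mode requires a lock file path")
	}
	return c, nil
}

func main() {
	containerArgs := []string{
		"--kubelet-registration-succeeded-lock-file=my-driver-lock-file",
		"--kubelet-registration-path=/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock",
	}
	probeArgs := []string{
		"--kubelet-registration-succeeded-lock-file=my-driver-lock-file",
		"--kubelet-registration-succeeded-mode=probe",
	}
	for _, args := range [][]string{containerArgs, probeArgs} {
		c, err := parseArgs(args)
		if err != nil {
			fmt.Println("invalid flags:", err)
			continue
		}
		fmt.Printf("lockFile=%q probe=%t\n", c.lockFile, c.mode == "probe")
	}
}
```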

cc @msau42 @jingxu97

Collaborator

Do we need the lock file arg? Can we have a predefined file name based on the driver name?

Member Author

@mauriciopoppe mauriciopoppe Jul 13, 2021

The driver name only becomes available a few lines after the fix, and it's fetched through an RPC call. I could go through this path once and cache it in memory for later calls of the probe.

csiDriverName, err := csirpc.GetDriverName(ctx, csiConn)
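
One way to read "go through this path once and cache it" is a `sync.Once` wrapper around the lookup. The sketch below is an assumption, not code from the PR: `driverNameCache` and its `fetch` field are hypothetical, and `fetch` stands in for the RPC call shown above. Such a cache only lives as long as the process that holds it.

```go
package main

import (
	"fmt"
	"sync"
)

// driverNameCache memoizes the result of a single driver name lookup.
// fetch is a hypothetical stand-in for the RPC that returns the driver name.
type driverNameCache struct {
	once  sync.Once
	name  string
	err   error
	fetch func() (string, error)
}

// Get runs fetch on the first call only and returns the stored result
// afterwards. Note that an error from the first call is stored as well.
func (c *driverNameCache) Get() (string, error) {
	c.once.Do(func() {
		c.name, c.err = c.fetch()
	})
	return c.name, c.err
}

func main() {
	calls := 0
	cache := &driverNameCache{
		fetch: func() (string, error) {
			calls++ // count how often the expensive lookup runs
			return "pd.csi.storage.gke.io", nil
		},
	}
	for i := 0; i < 3; i++ {
		name, err := cache.Get()
		fmt.Println(name, err)
	}
	fmt.Println("lookups performed:", calls)
}
```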

mauriciopoppe added a commit to mauriciopoppe/node-driver-registrar that referenced this pull request Jun 24, 2021
@mauriciopoppe mauriciopoppe force-pushed the restart-if-not-connected branch from 79f0aac to d72b499 Compare June 24, 2021 17:21
@mauriciopoppe mauriciopoppe force-pushed the restart-if-not-connected branch 3 times, most recently from b64889e to 4c0a68c Compare July 23, 2021 18:19
@mauriciopoppe mauriciopoppe changed the title Restart node-driver-registrar if it didn't receive a callback from the kubelet Provide different running modes for node-driver-registrar, add a run mode to detect if the kubelet plugin registration failed Jul 23, 2021
@mauriciopoppe mauriciopoppe force-pushed the restart-if-not-connected branch from 4c0a68c to 58bd237 Compare July 29, 2021 00:23
@mauriciopoppe
Member Author

Thanks for the feedback @msau42. The registration file is now always created when the kubelet plugin registration succeeds; however, doing a health check with a probe is optional. I've tested the new implementation on Windows nodes:

These are the logs when node-driver-registrar is started without the code that creates the registration file (I commented out the code to test it):

  Normal   Created    18s (x3 over 79s)  kubelet            Created container csi-driver-registrar
  Normal   Pulled     18s                kubelet            Successfully pulled image "gcr.io/mauriciopoppe-gke-dev/csi-node-driver-registrar:canary" in 283.7998ms
  Normal   Started    17s (x3 over 78s)  kubelet            Started container csi-driver-registrar
  Warning  Unhealthy  9s (x7 over 70s)   kubelet            Liveness probe failed: Kubelet plugin registration hasn't succeeded yet, file=C:\var\lib\kubelet\plugins\pd.csi.storage.gke.io\registration doesn't exist.

Kubelet is restarting the component as expected:

image

fmt.Println(os.Args[0], version)
return
// set after we made sure that *kubeletRegistrationPath exists
kubeletRegistrationPathDir := filepath.Dir(*kubeletRegistrationPath)
Collaborator

Do you need the path to the file? Should the argument be the path to the directory?

Member Author

I think that using different values for the same arg might be a bit confusing, which is why the docs ask the user to use the same value as the one in the container args; I thought it'd be easier to just copy and paste the same value. You're right, though, that we actually need the path to the directory (in the implementation this is computed with filepath.Dir).

We could use the same arg with the path to the directory as the value with additional documentation too.
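
For reference, a small sketch of how the probed path could be derived from the socket path with filepath.Dir. The function name `registrationProbePath` and the constant are hypothetical; the file name `registration` is an assumption taken from the liveness probe message in the Windows test log earlier in this conversation.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// registrationFileName is an assumption based on the probe failure message
// quoted in this conversation, where the probed file is named "registration".
const registrationFileName = "registration"

// registrationProbePath returns the path of the registration file, placed in
// the same directory as the socket given by --kubelet-registration-path.
func registrationProbePath(kubeletRegistrationPath string) string {
	dir := filepath.Dir(kubeletRegistrationPath)
	return filepath.Join(dir, registrationFileName)
}

func main() {
	sock := "/var/lib/kubelet/plugins/pd.csi.storage.gke.io/csi.sock"
	fmt.Println(registrationProbePath(sock))
}
```

Passing the socket path and deriving the directory keeps the probe args identical to the container args, which is the copy and paste convenience mentioned above.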

Collaborator

Oh I forgot that we're reusing the same argument. Then I think this is fine.

if modeIsKubeletRegistrationProbe() {
lockfileExists, err := util.DoesFileExist(registrationProbePath)
if err != nil {
fmt.Printf("Failed to check if registration path exists, registrationProbePath=%s err=%v", registrationProbePath, err)
Collaborator

klog.Fatalf should work here. Same below.

Member Author

When I used klog it'd print the golang debug line in the pod events, something like:

liveness probe failed: EDDMM date main.go:###] Failed to ...

I thought that just by using fmt.Printf I could get only the message e.g.

liveness probe failed: Failed to ...
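
To make the probe behavior under discussion concrete, here is a self-contained sketch that uses only the standard library. `doesFileExist`, `runProbe`, and `demo` are hypothetical names (the PR calls util.DoesFileExist, whose implementation is not shown here), and the success message is invented for the example. The two failure messages are copied from this conversation. A real probe would pass the returned code to os.Exit.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// doesFileExist reports whether path exists, treating "not found" as a
// normal false result rather than an error.
func doesFileExist(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	return false, err
}

// runProbe returns the exit code and the plain message a probe would print.
func runProbe(registrationProbePath string) (int, string) {
	exists, err := doesFileExist(registrationProbePath)
	if err != nil {
		return 1, fmt.Sprintf("Failed to check if registration path exists, registrationProbePath=%s err=%v", registrationProbePath, err)
	}
	if !exists {
		return 1, fmt.Sprintf("Kubelet plugin registration hasn't succeeded yet, file=%s doesn't exist.", registrationProbePath)
	}
	return 0, "Kubelet plugin registration succeeded." // message invented for this sketch
}

// demo runs the probe before and after the registration file is created.
func demo() (missingCode, presentCode int, err error) {
	dir, err := os.MkdirTemp("", "probe-sketch")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "registration")
	missingCode, _ = runProbe(path)
	if werr := os.WriteFile(path, nil, 0o600); werr != nil {
		return 0, 0, werr
	}
	presentCode, _ = runProbe(path)
	return missingCode, presentCode, nil
}

func main() {
	missing, present, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("exit code while the file is missing:", missing)
	fmt.Println("exit code once the file exists:", present)
}
```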

Collaborator

I think it's good to have consistency with the other logs we do. Having timestamp information is useful for debugging.

@msau42
Collaborator

msau42 commented Aug 2, 2021

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 2, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, mauriciopoppe, msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
6 participants