CSI: internal plugin errors aren't exposed to operator #7424

tgross · 2020-03-23T12:38:50Z

If a plugin has an internal error, we log a gRPC error code but that's the only information that we can get according to the CSI specification. An example I encountered was when I stopped a job but the EC2 IAM instance role did not have DetachVolume permissions; the job stopped but the EBS volume was still attached to the EC2 instance.

When ControllerUnpublishVolume was called, the client logs showed the following:

2020-03-22T14:24:57.238Z [WARN] client.csi_client: finished client
unary call: plugin.name=aws-ebs0 plugin.type=controller
grpc.code=Internal duration=161.987586ms
grpc.service=csi.v1.Controller grpc.method=ControllerUnpublishVolume

But if we look at the controller plugin's alloc logs they show the real problem:

E0322 14:24:57.237631 1 driver.go:109] GRPC error: rpc error: code =
Internal desc = Could not detach volume "vol-09b7801ac621c83f5" from
node "i-0aa3ac7f05e936deb": could not detach volume
"vol-09b7801ac621c83f5" from node "i-0aa3ac7f05e936deb":
UnauthorizedOperation: You are not authorized to perform this
operation. Encoded authorization failure message: [REDACTED]

(logs redacted and line-broken for readability)
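
For operators grepping plugin allocation logs, a small helper can pull the gRPC code and description out of a line like the one above. This is only a sketch: `parseRPCError` is a hypothetical name, not part of Nomad or any CSI plugin, and it assumes the log line contains the `rpc error: code = ... desc = ...` text on a single line (the excerpt above was line-broken by hand).

```go
package main

import (
	"fmt"
	"strings"
)

// parseRPCError extracts the gRPC code and description from a log line that
// contains "rpc error: code = <code> desc = <description>".
// Hypothetical helper for illustration; ok is false if the markers are missing.
func parseRPCError(line string) (code, desc string, ok bool) {
	const codeMarker = "rpc error: code = "
	const descMarker = " desc = "

	i := strings.Index(line, codeMarker)
	if i < 0 {
		return "", "", false
	}
	rest := line[i+len(codeMarker):]

	j := strings.Index(rest, descMarker)
	if j < 0 {
		return "", "", false
	}
	code = strings.TrimSpace(rest[:j])
	desc = strings.TrimSpace(rest[j+len(descMarker):])
	return code, desc, true
}

func main() {
	line := `E0322 14:24:57.237631 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-09b7801ac621c83f5" from node "i-0aa3ac7f05e936deb"`
	if code, desc, ok := parseRPCError(line); ok {
		fmt.Println("code:", code)
		fmt.Println("desc:", desc)
	}
}
```

The Nomad client log only carries the code (`grpc.code=Internal`), so the description recovered this way is the part that is otherwise invisible to the operator.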

There are two things to fix here:

- We should retry plugin RPCs (Automatically Retry Retriable RPC failures to CSI Plugins #6863) and then emit much louder error messages when they fail.
- We should document that some of the internal behaviors of plugins aren't visible to Nomad and that operators should check the plugin logs when debugging plugin failures.


tgross · 2020-03-30T19:07:09Z

The work done in #7549 for #6863 shows the places where we could hook better error handling.

For the CSI RPCs that can be retried, we only retry on timeouts, codes.Unavailable, and codes.ResourceExhausted; all other errors are fatal. We can pull some information out of the error codes by digging into the documentation on the individual calls. For example, see the errors for NodeStageVolume.

langmartin · 2020-03-30T20:13:38Z

The intention of the CSIVolume ResourceExhausted field is to capture the time of the last error on the volume, so that the scheduler can back off or wait for a resource to be freed. We might want to incorporate that as part of this ticket.

tgross · 2020-05-13T20:59:29Z

An example where we could be doing better without having to radically rework how we interface with the RPCs is #7931 (comment), where we can't communicate with the plugin because of file permissions on the socket.

tgross · 2020-05-15T18:17:09Z

I spent the morning digging through the hooks we can get via gRPC and actually there's nothing else we can do here with the messages we're getting back. What we can do is to make sure that the server RPCs are wrapping messages we get back from the clients nicely so that the CLI user gets better feedback when it's available. And we can give some direction to check the allocation logs.

Also note that since we opened this issue we implemented #7547, which threads some of these client-side messages up through the node events and that actually makes this kind of issue quite a bit better:

Recent Events:
Time                       Type           Description
2020-05-15T14:11:21-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: rpc error: code = Internal desc = Could not attach volume "vol-03150160e19dbb5dd" to node "i-0e096b3a1fd9d7f6c": could not attach volume "vol-03150160e19dbb5dd" to node "i-0e096b3a1fd9d7f6c": UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: REDACTED
        status code: 403, request id: 7da740ea-ee49-4170-a8c2-ef00be8886f1
2020-05-15T14:11:20-04:00  Received       Task received by client

tgross · 2020-05-15T19:53:23Z

Will be closed by #7984

github-actions · 2022-11-07T02:32:51Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added the theme/docs and theme/storage labels Mar 23, 2020

tgross added this to the 0.11.0 milestone Mar 23, 2020

tgross added the theme/observability label Mar 23, 2020

tgross mentioned this issue Mar 30, 2020

csi: add grpc retries to client controller RPCs #7549

Merged

tgross removed this from the 0.11.0 milestone Mar 30, 2020

This was referenced Mar 31, 2020

CSI: failed to setup alloc: pre-run hook "csi_hook" #7568

Closed

csi: unpublish workflow ID mismatches #7626

Closed

csi: unpublish workflow ID mismatches #7628

Closed

tgross added this to the 0.11.1 milestone Apr 9, 2020

tgross modified the milestones: 0.11.1, 0.11.2 Apr 22, 2020

tgross mentioned this issue Apr 27, 2020

Error registering csi volume - Azure Disk #7812

Closed

tgross modified the milestones: 0.11.2, 0.11.3 May 13, 2020

tgross mentioned this issue May 13, 2020

Nomad is unable to create CSI plugin due to being unable to probe the CSI driver #7931

Closed

tgross mentioned this issue May 14, 2020

csi: use a blocking initial connection with timeout #7965

Merged

tgross mentioned this issue May 15, 2020

csi: improve plugin error messages and volume validation #7984

Merged

tgross closed this as completed in #7984 May 18, 2020

github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022
