CSI: internal plugin errors aren't exposed to operator #7424

Closed
tgross opened this issue Mar 23, 2020 · 6 comments · Fixed by #7984
Labels
theme/docs Documentation issues and enhancements theme/observability theme/storage
Milestone

Comments

@tgross
Member

tgross commented Mar 23, 2020

If a plugin has an internal error, we log a gRPC error code but that's the only information that we can get according to the CSI specification. An example I encountered was when I stopped a job but the EC2 IAM instance role did not have DetachVolume permissions; the job stopped but the EBS volume was still attached to the EC2 instance.

When ControllerUnpublishVolume was called, the client logs showed the following:

2020-03-22T14:24:57.238Z [WARN] client.csi_client: finished client
unary call: plugin.name=aws-ebs0 plugin.type=controller
grpc.code=Internal duration=161.987586ms
grpc.service=csi.v1.Controller grpc.method=ControllerUnpublishVolume

But if we look at the controller plugin's alloc logs they show the real problem:

E0322 14:24:57.237631 1 driver.go:109] GRPC error: rpc error: code =
Internal desc = Could not detach volume "vol-09b7801ac621c83f5" from
node "i-0aa3ac7f05e936deb": could not detach volume
"vol-09b7801ac621c83f5" from node "i-0aa3ac7f05e936deb":
UnauthorizedOperation: You are not authorized to perform this
operation. Encoded authorization failure message: [REDACTED]

(logs redacted and line-broken for readability)
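When reading plugin alloc logs like the ones above, it can help to pull the gRPC code and description out of each line. The following is only a rough sketch: `parsePluginError` and `grpcErrPattern` are hypothetical names, not part of Nomad or the plugin, and it assumes the log line carries the usual `rpc error: code = ... desc = ...` text on a single line. The sample line is abbreviated from the log above.

```go
package main

import (
	"fmt"
	"regexp"
)

// grpcErrPattern matches the "rpc error: code = X desc = Y" text seen in
// the plugin log above. Assumption: the whole message is on one line.
var grpcErrPattern = regexp.MustCompile(`rpc error: code = (\w+) desc = (.*)`)

// parsePluginError is a hypothetical helper that extracts the gRPC code
// and the description from a single plugin log line.
func parsePluginError(line string) (code, desc string, ok bool) {
	m := grpcErrPattern.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	// Abbreviated version of the controller plugin log line.
	line := `E0322 14:24:57.237631 1 driver.go:109] GRPC error: ` +
		`rpc error: code = Internal desc = Could not detach volume ` +
		`"vol-09b7801ac621c83f5" from node "i-0aa3ac7f05e936deb": ` +
		`UnauthorizedOperation: You are not authorized to perform this operation.`

	if code, desc, ok := parsePluginError(line); ok {
		fmt.Println("code:", code)
		fmt.Println("desc:", desc)
	}
}
```

The description is where the useful detail lives (here, the `UnauthorizedOperation` from AWS), which is exactly the part the Nomad client log does not show.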

There are two things to fix here:

  • we should retry plugin RPCs (Automatically Retry Retriable RPC failures to CSI Plugins #6863) and then make much louder error messages when they fail.
  • we should document that some of the internal behaviors of plugins aren't visible to Nomad and that operators should check the plugin logs when debugging plugin failures.
@tgross tgross added theme/docs Documentation issues and enhancements theme/storage labels Mar 23, 2020
@tgross tgross added this to the 0.11.0 milestone Mar 23, 2020
@tgross
Member Author

tgross commented Mar 30, 2020

The work done in #7549 for #6863 shows the places where we could hook better error handling.

In the CSI RPCs that can be retried, we only retry on timeouts, codes.Unavailable, and codes.ResourceExhausted; all other errors are fatal. We can pull some information out of the error codes by digging into the documentation on the individual calls. For example, the errors for NodeStageVolume
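As a rough illustration of the retry rule described in this comment (retry on timeout, Unavailable, and ResourceExhausted; treat everything else as fatal), here is a minimal sketch. It is not the code from #7549. To stay self-contained it uses a local `Code` type and a hypothetical `rpcError` type as stand-ins; real code would more likely inspect the error with `status.Code` from `google.golang.org/grpc/status` and compare against the constants in `google.golang.org/grpc/codes`. Treating `context.DeadlineExceeded` as "timeout" is also an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Code is a local stand-in for a gRPC status code (assumption: real code
// would use codes.Code from google.golang.org/grpc/codes).
type Code string

const (
	Unavailable       Code = "Unavailable"
	ResourceExhausted Code = "ResourceExhausted"
	Internal          Code = "Internal"
)

// rpcError is a hypothetical error type carrying a status code.
type rpcError struct {
	Code Code
	Desc string
}

func (e *rpcError) Error() string {
	return fmt.Sprintf("rpc error: code = %s desc = %s", e.Code, e.Desc)
}

// isRetryable reports whether an error falls into the retryable set:
// a timeout, Unavailable, or ResourceExhausted. All other errors are fatal.
func isRetryable(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var rpcErr *rpcError
	if errors.As(err, &rpcErr) {
		switch rpcErr.Code {
		case Unavailable, ResourceExhausted:
			return true
		}
	}
	return false
}

func main() {
	errs := []error{
		fmt.Errorf("controller unpublish: %w", context.DeadlineExceeded),
		&rpcError{Code: Unavailable, Desc: "plugin not ready"},
		&rpcError{Code: Internal, Desc: "could not detach volume"},
	}
	for _, err := range errs {
		fmt.Printf("retry=%t err=%v\n", isRetryable(err), err)
	}
}
```

Note that the `Internal` error from the original report lands in the fatal bucket, so retries alone would not have surfaced the IAM problem.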

@langmartin
Contributor

The intention of the CSIVolume ResourceExhausted field is to capture the time of the last error on the volume, so that the scheduler can back off or wait for a resource to be freed. We might want to incorporate that as part of this ticket.

@tgross
Member Author

tgross commented May 13, 2020

An example where we could be doing better without having to radically rework how we interface with the RPCs is #7931 (comment), where we can't communicate with the plugin because of file permissions on the socket.

@tgross
Member Author

tgross commented May 15, 2020

I spent the morning digging through the hooks we can get via gRPC and actually there's nothing else we can do here with the messages we're getting back. What we can do is to make sure that the server RPCs are wrapping messages we get back from the clients nicely so that the CLI user gets better feedback when it's available. And we can give some direction to check the allocation logs.
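The wrapping idea in the paragraph above could look roughly like the sketch below. This is illustrative only and not what #7984 implements: `wrapPluginError` and `errDetach` are hypothetical names, and the wording of the hint is an assumption. It keeps the original error reachable with `%w` and appends a pointer to the plugin allocation logs.

```go
package main

import (
	"errors"
	"fmt"
)

// errDetach is a sample error, modeled on the message in the original report.
var errDetach = errors.New("rpc error: code = Internal desc = Could not detach volume")

// wrapPluginError is a hypothetical helper that adds the plugin ID and RPC
// method as context, and tells the operator where to look for more detail.
// The original error stays in the chain via %w.
func wrapPluginError(pluginID, method string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("plugin %q failed %s: %w (check the plugin allocation logs for details)",
		pluginID, method, err)
}

func main() {
	err := wrapPluginError("aws-ebs0", "ControllerUnpublishVolume", errDetach)
	fmt.Println(err)
	// The cause is still detectable by callers further up the stack.
	fmt.Println(errors.Is(err, errDetach))
}
```

Because the cause is wrapped rather than flattened into a string, callers higher up can still match on it while the CLI user sees the added context.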

Also note that since we opened this issue we implemented #7547, which threads some of these client-side messages up through the node events and that actually makes this kind of issue quite a bit better:

Recent Events:
Time                       Type           Description
2020-05-15T14:11:21-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: rpc error: code = Internal desc = Could not attach volume "vol-03150160e19dbb5dd" to node "i-0e096b3a1fd9d7f6c": could not attach volume "vol-03150160e19dbb5dd" to node "i-0e096b3a1fd9d7f6c": UnauthorizedOperation: You are not authorized to perform this operation. Encoded authorization failure message: REDACTED
        status code: 403, request id: 7da740ea-ee49-4170-a8c2-ef00be8886f1
2020-05-15T14:11:20-04:00  Received       Task received by client

@tgross
Member Author

tgross commented May 15, 2020

Will be closed by #7984

@github-actions

github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022