Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make server side error messages from vault more clearer #3968

Merged
merged 6 commits into from
Mar 14, 2018

Conversation

preetapan
Copy link
Member

@preetapan preetapan commented Mar 12, 2018

new code

Recent Events:
Time                       Type             Description
2018-03-13T17:05:21-05:00  Alloc Unhealthy  Unhealthy because of failed task
2018-03-13T17:05:21-05:00  Killing          vault: server error in deriving vault token: Nomad Server failed to establish connections to Vault: failed to lookup Vault periodic token: Error making API request.

URL: POST http://127.0.0.1:8200/v1/auth/token/looku
Code: 503. Errors:

* Vault is sealed
2018-03-13T17:05:21-05:00  Task Setup       Building Task Directory

master

Recent Events:
Time                       Type             Description
2018-03-12T16:50:04-05:00  Alloc Unhealthy  Unhealthy because of failed task
2018-03-12T16:50:04-05:00  Killing          Killing task: vault: failed to derive token: failed to create token for task "redis" on alloc "259d25ad-f17d-03bf-691a-95ebb9050479": Nomad Server failed to establish connections to Vault: failed to lookup Vault periodic token: Error making API request.

URL: POST http://127.0.0.1:8200/v1/auth/token/lookup
Code: 503. Errors:

* Vault is sealed
2018-03-12T16:50:04-05:00  Task Setup       Building Task Directory
..

@preetapan
Copy link
Member Author

@dadgar @schmichael open to better ways to make this even nicer. Even now there are multiple wrapping errors in the description in recent events, but the first thing in the message is that its a server error so that should make it clear that it needs operator intervention on nomad servers to fix.

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to fix the IsRecoverable issue but otherwise looks good!


return &RecoverableError{
Err: e.Error(),
ServerSideErr: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be true?

}

// IsRecoverable returns true if error is a RecoverableError with
// Recoverable=true. Otherwise false is returned.
func IsRecoverable(e error) bool {
if re, ok := e.(Recoverable); ok {
if re, ok := e.(ErrorMeta); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getter.GetError also implements IsRecoverable so we shouldn't change this assertion. Maybe keep the Recoverable interface as is and add:

type ErrorMeta interface {
  Recoverable
  IsServerSideError
}

...or we could implement IsServerSideError on getter.GetError ... the meaning there seems a bit ambiguous though. Would a 400 fetching an artifact be a server-side error? I think no because Server refers to a Nomad Server, but we'd still want to document the interface's exact behavior thoroughly.

nomad/vault.go Outdated
@@ -429,7 +429,7 @@ OUTER:
retryTimer.Reset(v.config.ConnectionRetryIntv)
v.l.Lock()
v.connEstablished = true
v.connEstablishedErr = fmt.Errorf("Nomad Server failed to establish connections to Vault: %v", err)
v.connEstablishedErr = structs.NewServerSideError(fmt.Errorf("Nomad Server failed to establish connections to Vault: %v", err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be really nice if Vault didn't reformat API errors into one giant string in the api package :(

func (r *Response) Error() error {
// 200 to 399 are okay status codes. 429 is the code for health status of
// standby nodes.
if (r.StatusCode >= 200 && r.StatusCode < 400) || r.StatusCode == 429 {
return nil
}
// We have an error. Let's copy the body into our own buffer first,
// so that if we can't decode JSON, we can at least copy it raw.
var bodyBuf bytes.Buffer
if _, err := io.Copy(&bodyBuf, r.Body); err != nil {
return err
}
// Decode the error response if we can. Note that we wrap the bodyBuf
// in a bytes.Reader here so that the JSON decoder doesn't move the
// read pointer for the original buffer.
var resp ErrorResponse
if err := jsonutil.DecodeJSON(bodyBuf.Bytes(), &resp); err != nil {
// Ignore the decoding error and just drop the raw response
return fmt.Errorf(
"Error making API request.\n\n"+
"URL: %s %s\n"+
"Code: %d. Raw Message:\n\n%s",
r.Request.Method, r.Request.URL.String(),
r.StatusCode, bodyBuf.String())
}
var errBody bytes.Buffer
errBody.WriteString(fmt.Sprintf(
"Error making API request.\n\n"+
"URL: %s %s\n"+
"Code: %d. Errors:\n\n",
r.Request.Method, r.Request.URL.String(),
r.StatusCode))
for _, err := range resp.Errors {
errBody.WriteString(fmt.Sprintf("* %s", err))
}
return fmt.Errorf(errBody.String())
}

If that just returned fmt.Errorf(bodyBuf.String()) our error messages would be much nicer! No embedded newlines and useless HTTP metadata (url+status).

Ideally Vault would return a custom error type that allowed us to pluck out the bits of metadata we cared about.

No need to block this PR on that, but do you think we should open an issue with them?

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with one small change

@@ -1143,7 +1143,14 @@ func (n *Node) DeriveVaultToken(args *structs.DeriveVaultTokenRequest,
if e == nil {
return
}
reply.Error = structs.NewRecoverableError(e, recoverable).(*structs.RecoverableError)
re, ok := e.(structs.Recoverable)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert to *structs.RecoverableError here and skip the second assertion below. One less assertion and there are multiple types that implement structs.Recoverable (even if right now only one of them is used here).

@@ -6118,22 +6133,34 @@ func (r *RecoverableError) IsRecoverable() bool {
return r.Recoverable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix comment

client/client.go Outdated
@@ -1754,11 +1754,11 @@ func (c *Client) deriveToken(alloc *structs.Allocation, taskNames []string, vcli
var resp structs.DeriveVaultTokenResponse
if err := c.RPC("Node.DeriveVaultToken", &req, &resp); err != nil {
c.logger.Printf("[ERR] client.vault: DeriveVaultToken RPC failed: %v", err)
return nil, fmt.Errorf("DeriveVaultToken RPC failed: %v", err)
return nil, structs.NewWrappedServerError(resp.Error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should return the err not the resp error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should it be marked as a server error? It is likely there is no valid server which is where the error is coming from since all errors other than network are returned via the resp.Error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this should not be marked as such, will fix

if structs.IsServerSide(err) {
r.logger.Printf("[ERR] client: failed to derive Vault token for task %v on alloc %q: %v",
r.task.Name, r.alloc.ID, err)
r.Kill("vault", fmt.Sprintf("server error in deriving vault token: %v", err), true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove in

@preetapan preetapan merged commit c4389cd into master Mar 14, 2018
@preetapan preetapan deleted the f-nicer-vault-error branch March 14, 2018 01:49
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants