Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"event not processed by enough 'sink' nodes" affecting audit requests #24376

Closed
lboynton opened this issue Dec 5, 2023 · 9 comments
Closed

Comments

@lboynton
Copy link
Contributor

lboynton commented Dec 5, 2023

Describe the bug
After upgrading to Vault 1.15.3 I am still observing "event not processed by enough 'sink' nodes" errors in our Vault log. #23871 has partially fixed the problem, however it only addresses audit responses and not audit requests. Is there a similar fix to apply for audit requests too?

2023-12-05T15:49:28.249Z [ERROR] core: failed to audit request: path=sys/leases/renew
  error=
  | 1 error occurred:
  | \t* event not processed by enough 'sink' nodes
  | 
  
2023-12-05T15:49:33.158Z [ERROR] secrets.system.system_072cc7dc: lease renewal failed: lease_id=database/creds/readonly/weDQJArsWMkCUwVBf7zzOVwg
  error=
  | failed to renew entry: resp: (*logical.Response)(nil) err: 1 error occurred:
  | \t* timeout: context canceled
  | 

To Reproduce
TBC - I'm going to see if I can reproduce it in a clean dev Vault install.

Expected behavior
No error message.

Environment:

  • Vault Server Version (retrieve with vault status): 1.15.3
  • Vault CLI Version (retrieve with vault version): 1.15.3
  • Server Operating System/Architecture: MacOS arm64/Linux amd64
@miagilepner
Copy link
Contributor

Hi, I'm talking to the team to try to determine what the clarify the correct behavior here. This was something that we explicitly didn't change as part of #23871 because if the request context is canceled before the request can be audited, no state will have changed in Vault.

@marcboudreau
Copy link
Contributor

Hi 👋,
I can provide a bit of context here to explain the observed behaviour. Starting in Vault 1.15.0, a different framework is used to publish audit log records (for both request and response) to their various Audit Devices (which is what 'sink' is referring to in the error). This new framework will return an error if the request deadline has been exceed (tracked by the context); whereas the old framework did not check if the deadline had been exceeded before writing the audit record. The API response was the same though, an error would be returned because the request took too long to service. The difference being that in 1.15.x, the response audit record is missing from the Audit Devices and the server logs include the above mentioned error; contrasted with < 1.15.x, where there was a response audit record showing the error returned to the API client written to the Audit Devices and no error in the server logs.

@marcboudreau
Copy link
Contributor

Quick update: as of Vault 1.15.4, a change was introduce to avoid this exact issue. A new context with a fresh timeout value of 5 seconds is used to publish the response audit log record to the Audit Devices, so Value should now exhibit the same behaviour as prior to 1.15.0 when the context deadline has been reached while processing a request, which is that the response audit log will be written and will show the error that was sent back to the client.

@marcboudreau
Copy link
Contributor

I'm going to close this Issue since the undesirable behaviour has been addressed already in Vault 1.15.4.

@lboynton
Copy link
Contributor Author

A partial fix in #23871 was made and landed in Vault 1.15.3, I'm not aware of any changes in 1.15.4 that fixes this.

Hi, I'm talking to the team to try to determine what the clarify the correct behavior here. This was something that we explicitly didn't change as part of #23871 because if the request context is canceled before the request can be audited, no state will have changed in Vault.

Would it be possible to simply ignore the error and avoid incrementing the vault_audit_log_request_failure metric? It should be possible to use this metric to know when an audit device fails, however it is misleading here as the audit device is not failing.

@KKatsnel
Copy link

KKatsnel commented Feb 6, 2024

@marcboudreau Hi,
We've just faced this issue after upgrading to 1.15.5, and there is no answer to the question from the previous message: would it be possible to ignore the error and avoid incrementing the vault_audit_log_request_failure metric?
It is quite an important metric, so we don't want to disable an alert for it.

@nilune
Copy link

nilune commented Feb 14, 2024

1.15.5 The same issue with metric

@aphorise
Copy link
Contributor

This issue is likely resolved since Vault 1.16.3 as per the notes on issue: #25549 & CHANGELOG.md notes:

  • core/audit: Audit logging a Vault request/response will now use a minimum 5 second context timeout. If the existing context deadline occurs later than 5s in the future, it will be used, otherwise a new context, separate from the original will be used. [GH-26616]

@aphorise
Copy link
Contributor

aphorise commented Oct 4, 2024

A correction to the earlier comment:

This issue is likely resolved since Vault 1.16.3 as per the notes on issue: #25549 & CHANGELOG.md notes:

Fixes were included up to version 1.16.10 (for other use cases) - however the best version to target that is likely to resolve all know occurrence is Vault 1.16.10 or higher

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants