
Error "event not processed by enough 'sink' nodes" when file audit enabled in Vault 1.15 #23871

Closed
lboynton opened this issue Oct 26, 2023 · 1 comment · Fixed by #24238
@lboynton (Contributor) commented Oct 26, 2023

Describe the bug
Since upgrading to Vault 1.15, we are seeing occasional errors in the logs such as:

2023-10-26T22:04:43.804+0100 [ERROR] secrets.system.system_9ba98b00: lease renewal failed: lease_id=database/creds/readonly/weDQJArsWMkCUwVBf7zzOVwg
  error=
  | failed to renew entry: resp: (*logical.Response)(nil) err: 1 error occurred:
  | \t* timeout: context canceled
  | 
  
2023-10-26T22:04:43.805+0100 [ERROR] core: failed to audit response: request_path=sys/leases/renew
  error=
  | 1 error occurred:
  | \t* event not processed by enough 'sink' nodes
  | 
  

The first error is possibly due to network latency/timeouts to a database when using dynamic credentials. The second error is new and is due to the new audit logging behaviour in Vault 1.15. It causes the vault_audit_log_request_failure telemetry metric to be incremented and could therefore trigger alerts unnecessarily.

To Reproduce
I've managed to reproduce this with a vault server -dev instance locally:

  1. Follow the getting started guide using a local Vault instance and a local postgres instance running in docker.
  2. Enable audit logs with vault audit enable file file_path=/tmp/vault_audit.log
  3. Install toxiproxy locally which we will use to simulate some network latency/jitter. Start the server with toxiproxy-server.
  4. Set up a backend in toxiproxy toxiproxy-cli create -l localhost:5433 -u localhost:5432 postgres
  5. Add a "toxic" to simulate some network latency/jitter toxiproxy-cli toxic add -t latency -a latency=1000 -a jitter=2000 postgres
  6. Tell Vault to use toxiproxy instead of postgres direct:
vault write database/config/postgresql \
     plugin_name=postgresql-database-plugin \
     connection_url="postgresql://{{username}}:{{password}}@$POSTGRES_URL:5433/postgres?sslmode=disable" \
     allowed_roles=readonly \
     username="root" \
     password="rootpassword"
  7. Create some dynamic credentials: vault read database/creds/readonly
  8. Generate a curl command to renew the lease: vault lease renew --output-curl-string database/creds/readonly/weDQJArsWMkCUwVBf7zzOVwg (update the lease ID as needed)
  9. Update the curl command to have a timeout and run the command: curl --max-time 1.2 -X PUT -H "X-Vault-Request: true" -H "X-Vault-Token: $(vault print token)" -d '{"increment":0,"lease_id":"database/creds/readonly/weDQJArsWMkCUwVBf7zzOVwg"}' http://127.0.0.1:8200/v1/sys/leases/renew
  10. Review the Vault error log

You should see an event not processed by enough 'sink' nodes error here. This error can happen in a dynamic environment where requests are made to renew leases, which can take a bit of time, and the request is cancelled while it's being processed. This can happen in k8s when pods are spinning up/down frequently.

Expected behavior
Although the lease renewal failed, I would not expect this to cause an error when writing to the audit file. I don't believe it caused a problem in earlier versions of Vault.

Environment:

  • Vault Server Version (retrieve with vault status): 1.15.1
  • Vault CLI Version (retrieve with vault version): 1.15.1
  • Server Operating System/Architecture: macOS arm64/Linux amd64

Vault server configuration file(s): Not applicable as this is using a vault server -dev instance with no special configuration.

Setting VAULT_AUDIT_DISABLE_EVENTLOGGER=true as an env var to revert to the previous audit log behaviour stops the error appearing in the logs:

2023-10-27T10:02:31.007+0100 [ERROR] secrets.system.system_d1c6c5f9: lease renewal failed: lease_id=database/creds/readonly/vHGElhu0CbF6VB9XgAbXIHGa
  error=
  | failed to renew entry: resp: (*logical.Response)(nil) err: 1 error occurred:
  | \t* context canceled
  |

Only the "context canceled" error is displayed when the lease renewal request is cancelled.

@dvdlevanon-sgh

I'm getting the same with HA enabled, and not only this error: Vault itself stops functioning when more replicas join the cluster.

Worked around it by downgrading from 1.15.1 to 1.14.6.
