
Error "event not processed by enough 'sink' nodes" when file audit enabled in Vault 1.15 #23871

Closed
lboynton opened this issue Oct 26, 2023 · 1 comment · Fixed by #24238
@lboynton (Contributor) commented Oct 26, 2023

Describe the bug
Since upgrading to Vault 1.15, we are seeing occasional errors in the logs such as:

2023-10-26T22:04:43.804+0100 [ERROR] secrets.system.system_9ba98b00: lease renewal failed: lease_id=database/creds/readonly/weDQJArsWMkCUwVBf7zzOVwg
  error=
  | failed to renew entry: resp: (*logical.Response)(nil) err: 1 error occurred:
  | \t* timeout: context canceled
  | 
  
2023-10-26T22:04:43.805+0100 [ERROR] core: failed to audit response: request_path=sys/leases/renew
  error=
  | 1 error occurred:
  | \t* event not processed by enough 'sink' nodes
  | 
  

The first error is possibly due to network latency/timeouts to a database when using dynamic credentials. The second error is new and is due to the new audit logging behaviour in Vault 1.15. It causes the vault_audit_log_request_failure telemetry metric to be incremented and could therefore trigger alerts unnecessarily.

To Reproduce
I've managed to reproduce this with a vault server -dev instance locally:

  1. Follow the getting started guide using a local Vault instance and a local postgres instance running in docker.
  2. Enable audit logs with vault audit enable file file_path=/tmp/vault_audit.log
  3. Install toxiproxy locally which we will use to simulate some network latency/jitter. Start the server with toxiproxy-server.
  4. Set up a backend in toxiproxy toxiproxy-cli create -l localhost:5433 -u localhost:5432 postgres
  5. Add a "toxic" to simulate some network latency/jitter toxiproxy-cli toxic add -t latency -a latency=1000 -a jitter=2000 postgres
  6. Tell Vault to use toxiproxy instead of postgres direct:
vault write database/config/postgresql \
     plugin_name=postgresql-database-plugin \
     connection_url="postgresql://{{username}}:{{password}}@$POSTGRES_URL:5433/postgres?sslmode=disable" \
     allowed_roles=readonly \
     username="root" \
     password="rootpassword"
  7. Create some dynamic credentials: vault read database/creds/readonly
  8. Generate a curl command to renew the lease: vault lease renew --output-curl-string database/creds/readonly/weDQJArsWMkCUwVBf7zzOVwg (update the lease ID as needed)
  9. Update the curl command to have a timeout and run the command: curl --max-time 1.2 -X PUT -H "X-Vault-Request: true" -H "X-Vault-Token: $(vault print token)" -d '{"increment":0,"lease_id":"database/creds/readonly/weDQJArsWMkCUwVBf7zzOVwg"}' http://127.0.0.1:8200/v1/sys/leases/renew
  10. Review the Vault error log

You should see an event not processed by enough 'sink' nodes error here. This error can happen in a dynamic environment where requests are made to renew leases, which can take a bit of time, and the request is cancelled while it's being processed. This can happen in k8s when pods are spinning up/down frequently.

Expected behavior
Although the lease renewal failed, I would not expect this to cause an error when writing to the audit file. I don't believe it caused a problem in earlier versions of Vault.

Environment:

  • Vault Server Version (retrieve with vault status): 1.15.1
  • Vault CLI Version (retrieve with vault version): 1.15.1
  • Server Operating System/Architecture: macOS arm64/Linux amd64

Vault server configuration file(s): Not applicable as this is using a vault server -dev instance with no special configuration.

Setting VAULT_AUDIT_DISABLE_EVENTLOGGER=true as an env var to revert to the previous audit log behaviour stops the error appearing in the logs:

2023-10-27T10:02:31.007+0100 [ERROR] secrets.system.system_d1c6c5f9: lease renewal failed: lease_id=database/creds/readonly/vHGElhu0CbF6VB9XgAbXIHGa
  error=
  | failed to renew entry: resp: (*logical.Response)(nil) err: 1 error occurred:
  | \t* context canceled
  |

Only the "context canceled" error is displayed when the lease renewal request is cancelled.

@dvdlevanon-sgh

I'm getting the same with HA enabled, and not only this error: Vault itself stops functioning when more replicas join the cluster.

Worked around it by downgrading from 1.15.1 to 1.14.6.
