-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vector start getting 401 after hour of continous log sending to gcp bucket #18432
Comments
I haven't been able to reproduce this. I can send data to gcp for several hours without any authentication issues. I can't think why gcp would be returning 401 as you have verified it is sending the correct token. I think that just retrying a 401 could be hiding a deeper problem, but since I can't reproduce it I can't really verify this. I've created a custom build here based on this branch that should retry on receiving a 401. If you can, would you be able to test this and see if it fixes the issue? We can then decide what the best way to move forward is. |
Thanks. I'll deploy change in dev and stage and let you know if that resolve our problem with doped logs. Does vector have any flag which force it to send data to GCS using http instead of https? Maybe collecting full tcp dump will show us what is happening. I've collected the http dump and access log from squid separately. |
No, unfortunately we hard code the |
I've run custom vector on dev pods of two applications and they still have 401 and data is discarded on attached charts. Version on pods looks correct.
We have vector set up as sidecar for each pod. I've made workaround by SIGTERM'ing each vector sidecar every hour. At the moment it is causing CPU spikes on start, when it needs to catch up on data, but we do not have better alternative. Restart is done by just prefixing vector with timeout command. With that workaround there are no 401 nor discarded logs.
|
The version for the custom build should say: 〉bin/vector -V
vector 0.32.1 (x86_64-unknown-linux-gnu 989ad14 2023-09-12 11:41:27.237010392) Date should be September 12. |
Oh, you are right I've just put wrong version number in deploy pipeline. Thank you! |
The custom build works better than I expected! Thank you! After deployment on dev not only discarded events go to 0 but responses 401 are not longer present. @StephenWakely do you have rough estimate when will be possible to include the fix into main build? |
I've pushed a PR and unless anyone can think of a good reason why it shouldn't go ahead, we should get it into the next release. |
A note for the community
Problem
Hi!
Vector is getting http 401 statuses when using gcp_cloud_storage sink after more than 1 hour of continuous work. To authenticate with gcp kubernetes pod is using workload identity. That is big problem for us because we are loosing data - after 401 logs are not re transmitted. https://vector.dev/docs/reference/configuration/sinks/gcp_cloud_storage/#retry-policy
After few minutes it starts working properly and send data for another 25-55 mins to again catch 401 for few next minutes. We have hundreds of pods that experiencing 401 and start time and duration is totally random.
I see that you have plan to allow retry on more statues. I'll really appreciate if you enable it on 401
#10870
Getting back to analysis of issue.
When taking tcp dump and analyzing it in wireshark I can see that tokens are always refreshed properly - in hour I see 3 to 4 requests per hour to
metadata.google.internal
and always first of them is hitting expires_in 3599.To analyze it further I used squid proxy to decrypt requests headers that is send to gcp bucket. I see that vector is always using fresh token, but before expiry time requests using that token returns 401.
To check if that is cloud problem with GCP support we logged into container in pod, which in the time of debugging experienced 401, and execute curl to get token (in curl I have used header 0.24 but in reality we have vector 0.30)
curl http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token --header "metadata-flavor: Google" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity"
token has nearly 3000 seconds of validity.
I've used that token to send data to bucket using curl
curl -X PUT --header "x-goog-storage-class: STANDARD" --header "content-type: application/json" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity" --header "authorization: Bearer $TOKEN_GCP" -d 'Test request' "https://storage.googleapis.com/<bucket_name>/test_upload-proxy-6fcfc6dd77-lv7w2.log"
and that request ended successfully. We tried to run second process of vector in the same container manually using same configuration file and second process started sending data without any 401 while original vector process still experienced 401 in same time.
Could you help with debugging that problem?
Configuration
Version
0.30.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
#10870
The text was updated successfully, but these errors were encountered: