
Vector starts getting 401 after an hour of continuous log sending to GCP bucket #18432

Closed
infor7 opened this issue Aug 30, 2023 · 8 comments · Fixed by #18586
Labels
type: bug A code related bug.

Comments

@infor7

infor7 commented Aug 30, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hi!

Vector starts getting HTTP 401 statuses when using the gcp_cloud_storage sink after more than an hour of continuous operation. The Kubernetes pod authenticates with GCP using Workload Identity. This is a big problem for us because we are losing data: after a 401, logs are not retransmitted. https://vector.dev/docs/reference/configuration/sinks/gcp_cloud_storage/#retry-policy

After a few minutes it starts working properly again and sends data for another 25-55 minutes, then gets 401s again for the next few minutes. We have hundreds of pods experiencing 401s, and the start time and duration are totally random.

I see that you have a plan to allow retries on more statuses (#10870). I would really appreciate it if you enabled it for 401.

Getting back to the analysis of the issue.

When taking a tcpdump and analyzing it in Wireshark, I can see that tokens are always refreshed properly: I see 3 to 4 requests per hour to metadata.google.internal, and the first of them always returns expires_in 3599.
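For anyone reproducing this, the capture can be taken along these lines (a minimal sketch; the interface name and output file are assumptions):

# capture metadata-server traffic (plain HTTP, so the token refreshes are readable in Wireshark)
tcpdump -i eth0 -w metadata.pcap host metadata.google.internal and port 80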

To analyze it further, I used a Squid proxy to decrypt the request headers sent to the GCP bucket. I can see that Vector always uses a fresh token, but before the expiry time, requests using that token start returning 401.

To check whether this is a problem on the cloud side, together with GCP support we logged into the container of a pod that was experiencing 401s at the time of debugging, and executed curl to get a token (in the curl command I used a 0.24 user-agent header, but in reality we run Vector 0.30):

curl http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token --header "metadata-flavor: Google" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity"

The token had nearly 3000 seconds of validity remaining.
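For reference, the remaining lifetime can be read straight from the metadata response (a sketch; it assumes jq is available in the container):

# the token endpoint returns JSON with access_token, expires_in and token_type fields
curl -s --header "metadata-flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" | jq '.expires_in'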

I used that token to send data to the bucket using curl:

curl -X PUT --header "x-goog-storage-class: STANDARD" --header "content-type: application/json" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity" --header "authorization: Bearer $TOKEN_GCP" -d 'Test request' "https://storage.googleapis.com/<bucket_name>/test_upload-proxy-6fcfc6dd77-lv7w2.log"

and that request completed successfully. We also tried manually running a second Vector process in the same container, with the same configuration file; the second process started sending data without any 401s while the original Vector process kept getting 401s at the same time.

Could you help with debugging that problem?

Configuration

"sinks":
  "analytics_bucket":
    "batch":
      "max_bytes": 134217728
      "timeout_secs": 60
    "bucket": "<bucket_name>"
    "compression": "none"
    "encoding":
      "codec": "text"
    "inputs":
      - "analytics_enhance"
    "key_prefix": "download-proxy/"
    "type": "gcp_cloud_storage"
"sources":
  "analytics_logs":
    "exclude":
      - "*.gz"
      - "*.zip"
      - "*.swp"
    "include":
      - "/var/analytics/**/*.log"
      - "/var/analytics/**/*.jsonl"
      - "/box/var/log/application/v2_hdfs*/**/*.log"
      - "/box/var/log/application/v2_hdfs*/**/*.jsonl"
    "type": "file"
"transforms":
  "analytics_enhance":
    "inputs":
      - "analytics_logs"
    "source": |
      new_log = {}

      # Apply standard tags
      .az_name = "${AZ_NAME:-unknown_az}"
      .region = "${REGION:-unknown_region}"

      event = {}
      # special handling for v2_hdfs_logs
      if contains(string!(.file), "v2_hdfs_logs") {
          matches = parse_regex!(.file, r'.*/v2_hdfs_logs_(?P<topic>[a-zA-Z0-9_-]+).log')
          new_log._autolog_topic = get(matches, ["topic"]) ?? "unknown-topic"

          # These events are expected to be url-encoded with prefix
          # Prefix is unneeded, and is discarded
          this = split!(del(.message), " ")
          event, err = parse_query_string(this[-1])
          if err != null {
              # Invalid event
              log(err, level: "warn", rate_limit_secs: 60)
              abort
          }
      } else {
          matches = parse_regex!(.file, r'.*/(?P<topic>[a-zA-Z0-9_-]+).(log|jsonl)')
          new_log._autolog_topic = get(matches, ["topic"]) ?? "unknown-topic"

          this = del(.message)
          # Try to parse as json then url-encoded query string
          event, err = parse_json(this) ?? parse_query_string(this)
          if err != null {
              log(err, level: "warn", rate_limit_secs: 60)
              abort
          }
      }

      # Add each field of the message to the new event using percent encoding
      for_each(object!(event)) -> |key, value| {
          encoded_value = encode_percent(to_string!(value))
          new_log = set!(new_log, path: [key], data: encoded_value)
      }

      # Overwrite the event with our new log event, in url-encoded format
      . = encode_key_value(new_log, field_delimiter:"&")
    "type": "remap"

Version

0.30.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

#10870

infor7 added the type: bug label on Aug 30, 2023
@StephenWakely
Contributor

I haven't been able to reproduce this. I can send data to GCP for several hours without any authentication issues.

I can't think why gcp would be returning 401 as you have verified it is sending the correct token. I think that just retrying a 401 could be hiding a deeper problem, but since I can't reproduce it I can't really verify this.

I've created a custom build here based on this branch that should retry on receiving a 401. Would you be able to test this and see if it fixes the issue? We can then decide on the best way forward.

@infor7
Author

infor7 commented Sep 13, 2023

Thanks. I'll deploy the change in dev and stage and let you know if that resolves our problem with dropped logs.

Does Vector have any flag that forces it to send data to GCS over HTTP instead of HTTPS? Maybe collecting a full tcpdump would show us what is happening. I've collected the HTTP dump and the Squid access log separately.

@StephenWakely
Contributor

> Thanks. I'll deploy the change in dev and stage and let you know if that resolves our problem with dropped logs.
>
> Does Vector have any flag that forces it to send data to GCS over HTTP instead of HTTPS? Maybe collecting a full tcpdump would show us what is happening. I've collected the HTTP dump and the Squid access log separately.

No, unfortunately we hard-code HTTPS.

@infor7
Author

infor7 commented Sep 15, 2023

I've run the custom Vector build on dev pods of two applications, and they still get 401s and discard data, as shown in the attached charts.

[Screenshot 2023-09-15 at 14:14:47]

Version on pods looks correct.

[root@download-proxy-86459f76cf-s9bl4 /] UNKNOWN_NODE_NAME $ vector -V
vector 0.32.1 (x86_64-unknown-linux-gnu 9965884 2023-08-21 14:52:38.330227446)

We have Vector set up as a sidecar for each pod. As a workaround I'm now SIGTERM'ing each Vector sidecar every hour. At the moment this causes CPU spikes on start, when it needs to catch up on data, but we do not have a better alternative.

The restart is done by simply prefixing vector with the timeout command. With that workaround there are no 401s and no discarded logs.

timeout 1h /usr/bin/vector --config-dir /etc/vector
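As a sketch, the sidecar entrypoint ends up looking like this (it assumes timeout's default SIGTERM and relies on the pod's restart policy to bring Vector back up after it exits):

# run Vector for at most one hour; the container restart after exit is what refreshes the process
exec timeout 1h /usr/bin/vector --config-dir /etc/vector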

@StephenWakely
Contributor

> Version on pods looks correct.

The version for the custom build should say:

〉bin/vector -V
vector 0.32.1 (x86_64-unknown-linux-gnu 989ad14 2023-09-12 11:41:27.237010392)

Date should be September 12.

@infor7
Author

infor7 commented Sep 15, 2023

Oh, you are right, I just put the wrong version number in the deploy pipeline. Thank you!
I'll deploy the change with the right version and return with results on Monday.

@infor7
Author

infor7 commented Sep 18, 2023

The custom build works better than I expected! Thank you!

After deployment on dev, not only did discarded events drop to 0, but 401 responses are no longer present.

[Screenshot 2023-09-18 at 11:21:11]

@StephenWakely do you have a rough estimate of when it will be possible to include the fix in the main build?

@StephenWakely
Contributor

I've pushed a PR and unless anyone can think of a good reason why it shouldn't go ahead, we should get it into the next release.
