
Vector starts getting 401 after an hour of continuous log sending to GCP bucket #18432

Closed
infor7 opened this issue Aug 30, 2023 · 8 comments · Fixed by #18586
Labels
type: bug A code related bug.

Comments

@infor7

infor7 commented Aug 30, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hi!

Vector starts getting HTTP 401 statuses when using the gcp_cloud_storage sink after more than an hour of continuous operation. The Kubernetes pod authenticates with GCP using Workload Identity. This is a big problem for us because we are losing data: after a 401, logs are not retransmitted. https://vector.dev/docs/reference/configuration/sinks/gcp_cloud_storage/#retry-policy

After a few minutes it starts working properly again and sends data for another 25-55 minutes, then gets 401s again for the next few minutes. We have hundreds of pods experiencing 401s, and the start time and duration are totally random.

I see that you have a plan to allow retries on more statuses (#10870). I would really appreciate it if you enabled it for 401.

Getting back to the analysis of the issue.

When taking a tcpdump and analyzing it in Wireshark, I can see that tokens are always refreshed properly: I see 3 to 4 requests per hour to metadata.google.internal, and the first of them always returns expires_in 3599.
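For anyone reproducing this, the capture can be taken along these lines (a minimal sketch; the interface name and output file are assumptions):

# capture metadata-server traffic (plain HTTP, so the token refreshes are readable in Wireshark)
tcpdump -i eth0 -w metadata.pcap host metadata.google.internal and port 80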

To analyze it further, I used a Squid proxy to decrypt the request headers sent to the GCP bucket. I can see that Vector always uses a fresh token, but before the expiry time, requests using that token start returning 401.

To check whether this is a problem on the cloud side, together with GCP support we logged into the container of a pod that was experiencing 401s at the time of debugging, and executed curl to get a token (in the curl command I used a 0.24 user-agent header, but in reality we run Vector 0.30):

curl http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token --header "metadata-flavor: Google" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity"

The token had nearly 3000 seconds of validity remaining.
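For reference, the remaining lifetime can be read straight from the metadata response (a sketch; it assumes jq is available in the container):

# the token endpoint returns JSON with access_token, expires_in and token_type fields
curl -s --header "metadata-flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" | jq '.expires_in'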

I used that token to send data to the bucket using curl:

curl -X PUT --header "x-goog-storage-class: STANDARD" --header "content-type: application/json" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity" --header "authorization: Bearer $TOKEN_GCP" -d 'Test request' "https://storage.googleapis.com/<bucket_name>/test_upload-proxy-6fcfc6dd77-lv7w2.log"

and that request completed successfully. We also tried manually running a second Vector process in the same container, with the same configuration file; the second process started sending data without any 401s while the original Vector process kept getting 401s at the same time.

Could you help with debugging that problem?

Configuration

"sinks":
  "analytics_bucket":
    "batch":
      "max_bytes": 134217728
      "timeout_secs": 60
    "bucket": "<bucket_name>"
    "compression": "none"
    "encoding":
      "codec": "text"
    "inputs":
      - "analytics_enhance"
    "key_prefix": "download-proxy/"
    "type": "gcp_cloud_storage"
"sources":
  "analytics_logs":
    "exclude":
      - "*.gz"
      - "*.zip"
      - "*.swp"
    "include":
      - "/var/analytics/**/*.log"
      - "/var/analytics/**/*.jsonl"
      - "/box/var/log/application/v2_hdfs*/**/*.log"
      - "/box/var/log/application/v2_hdfs*/**/*.jsonl"
    "type": "file"
"transforms":
  "analytics_enhance":
    "inputs":
      - "analytics_logs"
    "source": |
      new_log = {}

      # Apply standard tags
      .az_name = "${AZ_NAME:-unknown_az}"
      .region = "${REGION:-unknown_region}"

      event = {}
      # special handling for v2_hdfs_logs
      if contains(string!(.file), "v2_hdfs_logs") {
          matches = parse_regex!(.file, r'.*/v2_hdfs_logs_(?P<topic>[a-zA-Z0-9_-]+).log')
          new_log._autolog_topic = get(matches, ["topic"]) ?? "unknown-topic"

          # These events are expected to be url-encoded with prefix
          # Prefix is unneeded, and is discarded
          this = split!(del(.message), " ")
          event, err = parse_query_string(this[-1])
          if err != null {
              # Invalid event
              log(err, level: "warn", rate_limit_secs: 60)
              abort
          }
      } else {
          matches = parse_regex!(.file, r'.*/(?P<topic>[a-zA-Z0-9_-]+).(log|jsonl)')
          new_log._autolog_topic = get(matches, ["topic"]) ?? "unknown-topic"

          this = del(.message)
          # Try to parse as json then url-encoded query string
          event, err = parse_json(this) ?? parse_query_string(this)
          if err != null {
              log(err, level: "warn", rate_limit_secs: 60)
              abort
          }
      }

      # Add each field of the message to the new event using percent encoding
      for_each(object!(event)) -> |key, value| {
          encoded_value = encode_percent(to_string!(value))
          new_log = set!(new_log, path: [key], data: encoded_value)
      }

      # Overwrite the event with our new log event, in url-encoded format
      . = encode_key_value(new_log, field_delimiter:"&")
    "type": "remap"

Version

0.30.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

#10870

infor7 added the type: bug label on Aug 30, 2023
@StephenWakely
Contributor

I haven't been able to reproduce this. I can send data to GCP for several hours without any authentication issues.

I can't think why gcp would be returning 401 as you have verified it is sending the correct token. I think that just retrying a 401 could be hiding a deeper problem, but since I can't reproduce it I can't really verify this.

I've created a custom build here based on this branch that should retry on receiving a 401. Would you be able to test this and see if it fixes the issue? We can then decide on the best way forward.

@infor7
Author

infor7 commented Sep 13, 2023

Thanks. I'll deploy the change in dev and stage and let you know if that resolves our problem with dropped logs.

Does Vector have any flag that forces it to send data to GCS over HTTP instead of HTTPS? Maybe collecting a full tcpdump would show us what is happening. I've collected the HTTP dump and the Squid access log separately.

@StephenWakely
Contributor

> Thanks. I'll deploy the change in dev and stage and let you know if that resolves our problem with dropped logs.
>
> Does Vector have any flag that forces it to send data to GCS over HTTP instead of HTTPS? Maybe collecting a full tcpdump would show us what is happening. I've collected the HTTP dump and the Squid access log separately.

No, unfortunately we hard-code HTTPS.

@infor7
Author

infor7 commented Sep 15, 2023

I've run the custom Vector build on dev pods of two applications, and they still get 401s and discard data, as shown in the attached charts.

[Screenshot 2023-09-15 at 14:14:47]

Version on pods looks correct.

[root@download-proxy-86459f76cf-s9bl4 /] UNKNOWN_NODE_NAME $ vector -V
vector 0.32.1 (x86_64-unknown-linux-gnu 9965884 2023-08-21 14:52:38.330227446)

We have Vector set up as a sidecar for each pod. As a workaround I'm now SIGTERM'ing each Vector sidecar every hour. At the moment this causes CPU spikes on start, when it needs to catch up on data, but we do not have a better alternative.

The restart is done by simply prefixing vector with the timeout command. With that workaround there are no 401s and no discarded logs.

timeout 1h /usr/bin/vector --config-dir /etc/vector
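As a sketch, the sidecar entrypoint ends up looking like this (it assumes timeout's default SIGTERM and relies on the pod's restart policy to bring Vector back up after it exits):

# run Vector for at most one hour; the container restart after exit is what refreshes the process
exec timeout 1h /usr/bin/vector --config-dir /etc/vector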

@StephenWakely
Contributor

> Version on pods looks correct.

The version for the custom build should say:

〉bin/vector -V
vector 0.32.1 (x86_64-unknown-linux-gnu 989ad14 2023-09-12 11:41:27.237010392)

Date should be September 12.

@infor7
Author

infor7 commented Sep 15, 2023

Oh, you are right, I just put the wrong version number in the deploy pipeline. Thank you!
I'll deploy the change with the right version and return with results on Monday.

@infor7
Author

infor7 commented Sep 18, 2023

The custom build works better than I expected! Thank you!

After deployment on dev, not only did discarded events drop to 0, but 401 responses are no longer present.

[Screenshot 2023-09-18 at 11:21:11]

@StephenWakely do you have a rough estimate of when it will be possible to include the fix in the main build?

@StephenWakely
Contributor

I've pushed a PR and unless anyone can think of a good reason why it shouldn't go ahead, we should get it into the next release.
