401 Unauthorized for GCP related sinks after a while #10828

dacaga · 2022-01-12T22:24:07Z

After executing properly for a while on a GKE pod, both of my GCP related sinks (Stackdriver and Google Cloud Storage) start giving me 401 Unauthorized errors.

~~The underlying librairies~~ GcpAuthConfig/GcpCredentials don't seem to be handling ~~JWT~~ access token renewal/refetch from the Metadata Server properly (since I'm not using a service account private key).

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Vector Version

From the Docker image timberio/vector:0.18.1-debian

0.18.1

Vector Configuration File

[sinks.stackdriver]
type = "gcp_stackdriver_logs"
inputs = [ "stdin" ]
log_id = "logid"
project_id = "projectid"

[sinks.stackdriver.resource]
type = "global"

[sinks.gcs]
type = "gcp_cloud_storage"
inputs = [ "stdin" ]
bucket = "bucket"
key_prefix = "%F/"
compression = "gzip"
storage_class = "COLDLINE"

[sinks.gcs.encoding]
codec = "ndjson"

Actual Behavior

After a few hours of execution, the ~~JWT~~ access token expires and every time Vector tries to flush to its GCP related sinks I get this from stderr:

ERROR sink{component_kind="sink" component_id=gcs component_type=gcp_cloud_storage component_name=gcs}:request{request_id=6}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"

...

ERROR sink{component_kind="sink" component_id=stackdriver component_type=gcp_stackdriver_logs component_name=stackdriver}:request{request_id=192}: vector::sinks::util::sink: Response failed. response=Response { status: 401, version: HTTP/1.1, headers: {"www-authenticate": "Bearer realm=\"https://accounts.google.com/\", error=\"invalid_token\"", "vary": "X-Origin", "vary": "Referer", "vary": "Origin,Accept-Encoding", "content-type": "application/json; charset=UTF-8", "date": "Mon, 10 Jan 2022 00:10:14 GMT", "server": "ESF", "cache-control": "private", "x-xss-protection": "0", "x-frame-options": "SAMEORIGIN", "x-content-type-options": "nosniff", "accept-ranges": "none", "transfer-encoding": "chunked"}, body: b"{\n  \"error\": {\n    \"code\": 401,\n    \"message\": \"Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.\",\n    \"status\": \"UNAUTHENTICATED\",\n    \"details\": [\n      {\n        \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n        \"reason\": \"ACCESS_TOKEN_EXPIRED\",\n        \"domain\": \"googleapis.com\",\n        \"metadata\": {\n          \"method\": \"google.logging.v2.LoggingServiceV2.WriteLogEntries\",\n          \"service\": \"logging.googleapis.com\"\n        }\n      }\n    ]\n  }\n}\n" }

spencergilbert · 2022-01-13T14:38:16Z

Is the auth tied to the ServiceAccount being used with the Vector pod? What version of Kubernetes are you running?

dacaga · 2022-01-13T16:38:26Z

> Is the auth tied to the ServiceAccount being used with the Vector pod?

Yes and I double checked, the owner of the created objects on the GCS bucket is the pod's ServiceAccount.

> What version of Kubernetes are you running?

1.18.20-gke.4501

spencergilbert · 2022-01-13T17:00:51Z

I'm wondering if it's similar to #8616 (comment), seems like Kubernetes has added some rotation times to service accounts.

dacaga · 2022-01-14T21:02:32Z

I don't think it's related since, in this case, the GCS and Stackdriver sinks are authenticating using a token grabbed from the GKE Metadata Server here:

vector/src/sinks/gcp/mod.rs

Lines 60 to 86 in 3160892

    
async fn get_token_implicit() -> Result<Token, GcpError> {
    let req = http::Request::get(SERVICE_ACCOUNT_TOKEN_URL)
        .header("Metadata-Flavor", "Google")
        .body(hyper::Body::empty())
        .unwrap();

    let proxy = ProxyConfig::from_env();
    let res = HttpClient::new(None, &proxy)
        .context(BuildHttpClientSnafu)?
        .send(req)
        .await
        .context(GetImplicitTokenSnafu)?;

    let body = res.into_body();
    let bytes = hyper::body::to_bytes(body)
        .await
        .context(GetTokenBytesSnafu)?;

    // Token::from_str is irresponsible and may panic!
    match serde_json::from_slice::<Token>(&bytes) {
        Ok(token) => Ok(token),
        Err(error) => Err(match serde_json::from_slice::<TokenErr>(&bytes) {
            Ok(error) => GcpError::TokenFromJson { source: error },
            Err(_) => GcpError::TokenJsonFromStr { source: error },
        }),
    }
}

It's kept up-to-date with the task:

vector/src/sinks/gcp/mod.rs

Lines 131 to 150 in ea0d002

    
    pub fn spawn_regenerate_token(&self) {
        let this = self.clone();

        let period = this.token.read().unwrap().expires_in() as u64 / 2;
        let interval = IntervalStream::new(tokio::time::interval(Duration::from_secs(period)));
        let task = interval.for_each(move |_| {
            let this = this.clone();
            async move {
                debug!("Renewing GCP authentication token.");
                if let Err(error) = this.regenerate_token().await {
                    error!(
                        message = "Failed to update GCP authentication token.",
                        %error
                    );
                }
            }
        });

        tokio::spawn(task);
    }
}

FYI: It has now been running in production for 2 days without any 401 error. It makes me think that it might be related to some sporadic and uncaught error with the Metadata Server.

Also, prior to the 401 errors I had, there are no signs in the logs that there had been a problem with the token renewal (nothing in the logs that says "Failed to update GCP authentication token.")
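
Editorial sketch: in the `spawn_regenerate_token` snippet quoted above, the refresh period is computed once, from the `expires_in` of the token held at spawn time. One possible weakness, offered here as an assumption and not as a confirmed root cause, is that a later fetch could return a token with less remaining lifetime than the first one, in which case a fixed period might outlast it. The sketch below recomputes the delay from each freshly fetched token instead. `next_refresh_delay`, `refresh_schedule`, and `MIN_DELAY_SECS` are hypothetical names and are not part of Vector.

```rust
use std::time::Duration;

/// Hypothetical floor so that a nearly expired token does not cause a busy loop.
const MIN_DELAY_SECS: u64 = 1;

/// Delay before the next refresh, derived from the lifetime (in seconds)
/// reported by the token that was just fetched, not by the first token.
fn next_refresh_delay(expires_in_secs: u64) -> Duration {
    Duration::from_secs((expires_in_secs / 2).max(MIN_DELAY_SECS))
}

/// Given the `expires_in` reported by a sequence of fetches, returns the
/// delay that would be chosen after each one.
fn refresh_schedule(lifetimes_secs: &[u64]) -> Vec<Duration> {
    lifetimes_secs
        .iter()
        .map(|&secs| next_refresh_delay(secs))
        .collect()
}
```

With this scheme, a fetch that reports 600 seconds of remaining lifetime schedules the next refresh after 300 seconds, regardless of how long the first token was valid.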

spencergilbert · 2022-01-14T21:22:30Z

Ah, gotcha - wasn't sure what it got tied to for auth. We could look to add more logging into the regen process if it comes up again, otherwise I'm not sure there's a good course of action.

dacaga · 2022-01-17T17:19:22Z

My thoughts too. It has not happened again yet. I'll keep an eye on it. You can close the issue if you want, and I'll reopen it if it happens again and I manage to catch more data.

mloughran · 2022-04-22T15:14:15Z

I am also seeing this sporadically. Scanning through vector logs this seems to recover automatically after < 1 minute, however I just experienced a 15 minute log outage on 1 VM, which again automatically recovered.

Running vector 0.19.1 on debian buster.

jdoupe · 2022-05-06T01:31:52Z

I am having this issue as well. At first it seemed to happen hourly and I thought it might be related to token expiration, but now sometimes it'll run for several hours.

2022-05-06T00:30:17.542658Z ERROR sink{component_kind="sink" component_id=mylogs component_type=gcp_cloud_storage component_name=mylogs}:request{request_id=709}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"

vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)

To note: No k8s involved, and no recovery is experienced. Once it starts, it never clears up without a restart.

bruceg · 2022-05-10T20:45:10Z

I think this should be fixed by #12645, which should be in tonight's nightly build. Could you test and see if this is still happening for you?

jdoupe · 2022-05-12T02:52:05Z

Unfortunately it still occurred.

...
2022-05-11T21:31:02.875038Z  INFO vector: Vector has started. debug="false" version="0.22.0" arch="x86_64" build_id="e584ef2 2022-05-11"
...
2022-05-11T22:32:52.009788Z ERROR sink{component_kind="sink" component_id=mylogs component_type=gcp_cloud_storage component_name=gcs_doppler_logs}:request{request_id=12781}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"

Update:
It now consistently fails upon the first renewal attempt (after 1 hour).

bruceg · 2022-05-18T14:40:22Z

I think this has been fixed by #12757 which was just merged, at least I was unable to reproduce the problem after several hours of continuous running. Are you set up to build from sources to test if this fixes the problem for you?

jdoupe · 2022-05-19T00:18:23Z

🤞 I've built and have it running. I'll let it run overnight and post any updates.

Update: didn't make it past the first hour. :(

jdoupe · 2022-05-19T21:33:11Z

In case you were wondering about how much data we have flowing through, attached is a screenshot of vector top after an hour, right after the token goes bad (not that this indicates anything).

I ran today's nightly with debug, and there's no mention of it even trying to renew the token.

jszwedko · 2022-05-19T22:05:16Z

@jdoupe which nightly did you run with (what does vector --version say?)? The nightly build actually failed due to a network issue so it was only published ~7 hours ago when I reran it.

jdoupe · 2022-05-20T00:20:21Z

vector 0.22.0 (x86_64-unknown-linux-gnu 50dd7a7 2022-05-19)

jszwedko · 2022-05-20T00:44:15Z

Thanks for confirming @jdoupe . That should be after #12757 so it seems like maybe something is still amiss here. cc/ @bruceg (noting this is for the gcp_cloud_storage sink).

bruceg · 2022-05-20T22:27:23Z

I have set up test configs sending to both gcp_pubsub and gcp_cloud_storage and have yet to reproduce this problem. I added a debug statement in the code and verified that the refresh is happening. I'm not sure how to diagnose this further.

jdoupe · 2022-05-20T22:38:01Z

Should I see the renewal debug output with the current nightly? (With just a single -v?)

bruceg · 2022-05-20T22:43:17Z

No, I will be submitting a PR for adding the details, as it would be useful for future diagnosis. Edit: See #12814

jdoupe · 2022-05-22T18:22:23Z

@bruceg - Thanks much for all your time on this...

Running vector 0.22.0 (x86_64-unknown-linux-gnu b002bfb 2022-05-22) for a few hours now...

Things looking good thus far. Just noting that it looks like the implicit renewal is never called.

$ grep -i "vector::gcp" vector.log.0.22.0.b002bfb.t8v
2022-05-22T15:34:47.368968Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T16:04:46.691864Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T16:04:46.691958Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T16:34:45.692488Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T16:34:45.692595Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T17:04:44.692505Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T17:04:44.692617Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T17:34:43.692100Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T17:34:43.692201Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T18:04:42.691824Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T18:04:42.691934Z DEBUG vector::gcp: Fetching GCP authentication token. ...
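
Editorial sketch: the "Renewing" entries in the log above are spaced 1799 seconds apart. That spacing is consistent with the `expires_in / 2` period in the `spawn_regenerate_token` snippet quoted earlier in the thread, if the token lifetime was about 3599 seconds. The lifetime itself does not appear in the log, so that figure is an assumption. The helper names below are hypothetical, and the timestamps are truncated to whole seconds.

```rust
/// Converts an "HH:MM:SS" clock time into seconds since midnight.
fn clock_to_secs(clock: &str) -> Option<u64> {
    let mut parts = clock.split(':').map(|p| p.parse::<u64>().ok());
    let h = parts.next()??;
    let m = parts.next()??;
    let s = parts.next()??;
    Some(h * 3600 + m * 60 + s)
}

/// Gaps in seconds between consecutive clock times on the same day.
/// Returns `None` if a time fails to parse or the times are not ascending.
fn gaps_secs(clocks: &[&str]) -> Option<Vec<u64>> {
    let secs: Option<Vec<u64>> = clocks.iter().map(|c| clock_to_secs(c)).collect();
    let secs = secs?;
    secs.windows(2).map(|w| w[1].checked_sub(w[0])).collect()
}
```

Feeding in the five renewal times from the log (16:04:46 through 18:04:42) yields four gaps of 1799 seconds each, which equals 3599 / 2 under integer division.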

bruceg · 2022-05-24T14:53:46Z

Great, thanks for confirming.

RobertSLane · 2022-10-04T08:50:44Z

Hey @bruceg . I appreciate this issue has been closed for some months now, however we are seeing this problem when trying to send data to our GCS bucket from Vector.

We have Vector running in Kubernetes, and we have found that we can send data to GCS for some time after restarting our pods, but after a while we start getting 401s (presumably because Vector has failed to refresh its token). Any possibility this problem has been reintroduced?

The Vector image we are using is timberio/vector:0.24.1-alpine

This is the error we're seeing:
2022-10-04T02:34:38.228190Z ERROR sink{component_kind="sink" component_id=gcs component_type=gcp_cloud_storage component_name=gcs}:request{request_id=214}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"

bruceg · 2022-10-11T20:04:53Z

@RobertSLane Do you have healthchecks disabled by chance? The code for these sinks only starts the token refresh after a response from a healthcheck. See #13058 for details.
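
Editorial sketch: a toy model of the coupling described in this comment, where the token refresh only starts from the healthcheck path, so a sink with healthchecks disabled never refreshes its token. This is not Vector's code; `SinkStartup`, `start_gated`, and `start_decoupled` are hypothetical names used only to make the control flow concrete.

```rust
/// Toy model of a sink's start-up state; not one of Vector's actual types.
struct SinkStartup {
    healthcheck_enabled: bool,
    refresh_task_started: bool,
}

impl SinkStartup {
    fn new(healthcheck_enabled: bool) -> Self {
        SinkStartup {
            healthcheck_enabled,
            refresh_task_started: false,
        }
    }

    /// Models the behaviour described in the comment: the refresh task is
    /// started only once a healthcheck response has been handled.
    fn start_gated(&mut self) {
        if self.healthcheck_enabled {
            self.refresh_task_started = true;
        }
    }

    /// Alternative wiring: start the refresh task whether or not a
    /// healthcheck runs.
    fn start_decoupled(&mut self) {
        self.refresh_task_started = true;
    }
}
```

In this model, `SinkStartup::new(false)` followed by `start_gated()` leaves `refresh_task_started` false, which corresponds to the symptom reported here: requests succeed until the first token expires and then fail with 401.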

jkachmar · 2022-11-29T22:05:23Z

i have health checks enabled but i'm seeing this with the GCP Cloud Monitoring sink. it seems to consistently fail with 401 errors after about an hour or so.

i've looked through the code and i guess this is the addition which is present in a couple different sinks (but not, afaict, the cloud monitoring one):

vector/src/sinks/gcp/pubsub.rs

Line 224 in ea17c1f

healthcheck_response(response, auth, HealthcheckError::TopicNotFound.into())

i have a local branch that i'm using with a workaround for #14890, so i'll see if it's not too much to add a similar health check to the cloud monitoring code path as well.

ansel1 · 2023-04-27T16:11:10Z

@jkachmar did you resolve this? We're still seeing this behavior with v0.29.1. After running for an hour, we get token expired errors:

2023-04-27T12:03:47.174065Z ERROR sink{component_kind="sink" component_id=gcp component_type=gcp_stackdriver_metrics component_name=gcp}:request{request_id=6983}: vector::sinks::util::sink: Response failed. response=Response { status: 401, version: HTTP/1.1, headers: {"www-authenticate": "Bearer realm=\"https://accounts.google.com/\", error=\"invalid_token\"", "vary": "Origin", "vary": "X-Origin", "vary": "Referer", "content-type": "application/json; charset=UTF-8", "transfer-encoding": "chunked", "date": "Thu, 27 Apr 2023 12:03:47 GMT", "server": "ESF", "cache-control": "private", "x-xss-protection": "0", "x-frame-options": "SAMEORIGIN", "x-content-type-options": "nosniff"}, body: b"{\n \"error\": {\n \"code\": 401,\n \"message\": \"Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.\",\n \"status\": \"UNAUTHENTICATED\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n \"reason\": \"ACCESS_TOKEN_EXPIRED\",\n \"domain\": \"googleapis.com\",\n \"metadata\": {\n \"method\": \"google.monitoring.v3.MetricService.CreateTimeSeries\",\n \"service\": \"monitoring.googleapis.com\"\n }\n }\n ]\n }\n}\n" }

spencergilbert · 2023-05-01T15:40:30Z

@ansel1 would you mind opening a new bug report and filling out the sections in the template?

dacaga added the type: bug A code related bug. label Jan 12, 2022

jszwedko added the provider: gcp Anything `gcp` service provider related label Feb 8, 2022

jszwedko mentioned this issue May 6, 2022

GCP pub/sub issues #12648

Closed

7 tasks

jszwedko added this to the Vector v0.22.0 milestone May 16, 2022

bruceg self-assigned this May 16, 2022

bruceg mentioned this issue May 17, 2022

fix(gcp_pubsub source): Fix handling of auth token #12757

Merged

bruceg closed this as completed May 24, 2022

fraserdarwent mentioned this issue Oct 6, 2022

gcp_cloud_storage sink token refresh failing #13058

Closed

ansel1 mentioned this issue May 1, 2023

401 Unauthorized for GCP Cloud Monitoring sink after an hour #17263

Closed

nkinkade mentioned this issue May 31, 2023

gcp_stackdriver_logs: 401 Unauthorized (ACCESS_TOKEN_EXPIRED) #17559

Closed
