Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

401 Unauthorized for GCP related sinks after a while #10828

Closed
Tracked by #12648
dacaga opened this issue Jan 12, 2022 · 26 comments
Closed
Tracked by #12648

401 Unauthorized for GCP related sinks after a while #10828

dacaga opened this issue Jan 12, 2022 · 26 comments
Assignees
Labels
provider: gcp Anything `gcp` service provider related type: bug A code related bug.

Comments

@dacaga
Copy link

dacaga commented Jan 12, 2022

After executing properly for a while on a GKE pod, both of my GCP related sinks (Stackdriver and Google Cloud Storage) start giving me 401 Unauthorized errors.

The underlying librairies GcpAuthConfig/GcpCredentials don't seem to be handling JWT access token renewal/refetch from the Metadata Server properly (since I'm not using a service account private key).

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Vector Version

From the Docker image timberio/vector:0.18.1-debian

0.18.1

Vector Configuration File

[sinks.stackdriver]
type = "gcp_stackdriver_logs"
inputs = [ "stdin" ]
log_id = "logid"
project_id = "projectid"

[sinks.stackdriver.resource]
type = "global"

[sinks.gcs]
type = "gcp_cloud_storage"
inputs = [ "stdin" ]
bucket = "bucket"
key_prefix = "%F/"
compression = "gzip"
storage_class = "COLDLINE"

[sinks.gcs.encoding]
codec = "ndjson"

Actual Behavior

After a few hours of execution, the JWT access token expires and every time Vector tries to flush to its GCP related sinks I get this from stderr:

ERROR sink{component_kind="sink" component_id=gcs component_type=gcp_cloud_storage component_name=gcs}:request{request_id=6}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"

...

ERROR sink{component_kind="sink" component_id=stackdriver component_type=gcp_stackdriver_logs component_name=stackdriver}:request{request_id=192}: vector::sinks::util::sink: Response failed. response=Response { status: 401, version: HTTP/1.1, headers: {"www-authenticate": "Bearer realm=\"https://accounts.google.com/\", error=\"invalid_token\"", "vary": "X-Origin", "vary": "Referer", "vary": "Origin,Accept-Encoding", "content-type": "application/json; charset=UTF-8", "date": "Mon, 10 Jan 2022 00:10:14 GMT", "server": "ESF", "cache-control": "private", "x-xss-protection": "0", "x-frame-options": "SAMEORIGIN", "x-content-type-options": "nosniff", "accept-ranges": "none", "transfer-encoding": "chunked"}, body: b"{\n  \"error\": {\n    \"code\": 401,\n    \"message\": \"Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.\",\n    \"status\": \"UNAUTHENTICATED\",\n    \"details\": [\n      {\n        \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n        \"reason\": \"ACCESS_TOKEN_EXPIRED\",\n        \"domain\": \"googleapis.com\",\n        \"metadata\": {\n          \"method\": \"google.logging.v2.LoggingServiceV2.WriteLogEntries\",\n          \"service\": \"logging.googleapis.com\"\n        }\n      }\n    ]\n  }\n}\n" }
@dacaga dacaga added the type: bug A code related bug. label Jan 12, 2022
@spencergilbert
Copy link
Contributor

Is the auth tied to the ServiceAccount being used with the Vector pod? What version of Kubernetes are you running?

@dacaga
Copy link
Author

dacaga commented Jan 13, 2022

Is the auth tied to the ServiceAccount being used with the Vector pod?

Yes and I double checked, the owner of the created objects on the GCS bucket is the pod's ServiceAccount.

What version of Kubernetes are you running?

1.18.20-gke.4501

@spencergilbert
Copy link
Contributor

I'm wondering if it's similar to #8616 (comment), seems like Kubernetes has added some rotation times to service accounts.

@dacaga
Copy link
Author

dacaga commented Jan 14, 2022

I don't think it's related since, in this case, the GCS and Stackdriver sinks are authenticating using a token grabbed from the GKE Metadata Server here:

async fn get_token_implicit() -> Result<Token, GcpError> {
let req = http::Request::get(SERVICE_ACCOUNT_TOKEN_URL)
.header("Metadata-Flavor", "Google")
.body(hyper::Body::empty())
.unwrap();
let proxy = ProxyConfig::from_env();
let res = HttpClient::new(None, &proxy)
.context(BuildHttpClientSnafu)?
.send(req)
.await
.context(GetImplicitTokenSnafu)?;
let body = res.into_body();
let bytes = hyper::body::to_bytes(body)
.await
.context(GetTokenBytesSnafu)?;
// Token::from_str is irresponsible and may panic!
match serde_json::from_slice::<Token>(&bytes) {
Ok(token) => Ok(token),
Err(error) => Err(match serde_json::from_slice::<TokenErr>(&bytes) {
Ok(error) => GcpError::TokenFromJson { source: error },
Err(_) => GcpError::TokenJsonFromStr { source: error },
}),
}
}

It's kept up-to-date with the task:

vector/src/sinks/gcp/mod.rs

Lines 131 to 150 in ea0d002

pub fn spawn_regenerate_token(&self) {
let this = self.clone();
let period = this.token.read().unwrap().expires_in() as u64 / 2;
let interval = IntervalStream::new(tokio::time::interval(Duration::from_secs(period)));
let task = interval.for_each(move |_| {
let this = this.clone();
async move {
debug!("Renewing GCP authentication token.");
if let Err(error) = this.regenerate_token().await {
error!(
message = "Failed to update GCP authentication token.",
%error
);
}
}
});
tokio::spawn(task);
}
}

FYI: It has now been running in production for 2 days without any 401 error. It makes me think that it might be related to some sporadic and uncaught error with the Metadata Server.

Also, prior to the 401 errors I had, there are no signs in the logs that there had been a problem with the token renewal (nothing in the logs that says "Failed to update GCP authentication token.")

@spencergilbert
Copy link
Contributor

Ah, gotcha - wasn't sure what it got tied to for auth. We could look to add more logging into the regen process if it comes up again, otherwise I'm not sure there's a good course of action.

@dacaga
Copy link
Author

dacaga commented Jan 17, 2022

My thoughts too. It has not happened again yet. I'll keep en eye on it. You can close the issue if you want and I'll reopen if it happens again and I manage to catch more data.

@jszwedko jszwedko added the provider: gcp Anything `gcp` service provider related label Feb 8, 2022
@mloughran
Copy link

I am also seeing this sporadically. Scanning through vector logs this seems to recover automatically after < 1 minute, however I just experienced a 15 minute log outage on 1 VM, which again automatically recovered.

Running vector 0.19.1 on debian buster.

@jdoupe
Copy link

jdoupe commented May 6, 2022

I am having this issue as well. At first it seemed to happen hourly and I thought it might be related to token expiration, but now sometimes it'll run for several hours.

2022-05-06T00:30:17.542658Z ERROR sink{component_kind="sink" component_id=mylogs component_type=gcp_cloud_storage component_name=mylogs}:request{request_id=709}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"
vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)

To note: No k8s involved, and no recovery is experienced. Once it starts, it never clears up without a restart.

@jszwedko jszwedko mentioned this issue May 6, 2022
7 tasks
@bruceg
Copy link
Member

bruceg commented May 10, 2022

I think this should be fixed by #12645, which should be in tonight's nightly build. Could you test and see if this is still happening for you?

@jdoupe
Copy link

jdoupe commented May 12, 2022

Unfortunately it still occurred.

...
2022-05-11T21:31:02.875038Z  INFO vector: Vector has started. debug="false" version="0.22.0" arch="x86_64" build_id="e584ef2 2022-05-11"
...
2022-05-11T22:32:52.009788Z ERROR sink{component_kind="sink" component_id=mylogs component_type=gcp_cloud_storage component_name=gcs_doppler_logs}:request{request_id=12781}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"

Update:
It now consistently fails upon the first renewal attempt (after 1 hour).

@bruceg
Copy link
Member

bruceg commented May 18, 2022

I think this has been fixed by #12757 which was just merged, at least I was unable to reproduce the problem after several hours of continuous running. Are you set up to build from sources to test if this fixes the problem for you?

@jdoupe
Copy link

jdoupe commented May 19, 2022

🤞 I've built and have it running. I'll let it run overnight and post any updates.

Update: didn't make it past the first hour. :(

@jdoupe
Copy link

jdoupe commented May 19, 2022

In case you were wondering about how how much data we have flowing through, attached is a screenshot of vector top after an hour, right after the token goes bad (not that this indicates anything)

image

I ran today's nightly with debug, and there's no mention of it even trying to renew the token.

@jszwedko
Copy link
Member

@jdoupe which nightly did you run with (what does vector --version say?)? The nightly build actually failed due to a network issue so it was only published ~7 hours ago when I reran it.

@jdoupe
Copy link

jdoupe commented May 20, 2022

vector 0.22.0 (x86_64-unknown-linux-gnu 50dd7a7 2022-05-19)

@jszwedko
Copy link
Member

jszwedko commented May 20, 2022

Thanks for confirming @jdoupe . That should be after #12757 so it seems like maybe something is still amiss here. cc/ @bruceg (noting this is for the gcp_cloud_storage sink).

@bruceg
Copy link
Member

bruceg commented May 20, 2022

I have set up a test configs sending to both gcp_pubsub and gcp_cloud_storage and have yet to reproduce this problem. I added a debug statement in the code and verified that the refresh is happening. I'm not sure how to diagnose this further.

@jdoupe
Copy link

jdoupe commented May 20, 2022

Should I see the renewal debug output with the current nightly? (With just a single -v?)

@bruceg
Copy link
Member

bruceg commented May 20, 2022

No, I will be submitting a PR for adding the details, as it would be useful for future diagnosis. Edit: See #12814

@jdoupe
Copy link

jdoupe commented May 22, 2022

@bruceg - Thanks much for all your time on this...

Running vector 0.22.0 (x86_64-unknown-linux-gnu b002bfb 2022-05-22) for a few hours now...

Things looking good thus far. Just noting that it looks like the implicit renewal is never called.

$ grep -i "vector::gcp" vector.log.0.22.0.b002bfb.t8v
2022-05-22T15:34:47.368968Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T16:04:46.691864Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T16:04:46.691958Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T16:34:45.692488Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T16:34:45.692595Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T17:04:44.692505Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T17:04:44.692617Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T17:34:43.692100Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T17:34:43.692201Z DEBUG vector::gcp: Fetching GCP authentication token. ...
2022-05-22T18:04:42.691824Z DEBUG vector::gcp: Renewing GCP authentication token.
2022-05-22T18:04:42.691934Z DEBUG vector::gcp: Fetching GCP authentication token. ...

@bruceg
Copy link
Member

bruceg commented May 24, 2022

Great, thanks for confirming.

@bruceg bruceg closed this as completed May 24, 2022
@RobertSLane
Copy link

RobertSLane commented Oct 4, 2022

Hey @bruceg . I appreciate this issues has been closed for some months now, however we are seeing this problem when trying to send data to our GCS bucket from Vector.

We have Vector running in Kubernetes, and we have found that we can send data to GCS for some time after restarting our pods, but after a while we start getting 401s (assuming because Vector has failed to refresh its token). Any possibility this problem has been reintroduced?

The Vector image we are using is timberio/vector:0.24.1-alpine

This is the error we're seeing:
2022-10-04T02:34:38.228190Z ERROR sink{component_kind="sink" component_id=gcs component_type=gcp_cloud_storage component_name=gcs}:request{request_id=214}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 401 Unauthorized"

@bruceg
Copy link
Member

bruceg commented Oct 11, 2022

@RobertSLane Do you have healthchecks disabled by chance? The code for these sinks only starts the token refresh after a response from a healthcheck. See #13058 for details.

@jkachmar
Copy link

jkachmar commented Nov 29, 2022

i have health checks enabled but i'm seeing this with the GCP Cloud Monitoring sink. it seems to consistently fail with 401 errors after about an hour or so.

i've looked through the code and i guess this is the addition which is present in a couple different sinks (but not, afaict, the cloud monitoring one):

healthcheck_response(response, auth, HealthcheckError::TopicNotFound.into())

i have a local branch that i'm using with a workaround for #14890, so i'll see if it's not too much to add a similar health check to the cloud monitoring code path as well.

@ansel1
Copy link

ansel1 commented Apr 27, 2023

@jkachmar did you resolve this? We're still seeing this behavior with v0.29.1. After running for an hour, we get token expired errors:

2023-04-27T12:03:47.174065Z ERROR sink{component_kind="sink" component_id=gcp component_type=gcp_stackdriver_metrics component_name=gcp}:request{request_id=6983}: vector::sinks::util::sink: Response failed. response=Response { status: 401, version: HTTP/1.1, headers: {"www-authenticate": "Bearer realm="https://accounts.google.com/\", error="invalid_token"", "vary": "Origin", "vary": "X-Origin", "vary": "Referer", "content-type": "application/json; charset=UTF-8", "transfer-encoding": "chunked", "date": "Thu, 27 Apr 2023 12:03:47 GMT", "server": "ESF", "cache-control": "private", "x-xss-protection": "0", "x-frame-options": "SAMEORIGIN", "x-content-type-options": "nosniff"}, body: b"{\n "error": {\n "code": 401,\n "message": "Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.\",\n "status": "UNAUTHENTICATED",\n "details": [\n {\n "@type": "type.googleapis.com/google.rpc.ErrorInfo",\n "reason": "ACCESS_TOKEN_EXPIRED",\n "domain": "googleapis.com",\n "metadata": {\n "method": "google.monitoring.v3.MetricService.CreateTimeSeries",\n "service": "monitoring.googleapis.com"\n }\n }\n ]\n }\n}\n" }

@spencergilbert
Copy link
Contributor

@ansel1 would you mind filling out a new bug report and fill out the sections in the template?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
provider: gcp Anything `gcp` service provider related type: bug A code related bug.
Projects
None yet
Development

No branches or pull requests

9 participants