
Consider expanding the cases where Vector retries requests #10870

Open
jszwedko opened this issue Jan 14, 2022 · 27 comments
Labels
domain: sinks (Anything related to the Vector's sinks)
meta: idea (Anything in the idea phase. Needs further discussion and consensus before work can begin.)

Comments

@jszwedko
Member

jszwedko commented Jan 14, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Current behavior

Vector's retry behavior varies by sink, but typically we retry requests whose failures appear temporary and are expected to resolve on their own, such as HTTP 503s and 429s. Failures viewed as non-temporary, such as HTTP 403s and 404s, are not retried: the events in those requests are dropped and Vector continues processing.

Possible issue

The above means that Vector drops events under circumstances like:

  • Misconfiguration in Vector, like using an invalid Datadog API key
  • Misconfiguration in upstreams, such as a deploy of an internal Elasticsearch instance causing it to start returning 403s where the credentials were previously valid
  • Provider outages, like a GCP outage that caused customer load balancers to start returning 404s

Under these circumstances, I think it is reasonable for users to want Vector not to drop events, but instead to retry and block processing until a human can intervene and fix the issue.

Idea

One idea would be to retry all failures up to a maximum number of retries, after which failed batches would be routed to a dead-letter queue (unimplemented).

I'm curious to hear other thoughts here though. This is a nuanced issue.
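
To make the idea concrete, here is a minimal sketch of what "retry everything up to a limit, then dead-letter" could look like; the `DeadLetterQueue` trait and `send_with_retries` helper are hypothetical and not existing Vector APIs.

```rust
use std::time::Duration;

/// Hypothetical destination for batches that exhaust their retry budget.
trait DeadLetterQueue<B> {
    fn enqueue(&mut self, batch: B);
}

/// Retry every failure up to `max_retries`; once the budget is exhausted,
/// hand the batch to the dead-letter queue instead of dropping the events.
fn send_with_retries<B, D, F>(
    batch: B,
    mut try_send: F,
    dlq: &mut D,
    max_retries: usize,
    backoff: Duration,
) where
    F: FnMut(&B) -> Result<(), ()>,
    D: DeadLetterQueue<B>,
{
    for attempt in 0..=max_retries {
        if try_send(&batch).is_ok() {
            return;
        }
        if attempt < max_retries {
            // Linear backoff between attempts; a real implementation would
            // likely use exponential backoff with jitter.
            std::thread::sleep(backoff * (attempt as u32 + 1));
        }
    }
    // Retry budget exhausted: route the batch to the dead-letter queue
    // instead of dropping the events.
    dlq.enqueue(batch);
}
```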

Refs

@jszwedko jszwedko added the meta: idea and domain: sinks labels Jan 14, 2022
@jerome-kleinen-kbc-be

In addition to the retry behavior (number of attempts, etc.), perhaps the HTTP status codes to retry could be made configurable, of course with sensible defaults. I am struggling to picture how this could look; I suspect the configuration would quickly become difficult to read and understand.

I do like your suggestion to route events to a new queue after having failed x retry attempts.
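
As a rough sketch of how configurable retry statuses might look at the code level (the `RetryStatusConfig` type and its field are made up for illustration, not Vector's actual configuration surface):

```rust
use http::StatusCode;
use std::collections::HashSet;

/// Hypothetical per-sink setting: extra HTTP status codes that should be
/// treated as retryable on top of the defaults.
struct RetryStatusConfig {
    additional_retryable_statuses: HashSet<u16>,
}

impl RetryStatusConfig {
    fn is_retriable_status(&self, status: StatusCode) -> bool {
        // Sensible defaults: retry server errors and 429s.
        status.is_server_error()
            || status == StatusCode::TOO_MANY_REQUESTS
            // User-supplied additions, e.g. {403, 404} for upstreams known
            // to return those transiently.
            || self.additional_retryable_statuses.contains(&status.as_u16())
    }
}
```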

@akutta
Contributor

akutta commented Feb 10, 2022

I could see scenarios where dead-letters are useful; however, I'd like to see it configurable.

When using the Elastic sink ( #10839 ) we recently ran into an issue where the response was:

ERROR sink{component_id=out_aws component_kind="sink" component_type=elasticsearch component_name=out_aws}:request{request_id=196897}: vector::sinks::elasticsearch::retry: Response contained errors. response=Response { status: 200, version: HTTP/1.1, headers: {"date": "Mon, 07 Feb 2022 03:11:13 GMT", "content-type": "application/json; charset=UTF-8", "content-length": "1245855", "connection": "keep-alive", "access-control-allow-origin": "*"}, body: b"{\"errors\":true,\"items\":[{\"index\":{\"_id\":null,\"status\":403,\"error\":{\"type\":\"index_create_block_exception\",\"reason\":\"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];\"}}}

In this case, rebalancing shards across our Elastic Domain resolved the underlying issue. We are indexing at a rate of ~150M/min, so a dead-letter queue becomes less useful. If Vector had instead applied back-pressure and stopped acking entirely, our messages would have been queued upstream.

Note: This was the internal status. The Request status was a 200.
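
For context on why this case is awkward: the failure is only visible inside the bulk response body, so a retry decision has to look at per-item statuses rather than the HTTP status. A rough sketch of that kind of check, using `serde_json` (this is not Vector's actual implementation):

```rust
use serde_json::Value;

/// A bulk request can succeed at the HTTP level (200) while individual items
/// fail. This sketch inspects per-item statuses in the response body and
/// flags the batch as retryable if any item failed with a status that is
/// plausibly transient (such as the 403 index block seen above).
fn bulk_response_should_retry(body: &str) -> bool {
    let parsed: Value = match serde_json::from_str(body) {
        Ok(value) => value,
        // Treat an unparseable body as retryable; real code would log this
        // and probably make the behavior configurable.
        Err(_) => return true,
    };

    if parsed["errors"].as_bool() != Some(true) {
        return false;
    }

    parsed["items"]
        .as_array()
        .map(|items| {
            items.iter().any(|item| {
                let status = item["index"]["status"].as_u64().unwrap_or(0);
                matches!(status, 403 | 429 | 500..=599)
            })
        })
        .unwrap_or(false)
}
```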

jszwedko added a commit that referenced this issue Apr 19, 2022
We intend to expand retried requests to cover a much broader swath
(likely all requests) as part of
#10870 but
#12220 is blocking a user
from trying out Vector so adding these ahead of time.

Fixes: #12220

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
jszwedko added a commit that referenced this issue Apr 20, 2022
* enhancement(datadog provider): Retry forbidden requests

We intend to expand retried requests to cover a much broader swath
(likely all requests) as part of
#10870 but
#12220 is blocking a user
from trying out Vector so adding these ahead of time.

Fixes: #12220

Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
@jszwedko
Member Author

More context in #655

@jszwedko
Member Author

Per #13130 (comment) we should also see if we can encapsulate this logic so it can be shared across HTTP clients.

@jszwedko
Member Author

jszwedko commented Jul 1, 2022

Related: #13414

@kevinpark1217
Contributor

@jszwedko On #13414, you mentioned this work might be picked up in Q3. Is there any plan for this to be implemented?

@jszwedko
Member Author

Hi @kevinpark1217! This is still on our near-term roadmap, but it may not make it in Q3.

@unkempthenry

@jszwedko is this still on the roadmap? Also interested in this for the Splunk sink like in #13414. Splunk cloud will return 404s very intermittently from their load balancers.

@jszwedko
Member Author

@jszwedko is this still on the roadmap? Also interested in this for the Splunk sink like in #13414. Splunk cloud will return 404s very intermittently from their load balancers.

Hey! We plan to write an RFC around this topic this quarter.

@yoelk

yoelk commented Mar 16, 2023

@jszwedko our configuration is Kafka->Filter->S3 with acknowledgements, and like others have mentioned, events are dropped in certain cases. The one case we checked (which was the easiest) is setting the wrong role name in AWS.

I've seen in the description that events would be dropped depending on the HTTP response the sink gets, and whether it's seen as temporary or not.
Until this issue is resolved, I'd like to evaluate the risk for our clients' data, depending on each scenario (wrong password being one of them).

Is there a place where I can see the different cases, and which ones will result in dropped events?
Thanks a lot!

@jszwedko
Member Author

I responded in Discord, but unfortunately doing this sort of survey would mean spelunking through the source code.

@cameronbraid
Contributor

Is this still on the roadmap?

@jszwedko
Member Author

Is this still on the roadmap?

It's definitely still on our radar as a high-impact area to improve, but probably nothing will happen on this before the end of the year given competing priorities. We are open to seeing PRs addressing this for individual sinks though. We've seen some already 🙂

@yoelk

yoelk commented Oct 11, 2023

@jszwedko Can you please point out the PRs you mentioned?
If the effort is not big, we might be interested in creating such PRs for aws_s3 and azure_blob sinks.
Would be great to have a reference for such an effort 🙏

@jszwedko
Member Author

Hi @yoelk ,

That would be great! Thanks for the interest! Here are a couple of examples:

For the two sinks you mentioned it would mean updating:

  • For the aws_s3 sink: vector/src/aws/mod.rs (lines 42 to 53 in 67c4beb)

    ```rust
    pub fn is_retriable_error<T>(error: &SdkError<T>) -> bool {
        match error {
            SdkError::TimeoutError(_) | SdkError::DispatchFailure(_) => true,
            SdkError::ConstructionFailure(_) => false,
            SdkError::ResponseError(err) => check_response(err.raw()),
            SdkError::ServiceError(err) => check_response(err.raw()),
            _ => {
                warn!("AWS returned unknown error, retrying request.");
                true
            }
        }
    }
    ```

    or

    ```rust
    impl RetryLogic for S3RetryLogic {
        type Error = SdkError<PutObjectError>;
        type Response = S3Response;

        fn is_retriable_error(&self, error: &Self::Error) -> bool {
            is_retriable_error(error)
        }
    }
    ```
  • For the azure_blob_storage sink:

    ```rust
    impl RetryLogic for AzureBlobRetryLogic {
        type Error = HttpError;
        type Response = AzureBlobResponse;

        fn is_retriable_error(&self, error: &Self::Error) -> bool {
            error.status().is_server_error()
                || StatusCode::TOO_MANY_REQUESTS.as_u16() == Into::<u16>::into(error.status())
        }
    }
    ```

Hopefully this helps get you going in the right direction. Just let us know if you need some more pointers!
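
As an illustration of the kind of change this issue is asking for, the azure_blob check above could be widened to treat additional statuses as retryable; this is only a sketch of one possible policy, not an agreed-upon change.

```rust
impl RetryLogic for AzureBlobRetryLogic {
    type Error = HttpError;
    type Response = AzureBlobResponse;

    fn is_retriable_error(&self, error: &Self::Error) -> bool {
        let status: u16 = error.status().into();
        // Existing behavior: retry server errors and 429s.
        (500..=599).contains(&status)
            || status == 429
            // Hypothetical additions: statuses that are often transient in
            // practice (expired credentials, load balancer hiccups).
            || status == 403
            || status == 404
    }
}
```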

@jiaozi07

Is this still on the roadmap for the elasticsearch sink?

@suikast42

The same happens if you use the journald source and the nats sink: logs that fail to send to NATS are acked by Vector.

@danielcollishaw

danielcollishaw commented May 9, 2024

Hello @jszwedko! I hope I am not bothering or reviving a dead thread, but is this still on the road map?

I am using Vector to push to S3 in a storage intensive environment with an in memory buffer. We are rotating AWS STS credentials with a 2 hour buffer period to avoid writing with invalid credentials and losing data within the buffer. We would like to bring the credential duration down for security concerns and do away with the buffer period.

We were wondering if retrying a 403 until new credentials are provided, or a retry limit is hit, would help us work towards consistency without committing to a disk buffer? I appreciate any insight and would be happy to help in any way.

@jszwedko
Member Author

jszwedko commented May 9, 2024

Hello @jszwedko! I hope I am not bothering or reviving a dead thread, but is this still on the road map?

I am using Vector to push to S3 in a storage intensive environment with an in memory buffer. We are rotating AWS STS credentials with a 2 hour buffer period to avoid writing with invalid credentials and losing data within the buffer. We would like to bring the credential duration down for security concerns and do away with the buffer period.

We were wondering if retrying a 403 until new credentials are provided, or a retry limit is hit, would help us work towards consistency without committing to a disk buffer? I appreciate any insight and would be happy to help in any way.

No worries! We are still chipping away at this as we go, but we haven't been able to make a concerted effort.

For the AWS S3 sink, specifically, when using STS the AWS SDK should refresh the credentials prior to expiration so you shouldn't see any 403s.

@danielcollishaw

danielcollishaw commented May 9, 2024

For the AWS S3 sink, specifically, when using STS the AWS SDK should refresh the credentials prior to expiration so you shouldn't see any 403s.

Sorry, I should have been a bit more descriptive about why they can expire: we assume the STS role elsewhere and push the returned credentials to the constrained hosts. Sometimes this push can lag out of sync, which is why we are using a buffer period on the duration of the role.

Thank you for the ongoing progress and the update on the roadmap, appreciate the speedy response!

@jszwedko
Member Author

jszwedko commented May 9, 2024

Aha I see. It looks like the AWS components don't currently retry failed credentials requests, but I think it'd be a relatively straightforward change to https://github.com/vectordotdev/vector/blob/master/src/aws/mod.rs if you (or others) are interested.
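
A minimal sketch of the shape such a change could take, kept generic on purpose: `fetch_credentials`, `Credentials`, and `CredentialsError` below are placeholders rather than the AWS SDK's actual types or the real code in src/aws/mod.rs.

```rust
use std::time::Duration;

// Placeholder types standing in for the SDK's credentials machinery.
struct Credentials;
#[derive(Debug)]
struct CredentialsError;

async fn fetch_credentials() -> Result<Credentials, CredentialsError> {
    // In real code this would call the configured credentials provider
    // (STS, IMDS, profile, ...).
    Err(CredentialsError)
}

/// Retry transient credentials failures with exponential backoff instead of
/// surfacing the first error to the sink. Assumes `max_attempts >= 1`.
async fn fetch_credentials_with_retry(
    max_attempts: u32,
) -> Result<Credentials, CredentialsError> {
    let mut delay = Duration::from_millis(200);
    let mut attempt = 1;
    loop {
        match fetch_credentials().await {
            Ok(creds) => return Ok(creds),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay *= 2;
                attempt += 1;
            }
        }
    }
}
```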

@danielcollishaw

I actually have some time next week that I may be able to allocate to this. Will look into the contribution docs!

@coutug

coutug commented Jun 14, 2024

Hey @jszwedko, just hit the issue with 404 response for the http sink. Is there any update with that specific sink?

We currently run a k8s cluster and other nginx servers outside the cluster. I have a vector instance on each nginx server to send its access logs to the k8s cluster's vector instance via the cluster's ingress. That way I get a centralized view of all the logs (nginx and k8s). While testing for points of failure, I realized that everything works except when the k8s vector crashes, since the ingress returns a 404 response to the nginx vector. Any advice for that situation?

@jszwedko
Member Author

Hey @jszwedko, just hit the issue with 404 response for the http sink. Is there any update with that specific sink?

We currently run a k8s cluster and other nginx servers outside the cluster. I have a vector instance on each nginx server to send its access logs to the k8s cluster's vector instance via the cluster's ingress. That way I get a centralized view of all the logs (nginx and k8s). While testing for points of failure, I realized that everything works except when the k8s vector crashes, since the ingress returns a 404 response to the nginx vector. Any advice for that situation?

Would it be possible to have the ingress return a 503 (no upstream available) instead? Otherwise, I think we'd accept a contribution to the http sink to retry 404s.

@coutug

coutug commented Jun 17, 2024

Ah yes, found a way to return the proper status code from the ingress controller. Thanks a lot for the help!

@noble-varghese
Contributor

Hey @jszwedko, I have added an enhancement to the http sink to retry on some of the 4XXs :)

@kutzi

kutzi commented Nov 28, 2024

@jszwedko is this still on the roadmap? Also interested in this for the Splunk sink like in #13414. Splunk cloud will return 404s very intermittently from their load balancers.

We're running Splunk on-prem, and the Splunk HEC endpoint returns 404s for a short while whenever we restart an indexer. We would be happy to see the Splunk HEC sink included here as well.
