Consider expanding the cases where Vector retries requests #10870
Comments
Alongside the retry behavior (number of attempts etc.), perhaps the HTTP status codes to retry could be configurable, of course with sensible defaults. I am struggling to picture how this could look; I guess the configuration would quickly become difficult to read and understand. I do like your suggestion to route events to a new queue after x failed retry attempts. |
I could see scenarios where dead-letters are useful; however, I'd like to see it configurable. When using the Elastic sink ( #10839 ) we recently ran into an issue where the response was:
In this case, rebalancing shards across our Elastic Domain resolved the underlying issue. We are indexing at a rate of ~150M/min so a dead-letter queue becomes less useful. If instead we applied back-pressure and stopped acking entirely, our messages would have been queued upstream. Note: This was the internal status. The Request status was a 200. |
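A minimal sketch of how a client could detect this kind of failure, where the HTTP request itself returns 200 but individual items carry an error status. It assumes the Elasticsearch Bulk API response shape (a top-level `errors` flag and a per-item `status`). The function name `failed_item_indexes` and the sample payload are hypothetical and are not the response from the comment above, nor Vector's implementation.

```python
def failed_item_indexes(bulk_response: dict) -> list:
    """Return positions of items that failed inside a bulk response.

    Assumes the Bulk API shape: {"errors": bool, "items": [{action: {...}}]}.
    """
    # The top-level flag is False when every item succeeded.
    if not bulk_response.get("errors"):
        return []
    failed = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is a single-key dict keyed by the action, e.g. {"index": {...}}.
        for result in item.values():
            if result.get("status", 200) >= 400:
                failed.append(position)
    return failed


# Made-up payload for illustration only.
sample = {
    "errors": True,
    "items": [
        {"index": {"status": 201}},
        {"index": {"status": 429, "error": {"type": "hypothetical_error"}}},
    ],
}
failed = failed_item_indexes(sample)
```

A sink that wanted to apply back-pressure instead of acking could withhold the acknowledgement whenever this list is non-empty and resend only the failed items.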
* enhancement(datadog provider): Retry forbidden requests We intend to expand out retried requests to cover a much broader swath (likely all requests) as part of #10870 but #12220 is blocking a user from trying out Vector so adding these ahead of time. Fixes: #12220 Signed-off-by: Jesse Szwedko <jesse@szwedko.me>
More context in #655 |
Per #13130 (comment) we should also see if we can encapsulate this logic to share it for HTTP clients. |
Related: #13414 |
Hi @kevinpark1217! This is still on our near-term roadmap, but may not make it in Q3. |
@jszwedko our configuration is Kafka->Filter->S3 with acknowledgements, and like others have mentioned, events are dropped in certain cases. The one case we checked (which was the easiest) is setting the wrong role name in AWS. I've seen in the description that events would be dropped depending on the HTTP response the sink gets, and whether it's seen as temporary or not. Is there a place where I can see the different cases, and which ones will result in dropped events? |
I responded in Discord, but unfortunately doing this sort of survey would mean spelunking through the source code. |
Is this still on the roadmap? |
It's definitely still on our radar as a high-impact area to improve, but probably nothing will happen on this before the end of the year given competing priorities. We are open to seeing PRs addressing this for individual sinks though. We've seen some already 🙂 |
@jszwedko Can you please point out the PRs you mentioned? |
Hi @yoelk, that would be great! Thanks for the interest! Here are a couple of examples:
For the two sinks you mentioned it would mean updating:
Hopefully this helps get you going in the right direction. Just let us know if you need some more pointers! |
elasticsearch Is this still on the roadmap? |
The same happens if you use the journal_d source and the nats sink: logs that fail to send to NATS are still acked by Vector. |
Hello @jszwedko! I hope I am not bothering or reviving a dead thread, but is this still on the roadmap? I am using Vector to push to S3 in a storage-intensive environment with an in-memory buffer. We are rotating AWS STS credentials with a 2 hour buffer period to avoid writing with invalid credentials and losing data within the buffer. We would like to bring the credential duration down for security reasons and do away with the buffer period. We were wondering if retrying a 403 until new credentials are provided, or a retry limit is hit, would help us work towards consistency without committing to a disk buffer? I appreciate any insight and would be happy to help in any way. |
No worries! We are still chipping away at this as we go, but we haven't been able to make a concerted effort. For the AWS S3 sink, specifically, when using STS the AWS SDK should refresh the credentials prior to expiration so you shouldn't see any 403s. |
Sorry, I should have been a bit more descriptive about why they can expire: we assume the STS role elsewhere and push the returned credentials to the constrictive hosts. Sometimes this push can lag out of sync, which is why we are using a buffer period on the duration of the role. Thank you for the ongoing progress and the update on the roadmap. I appreciate the speedy response! |
Aha I see. It looks like the AWS components don't currently retry failed credentials requests, but I think it'd be a relatively straightforward change to https://github.com/vectordotdev/vector/blob/master/src/aws/mod.rs if you (or others) are interested. |
I actually have a period of time next week that I may be able to allocate to this. Will look into the contribution docs! |
Hey @jszwedko, just hit the issue with a 404 response for the We currently run a k8s cluster and other nginx servers outside the cluster. I have a Vector instance on each nginx server to send its access logs to the k8s cluster's Vector instance via the cluster's ingress. That way I get a centralized view of all the logs (nginx and k8s). While testing for points of failure, I realized that everything works except when the k8s Vector crashes, since the ingress then returns a 404 response to the nginx Vector. Any advice for that situation? |
Would it be possible to have the ingress return a 503 instead (no upstream available?). Otherwise, I think we'd accept a contribution to the |
Ah yes, found a way to return the proper status code from the ingress controller. Thanks a lot for the help! |
Hey @jszwedko, I have added an enhancement to the http sink to retry on some of the 4XXs :) |
We're running Splunk on-prem and the Splunk HEC endpoint is returning 404s for a short while, whenever we restart an indexer. Would be happy to see the Splunk HEC sink included here as well. |
Community Note
Current behavior
Vector's retry behavior varies by sink, but typically we retry requests whose failures are viewed as temporary and expected to recover. This includes things like HTTP 503s and 429s. It does not include failures that are viewed as non-temporary, like HTTP 403s and 404s. Events in these requests are dropped and Vector continues processing.
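The classification described above can be sketched roughly as follows. This is an illustration only, written in Python rather than taken from Vector's source; the names `RETRIABLE_STATUSES`, `is_retriable`, and `handle_response` are hypothetical, and real sinks each implement their own logic with more cases than the four status codes named here.

```python
# Hypothetical illustration of the behavior described in this issue;
# this is not Vector's actual retry logic.
RETRIABLE_STATUSES = frozenset({429, 503})  # viewed as temporary failures


def is_retriable(status: int) -> bool:
    """True when the status is treated as a temporary failure."""
    return status in RETRIABLE_STATUSES


def handle_response(status: int) -> str:
    """Map an HTTP status to the action taken on the request's events."""
    if 200 <= status < 300:
        return "ack"
    if is_retriable(status):
        return "retry"
    # Everything else (e.g. 403, 404) is viewed as non-temporary:
    # the events are dropped and processing continues.
    return "drop"
```

Under this sketch, a misconfigured credential that yields a 403 falls through to `"drop"`, which is the outcome this issue questions.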
Possible issue
The above means that Vector drops events under circumstances like:
Under these circumstances, I think it is reasonable for users to not want Vector to drop events, but just retry and block processing until a human can intervene and fix the issue.
Idea
One idea would be to retry all failures up until a maximum number of retries after which failed batches would be routed to a dead-letter queue (unimplemented).
I'm curious to hear other thoughts here though. This is a nuanced issue.
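To make the idea concrete, here is a minimal sketch of "retry everything up to a limit, then dead-letter". It is a thought experiment in Python, not a proposed implementation: `send_with_dead_letter`, the `send` callback, and the in-memory `dead_letters` list are all hypothetical, and a real version would add backoff between attempts and a durable destination for failed batches.

```python
from typing import Callable, List


def send_with_dead_letter(
    batch: list,
    send: Callable[[list], bool],
    max_retries: int,
    dead_letters: List[list],
) -> bool:
    """Try to send a batch; after max_retries failed retries, dead-letter it.

    `send` returns True on success and False on any failure, temporary or not.
    Returns True if the batch was delivered, False if it was dead-lettered.
    """
    attempts = 0
    # One initial attempt plus up to `max_retries` retries.
    while attempts <= max_retries:
        if send(batch):
            return True
        attempts += 1
    # Retries exhausted: route the batch aside instead of dropping it.
    dead_letters.append(batch)
    return False
```

The alternative raised in this thread, blocking and applying back-pressure until a human intervenes, corresponds to having no retry limit at all, so the batch is never routed aside and upstream sources stop being acked.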
Refs