feat(kinesis sinks): implement full retry of partial failures in firehose/streams #16771
Conversation
@@ -58,6 +58,13 @@ pub struct KinesisSinkBaseConfig {
    #[serde(default)]
    pub auth: AwsAuthentication,

    /// Whether or not to retry successful requests containing partial failures.
probably want per-sink config. This is in the "base" only
👍 especially if we're only supporting it for streams.
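For illustration, a minimal sketch of what a per-sink option might look like, assuming the flag moves off the shared base onto a streams-specific config struct; the names below (`KinesisStreamsSinkConfig`, `request_retry_partial`) are placeholders, not the actual layout of the sink modules:

```rust
// Hypothetical sketch only: expose the option on the streams-specific config
// rather than on the shared base, since only streams can safely de-dupe retries.
// Struct and field names here are placeholders.

/// Stand-in for the shared options common to both Kinesis sinks.
#[derive(Clone, Debug, Default)]
pub struct KinesisSinkBaseConfig;

#[derive(Clone, Debug, Default)]
pub struct KinesisStreamsSinkConfig {
    /// Options shared with the Firehose sink.
    pub base: KinesisSinkBaseConfig,

    /// Whether or not to retry successful requests containing partial failures.
    /// Defaults to `false` so existing behaviour is unchanged.
    pub request_retry_partial: bool,
}
```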
            let msg = format!("partial error count {}", response.failure_count);
            return RetryAction::Retry(msg.into());
        } else {
            RetryAction::DontRetry("ok".into())
Should these contain anything different? New to the project.
    fn should_retry_response(&self, response: &Self::Response) -> RetryAction {
        if self.retry_partial && response.failure_count > 0 {
            let msg = format!("partial error count {}", response.failure_count);
            return RetryAction::Retry(msg.into());
Fixme: doesn't need a return
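For what it's worth, the check can be written as a single `if`/`else` expression so the early `return` disappears. A minimal sketch with simplified stand-in types (the real `RetryAction` and response types live elsewhere in the codebase):

```rust
// Simplified stand-ins for the real types; only the shape of the logic matters here.
enum RetryAction {
    Retry(String),
    DontRetry(String),
}

struct KinesisResponse {
    failure_count: usize,
}

struct KinesisRetryLogic {
    retry_partial: bool,
}

impl KinesisRetryLogic {
    // The whole body is one expression, so no early `return` is needed.
    fn should_retry_response(&self, response: &KinesisResponse) -> RetryAction {
        if self.retry_partial && response.failure_count > 0 {
            RetryAction::Retry(format!("partial error count {}", response.failure_count))
        } else {
            RetryAction::DontRetry("ok".into())
        }
    }
}
```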
            .map(|output: PutRecordBatchOutput| KinesisResponse {
                count: rec_count,
                failure_count: output.failed_put_count().unwrap_or(0) as usize,
                events_byte_size: 0,
the events size isn't available here. Wasn't sure of the best way to modify this - will think about it. I may just return the failure count for now, and build the KinesisResponse in the Service.
Thanks @jasongoodwin! I'm hoping to have this reviewed by tomorrow, appreciate the pre-review you already provided 😄
Yeah it needs a few things likely - but what I'm really hoping to accomplish is partial retry - for firehose this PR could potentially create a lot of duplication, which is a risk. Streams is okay as it can deduplicate. If you can give some insight into how I might implement partial retry, I would defo appreciate it.
I'll try to clean this up a bit over the weekend after reviewing/thinking about it a bit. I closed the related PR, #16703.
I failed to have this reviewed, but I'll do my best to leave my thoughts before you take another look this weekend.
It's okay - I can see some things to clean up after sitting on it. Only thing really at this point is:
- what do you think about the configuration?
- should this even be implemented for firehose? What's the risk of duplication and egregious expense on a lot of retries?
Sorry for the delay - I'm planning on dedicating a chunk of time to this on Wednesday.
- what do you think about the configuration?
I think the configuration is fine. If we don't implement it for firehose we can drop the note, and if there is a suggestion for how to protect against duplicates for streams - we could include that suggestion (as we do for Elasticsearch).
- should this even be implemented for firehose? What's the risk of duplication and egregious expense on a lot of retries?
I think if there's no way for firehose to de-dupe these we should only implement it for streams (which can de-dupe?).
    fn should_retry_response(&self, response: &Self::Response) -> RetryAction {
        if self.retry_partial && response.failure_count > 0 {
            let msg = format!("partial error count {}", response.failure_count);
It would be nice to include the error type and reason if we can pull that out of the response reasonably.
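For example, if the response were extended to carry the first per-record error pulled out of the AWS reply, the retry message could include it. The `first_error` field below is hypothetical, not part of the current `KinesisResponse`:

```rust
// Hypothetical sketch: a richer retry message, assuming the response carried
// the first per-record error code/reason extracted from the AWS output.
struct KinesisResponse {
    failure_count: usize,
    first_error: Option<String>, // hypothetical field, e.g. "ProvisionedThroughputExceededException"
}

fn retry_message(response: &KinesisResponse) -> String {
    match &response.first_error {
        Some(reason) => format!(
            "partial error count {}, first error: {}",
            response.failure_count, reason
        ),
        None => format!("partial error count {}", response.failure_count),
    }
}
```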
@@ -1,3 +1,5 @@
use crate::sinks::aws_kinesis::KinesisResponse;
This could be moved into the `use super::` line below.
        .map(|output: PutRecordsOutput| KinesisResponse {
            count: rec_count,
            failure_count: output.failed_record_count().unwrap_or(0) as usize,
            events_byte_size: 0,
        })
It definitely feels better to me to do this in the service.rs
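A rough sketch of that idea, with the client returning only what it can observe (the failure count) and the service assembling the full `KinesisResponse` where the record count and byte size are actually known. The names and types are simplified stand-ins, not the code that was eventually merged:

```rust
// Hypothetical sketch: build the response in the service layer instead of in
// the client's `.map(...)`, since only the service knows the request's size.
struct KinesisResponse {
    count: usize,
    failure_count: usize,
    events_byte_size: usize,
}

/// What the client could return: just the fields observable from the AWS call.
struct ClientOutcome {
    failure_count: usize,
}

fn build_response(outcome: ClientOutcome, rec_count: usize, request_bytes: usize) -> KinesisResponse {
    KinesisResponse {
        count: rec_count,
        failure_count: outcome.failure_count,
        events_byte_size: request_bytes,
    }
}
```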
Great, thanks for the review! I'll have to rebuild some context to fix this up, but I think we can get this over the line.
Hey @jasongoodwin - just noticed this was still hanging around, wanted to check in and see how things were going.
…hose/streams (#17535)
This PR is from the #16771 PR. Refactor some action checking.
closes: #17424
---------
Signed-off-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>
Co-authored-by: Jason Goodwin <jgoodwin@bluecatnetworks.com>
Co-authored-by: Jason Goodwin <jay.michael.goodwin@gmail.com>
Co-authored-by: Spencer Gilbert <spencer.gilbert@datadoghq.com>
Superseded by #17535
I have to test this, but looks okay. It's a little different than the ES implementation, but the change is fairly minimal.
Only wart is the way that the request size is "augmented" into the KinesisResponse after the call to `call`. Hopefully it looks alright - feel free to drop any feedback and I'll fix it up.
I think per-sink config for each of streams/firehose is likely necessary to release this.
I reviewed/marked up the PR to help draw attention to these items and clarify.
--------- Further discussion/tickets -------
Handling of partial failures for firehose is open here: #359
Note that this issue covers just the whole-batch retry, similar to what ES does (#140).
I'd happily talk to someone about how I might improve this to handle only the partial failures (@decklyndubs on Discord - I'm in your server). I have to look a little deeper, but my fear in doing that is that a record gets indefinitely stuck when it should be dropped, so in my mind there may need to be some separate retry policy in the sink. This design issue is broad and relates to multiple sinks, such as ES, which also has partial failures.
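To make that concern concrete, here is a very rough sketch of one shape such a policy could take: retry only the failed records, with a per-batch attempt budget so a poison record is eventually dropped instead of being retried forever. This is only an illustration of the idea, not anything from the PR:

```rust
// Hypothetical sketch of a bounded partial-retry policy. All names are illustrative.
const MAX_PARTIAL_RETRIES: usize = 3;

enum NextStep<R> {
    /// Everything was accepted; nothing more to do.
    Done,
    /// Retry only the records that failed.
    Retry(Vec<R>),
    /// Budget exhausted: drop the remaining failures (and report them as errors).
    Drop(Vec<R>),
}

fn next_step<R>(attempts_so_far: usize, failed: Vec<R>) -> NextStep<R> {
    if failed.is_empty() {
        NextStep::Done
    } else if attempts_so_far >= MAX_PARTIAL_RETRIES {
        NextStep::Drop(failed)
    } else {
        NextStep::Retry(failed)
    }
}
```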
Some other related tickets.
#7659
#9451
#9861