[Filebeat] Add timeout to GetObjectRequest for s3 input #15590
Conversation
I left a few comments, I understand you are still working on testing this one, right?
I see it is in progress, but I was taking a look and I have a couple of questions. The new context inputCtx with its goroutine for cancellation doesn't smell right to me; why wasn't the previous p.context enough?
after the default 2-minute timeout is hit, this specific S3 object will be skipped and the SQS message will return to the queue later, so Filebeat can try to read it again later.
Does it mean that for big objects that can take more than 2 minutes to download, the timeout is hit and then Filebeat retries with the same object? Does Filebeat keep the last offset or something so it doesn't keep continuously retrying?
Yes, @jsoriano I haven't seen any big object take even 1 minute to download. I think the problem seen in the issue is caused by a resource leak from not closing resp.Body.
The connection reset error happened when making the GetObjectRequest API call, which is one step before actually reading the log file. So if that failed, the SQS message goes back into the queue and the same S3 object will be retried with GetObjectRequest later, after the visibility timeout is done.
Oh ok, so then this timeout and the retries don't affect the actual download of the log file, right? If that is the case then it LGTM.
Thanks for addressing all the changes!
@kaiyan-sheng Since the issue this fixes is labeled as a bug, should this fix be backported to 7.5?
@ycombinator Yes, I agree, just in case there's a 7.5.3. I will create the backport right now. Thank you!
* Add timeout to GetObjectRequest which will cancel the request if it takes too long
* Close resp.Body from S3 GetObject API to prevent resource leak
* Change aws_api_timeout to api_timeout

(cherry picked from commit 86c3e63)
[Filebeat] Add timeout to GetObjectRequest for s3 input (#15908)

* [Filebeat] Add timeout to GetObjectRequest for s3 input (#15590)
* Add timeout to GetObjectRequest which will cancel the request if it takes too long
* Close resp.Body from S3 GetObject API to prevent resource leak
* Change aws_api_timeout to api_timeout (cherry picked from commit 86c3e63)
* update changelog
* Add default value in manifest.yml
Problem we see when using the s3 input:

When using the s3 input to read logs from an S3 bucket, after a while with a high amount of logs, a read: connection reset by peer error showed up. This error is triggered by the reader.ReadString function. Then processorKeepAlive found that processMessage was taking too long to run, longer than half of the set visibility timeout, so the changeVisibilityTimeout function kept getting called repeatedly.

This PR adds a timeout to the GetObjectRequest API call, using the context pattern to implement timeout logic that cancels the request if it takes too long. This way, after the default 2-minute timeout is hit, this specific S3 object is skipped and the SQS message returns to the queue later, so Filebeat can try to read it again later.
I decided to add a config option called context_timeout for the s3 input because, based on your visibility_timeout value, context_timeout can be as large as half of visibility_timeout. This will allow users to modify both timeout values when using the s3 input or the Filebeat aws module with larger S3 objects or smaller network bandwidth.

Closes #15502
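Assuming the option ended up named api_timeout (the commit messages above mention renaming aws_api_timeout to api_timeout), the two related settings might be configured together roughly like this; the queue URL is a placeholder and the values are illustrative:

```yaml
filebeat.inputs:
  - type: s3
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
    # How long a received SQS message stays hidden from other consumers.
    visibility_timeout: 300s
    # Per-object request timeout; should be at most half of visibility_timeout.
    api_timeout: 120s
```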