
[Filebeat] [S3 input] Failed processing of object leads to duplicated events #19901

Closed
kwinstonix opened this issue Jul 14, 2020 · 8 comments

Labels: enhancement, Team:Platforms (Integrations - Platforms team)

@kwinstonix
Contributor

kwinstonix commented Jul 14, 2020

Describe the enhancement:

Sometimes there are error logs about failed processing of an object, and the SQS message is put back into the SQS queue. By then some lines of the object have already been forwarded to the output, so when the object is processed again there are duplicated docs in Elasticsearch. Is there any way of keeping track of the object offset? When the object is processed multiple times, we could start from the last offset to avoid duplicated docs in Elasticsearch. This is just like the behavior when reading log files.

2020-07-14T21:59:12.633+0800	ERROR	[s3]	s3/input.go:491	ReadString failed: context deadline exceeded
2020-07-14T21:59:12.633+0800	ERROR	[s3]	s3/input.go:395	createEventsFromS3Info failed processing file from s3 bucket "my-bucket-test" with name "xxxxxxxxxxxx.log.gz": ReadString failed: context deadline exceeded

Describe a specific use case for the enhancement or feature:

WHEN: processing fails or Filebeat shuts down
TO DO: keep track of the object offset to avoid duplicated docs in Elasticsearch

Some feasible solutions (see the sketch after this list):

  • calculate a hash of each log line of the object and set the Elasticsearch doc _id to that hash value, but this impacts write performance
  • when processing fails or is interrupted, store the offset in the S3 object metadata, e.g. x-amz-meta-filebeat-offset=$successed_processing_offset
  • Filebeat checks that metadata on the object and starts reading from the $successed_processing_offset position
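
To make the second and third bullets concrete, here is a minimal sketch of the metadata idea using the AWS SDK for Go v1 (my own choice for illustration; `saveOffset`/`loadOffset` are hypothetical helpers, not Filebeat code). One caveat: S3 metadata can only be changed by copying the object onto itself, so writing the marker rewrites the whole object server-side, which is not free for large files.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// saveOffset stores the last successfully processed byte offset on the object
// itself by copying the object onto itself with MetadataDirective=REPLACE.
func saveOffset(svc *s3.S3, bucket, key string, offset int64) error {
	_, err := svc.CopyObject(&s3.CopyObjectInput{
		Bucket:            aws.String(bucket),
		Key:               aws.String(key),
		CopySource:        aws.String(bucket + "/" + key),
		MetadataDirective: aws.String("REPLACE"),
		Metadata: map[string]*string{
			// Stored by S3 as x-amz-meta-filebeat-offset.
			"filebeat-offset": aws.String(strconv.FormatInt(offset, 10)),
		},
	})
	return err
}

// loadOffset reads the marker back before (re)processing the object; a missing
// marker means "start from the beginning".
func loadOffset(svc *s3.S3, bucket, key string) (int64, error) {
	out, err := svc.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return 0, err
	}
	for k, v := range out.Metadata {
		// The SDK may change the casing of metadata keys, so compare case-insensitively.
		if strings.EqualFold(k, "filebeat-offset") && v != nil {
			return strconv.ParseInt(*v, 10, 64)
		}
	}
	return 0, nil
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	if err := saveOffset(svc, "my-bucket-test", "xxxxxxxxxxxx.log.gz", 4096); err != nil {
		log.Fatal(err)
	}
	offset, err := loadOffset(svc, "my-bucket-test", "xxxxxxxxxxxx.log.gz")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("resume reading from offset", offset)
}
```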

It is important to keep track of the s3 object offset that has been processed successfully, because an SQS message could be processed by multiple Filebeat instances.

filebeat doc:

Multiple Filebeat instances can read from the same SQS queues at the same time. To horizontally scale processing when there are large amounts of log data flowing into an S3 bucket, you can run multiple Filebeat instances that read from the same SQS queues at the same time. No additional configuration is required.
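
For context, the redelivery described above is the standard SQS at-least-once pattern: the consumer deletes the message only after fully processing the object, otherwise the message becomes visible again after its visibility timeout and may be picked up by any instance, which re-sends lines that were already shipped. A rough illustration with aws-sdk-go v1 (not Filebeat's actual input code; `processObject` and the queue URL are placeholders):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// processObject stands in for downloading and shipping the S3 object named in
// the notification; it is a hypothetical placeholder for this sketch.
func processObject(body string) error {
	log.Println("processing notification:", body)
	return nil
}

func main() {
	svc := sqs.New(session.Must(session.NewSession()))
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue" // placeholder

	for {
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(1),
			WaitTimeSeconds:     aws.Int64(20),
			// While a message is in flight it is hidden from other consumers.
			VisibilityTimeout: aws.Int64(300),
		})
		if err != nil {
			log.Fatal(err)
		}
		for _, msg := range out.Messages {
			if err := processObject(aws.StringValue(msg.Body)); err != nil {
				// No delete: the message reappears after the visibility timeout
				// and may be picked up by another instance, so any events
				// already shipped for this object are sent again.
				log.Println("processing failed, message will be redelivered:", err)
				continue
			}
			// Delete only after the whole object was processed successfully.
			if _, err := svc.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				log.Println("delete failed:", err)
			}
		}
	}
}
```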

@botelastic botelastic bot added the needs_team label Jul 14, 2020
@andresrc andresrc added the enhancement and Team:Platforms labels Jul 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@botelastic botelastic bot removed the needs_team label Jul 15, 2020
@kaiyan-sheng
Contributor

Hi @kwinstonix, thanks for creating this issue. Do you know if there is a way to reproduce this problem? Looking at the code, we are setting the eventID based on the objectHash and offset (https://github.com/elastic/beats/blob/master/x-pack/filebeat/input/s3/input.go#L638). When a file fails to process, the SQS message should go back to the queue and then the whole/same file should be re-processed later. This way, I think (in theory) the objectHash and offset should both be the same as last time. This is how we are thinking of solving the duplicate events issue. With what you are seeing, does that mean either the objectHash or the offset value for an individual log entry changed the second time this same file was processed? TIA!!
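
As an aside, a simplified illustration of that scheme (this is not the exact code at input.go#L638, just the idea that the same object plus the same offset always yields the same _id, so a retry overwrites rather than duplicates):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// eventID derives the document _id from the object's identity plus the byte
// offset of the line, so re-processing the same object produces the same IDs
// and Elasticsearch overwrites instead of duplicating. Simplified stand-in for
// the logic in x-pack/filebeat/input/s3/input.go.
func eventID(bucket, key string, offset int64) string {
	objectHash := sha256.Sum256([]byte(bucket + "/" + key))
	return fmt.Sprintf("%x-%012d", objectHash[:10], offset)
}

func main() {
	// Same object and offset on a retry -> same _id both times.
	fmt.Println(eventID("my-bucket-test", "xxxxxxxxxxxx.log.gz", 0))
	fmt.Println(eventID("my-bucket-test", "xxxxxxxxxxxx.log.gz", 0))
}
```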

@kwinstonix
Contributor Author

kwinstonix commented Jul 15, 2020

Thanks for the reply. I see the event _id in the Filebeat event, but I use the Kafka output instead of Elasticsearch, and I cannot control the Kafka consumer's Logstash config, so I can't set the doc _id in Elasticsearch. On the other hand, using a self-generated doc _id impacts Elasticsearch performance. The hash _id is a solution indeed, I just wonder whether there is a way to set the offset of the s3 object 🤔

@kwinstonix
Contributor Author

The offset position is recorded in the registry only when the input is a local file. Is that right?

@kaiyan-sheng
Contributor

Ahh, I think I found a bug in the code regarding the offset. Will fix that soon!

@kaiyan-sheng kaiyan-sheng self-assigned this Jul 15, 2020
@kaiyan-sheng
Contributor

This should be fixed as a part of #19962; it's still in progress and needs more testing.

@kaiyan-sheng
Contributor

Hi @kwinstonix, #19962 is merged into the master branch. I think that will fix this issue. I will close this issue for now; if you get a chance to test it and are still seeing duplicate events, please feel free to open a new issue or reopen this one! Thank you!!

@kaiyan-sheng
Contributor

More fixes for the offset: #20370
