
[Filebeat] [S3 input] Failed processing of object leads to duplicated events #19901

Closed
kwinstonix opened this issue Jul 14, 2020 · 8 comments

Labels: enhancement, Team:Platforms (Integrations - Platforms team)

@kwinstonix
Contributor

kwinstonix commented Jul 14, 2020

Describe the enhancement:

Sometimes there are error logs about failed processing of an object, and the SQS message is put back into the SQS queue. By then some lines of the object have already been forwarded to the output, so when the object is processed again there are duplicated docs in Elasticsearch. Is there any way of keeping track of the object offset? When the object is processed multiple times, we could start from the last offset to avoid duplicated docs in Elasticsearch. This is just like the behavior when reading log files.

2020-07-14T21:59:12.633+0800	ERROR	[s3]	s3/input.go:491	ReadString failed: context deadline exceeded
2020-07-14T21:59:12.633+0800	ERROR	[s3]	s3/input.go:395	createEventsFromS3Info failed processing file from s3 bucket "my-bucket-test" with name "xxxxxxxxxxxx.log.gz": ReadString failed: context deadline exceeded

Describe a specific use case for the enhancement or feature:

WHEN: processing fails or Filebeat shuts down
TO DO: keep track of the object offset to avoid duplicated docs in Elasticsearch

Some feasible solutions (see the sketch after this list):

  • calculate a hash of each log line of the object and set the Elasticsearch doc _id to that hash value, but this impacts write performance
  • when processing fails or is interrupted, store the offset in the S3 object metadata, e.g. x-amz-meta-filebeat-offset=$successed_processing_offset
  • Filebeat checks that metadata on the object and starts reading from the $successed_processing_offset position
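
To make the second and third bullets concrete, here is a minimal sketch of the metadata idea using the AWS SDK for Go v1 (my own choice for illustration; `saveOffset`/`loadOffset` are hypothetical helpers, not Filebeat code). One caveat: S3 metadata can only be changed by copying the object onto itself, so writing the marker rewrites the whole object server-side, which is not free for large files.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// saveOffset stores the last successfully processed byte offset on the object
// itself by copying the object onto itself with MetadataDirective=REPLACE.
func saveOffset(svc *s3.S3, bucket, key string, offset int64) error {
	_, err := svc.CopyObject(&s3.CopyObjectInput{
		Bucket:            aws.String(bucket),
		Key:               aws.String(key),
		CopySource:        aws.String(bucket + "/" + key),
		MetadataDirective: aws.String("REPLACE"),
		Metadata: map[string]*string{
			// Stored by S3 as x-amz-meta-filebeat-offset.
			"filebeat-offset": aws.String(strconv.FormatInt(offset, 10)),
		},
	})
	return err
}

// loadOffset reads the marker back before (re)processing the object; a missing
// marker means "start from the beginning".
func loadOffset(svc *s3.S3, bucket, key string) (int64, error) {
	out, err := svc.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return 0, err
	}
	for k, v := range out.Metadata {
		// The SDK may change the casing of metadata keys, so compare case-insensitively.
		if strings.EqualFold(k, "filebeat-offset") && v != nil {
			return strconv.ParseInt(*v, 10, 64)
		}
	}
	return 0, nil
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	if err := saveOffset(svc, "my-bucket-test", "xxxxxxxxxxxx.log.gz", 4096); err != nil {
		log.Fatal(err)
	}
	offset, err := loadOffset(svc, "my-bucket-test", "xxxxxxxxxxxx.log.gz")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("resume reading from offset", offset)
}
```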

It is important to keep track of the s3 object offset that has been processed successfully, because an SQS message could be processed by multiple Filebeat instances.

filebeat doc:

Multiple Filebeat instances can read from the same SQS queues at the same time. To horizontally scale processing when there are large amounts of log data flowing into an S3 bucket, you can run multiple Filebeat instances that read from the same SQS queues at the same time. No additional configuration is required.
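
For context, the redelivery described above is the standard SQS at-least-once pattern: the consumer deletes the message only after fully processing the object, otherwise the message becomes visible again after its visibility timeout and may be picked up by any instance, which re-sends lines that were already shipped. A rough illustration with aws-sdk-go v1 (not Filebeat's actual input code; `processObject` and the queue URL are placeholders):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// processObject stands in for downloading and shipping the S3 object named in
// the notification; it is a hypothetical placeholder for this sketch.
func processObject(body string) error {
	log.Println("processing notification:", body)
	return nil
}

func main() {
	svc := sqs.New(session.Must(session.NewSession()))
	queueURL := "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue" // placeholder

	for {
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(1),
			WaitTimeSeconds:     aws.Int64(20),
			// While a message is in flight it is hidden from other consumers.
			VisibilityTimeout: aws.Int64(300),
		})
		if err != nil {
			log.Fatal(err)
		}
		for _, msg := range out.Messages {
			if err := processObject(aws.StringValue(msg.Body)); err != nil {
				// No delete: the message reappears after the visibility timeout
				// and may be picked up by another instance, so any events
				// already shipped for this object are sent again.
				log.Println("processing failed, message will be redelivered:", err)
				continue
			}
			// Delete only after the whole object was processed successfully.
			if _, err := svc.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			}); err != nil {
				log.Println("delete failed:", err)
			}
		}
	}
}
```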

@botelastic botelastic bot added the needs_team label Jul 14, 2020
@andresrc andresrc added the enhancement and Team:Platforms labels Jul 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@botelastic botelastic bot removed the needs_team label Jul 15, 2020
@kaiyan-sheng
Contributor

Hi @kwinstonix, thanks for creating this issue. Do you know if there is a way to reproduce this problem? Looking at the code, we are setting the eventID based on the objectHash and offset (https://github.com/elastic/beats/blob/master/x-pack/filebeat/input/s3/input.go#L638). When a file fails to process, the SQS message should go back to the queue and then the whole/same file should be re-processed later. This way, I think (in theory) the objectHash and offset should both be the same as last time. This is how we are thinking of solving the duplicate events issue. With what you are seeing, does that mean either the objectHash or the offset value for an individual log entry changed the second time this same file was processed? TIA!!
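
As an aside, a simplified illustration of that scheme (this is not the exact code at input.go#L638, just the idea that the same object plus the same offset always yields the same _id, so a retry overwrites rather than duplicates):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// eventID derives the document _id from the object's identity plus the byte
// offset of the line, so re-processing the same object produces the same IDs
// and Elasticsearch overwrites instead of duplicating. Simplified stand-in for
// the logic in x-pack/filebeat/input/s3/input.go.
func eventID(bucket, key string, offset int64) string {
	objectHash := sha256.Sum256([]byte(bucket + "/" + key))
	return fmt.Sprintf("%x-%012d", objectHash[:10], offset)
}

func main() {
	// Same object and offset on a retry -> same _id both times.
	fmt.Println(eventID("my-bucket-test", "xxxxxxxxxxxx.log.gz", 0))
	fmt.Println(eventID("my-bucket-test", "xxxxxxxxxxxx.log.gz", 0))
}
```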

@kwinstonix
Contributor Author

kwinstonix commented Jul 15, 2020

Thanks for the reply. I see the event _id in the Filebeat event, but I use the Kafka output instead of Elasticsearch, and I cannot control the Kafka consumer's Logstash config, so I can't set the doc _id in Elasticsearch. On the other hand, using a self-generated doc _id impacts Elasticsearch performance. The hash _id is a solution indeed, I just wonder whether there is a way to set the offset of the s3 object 🤔

@kwinstonix
Contributor Author

The offset position is recorded in the registry only when the input is a local file. Is that right?

@kaiyan-sheng
Contributor

Ahh, I think I found a bug in the code regarding the offset. Will fix that soon!

@kaiyan-sheng kaiyan-sheng self-assigned this Jul 15, 2020
@kaiyan-sheng
Contributor

This should be fixed as a part of #19962; it's still in progress and needs more testing.

@kaiyan-sheng
Contributor

Hi @kwinstonix, #19962 is merged into the master branch. I think that will fix this issue. I will close this issue for now; if you get a chance to test it and are still seeing duplicate events, please feel free to open a new issue or reopen this one! Thank you!!

@kaiyan-sheng
Contributor

More fixes for the offset: #20370
