High io consumption after sudden filebeat stop #35893
Comments
We are seeing the same issue.
@elastic/obs-dc can anyone help here?
In case of a corrupted log file (which is likely after a sudden, unclean system shutdown), we set a flag which causes us to checkpoint immediately, but we never do anything else besides that. This causes Filebeat to checkpoint on every log operation (causing high IO load on the server and also causing Filebeat to fall behind). This change resets the logInvalid flag after a successful checkpoint.
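To make the failure mode concrete, here is a minimal Go sketch of the pattern described above. It is not the actual libbeat/statestore/backend/memlog/diskstore.go code; the struct, method names, and logic are simplified for illustration, keeping only the logInvalid flag named in the fix.

```go
// Illustrative sketch (not the real libbeat code): a simplified store that
// marks its append-only log as invalid and falls back to full checkpoints.
package main

import "fmt"

type diskStore struct {
	logInvalid   bool // set when the on-disk log is found to be corrupted
	checkpointed int  // counts full-state checkpoints, for demonstration
}

// logOperation appends an operation to the log, or checkpoints the full
// state if the log has been marked invalid.
func (s *diskStore) logOperation(op string) {
	if s.logInvalid {
		// Before the fix: checkpoint, but leave logInvalid set, so every
		// subsequent operation checkpoints again (high IO, falling behind).
		s.checkpoint()
		// The fix: once a checkpoint succeeded, the on-disk state is
		// consistent again, so normal log appends can resume.
		s.logInvalid = false
		return
	}
	fmt.Println("append to log:", op)
}

func (s *diskStore) checkpoint() {
	s.checkpointed++
	fmt.Println("full checkpoint #", s.checkpointed)
}

func main() {
	s := &diskStore{logInvalid: true} // corrupted log detected at startup
	s.logOperation("set key=a")       // triggers one checkpoint, then recovers
	s.logOperation("set key=b")       // appended to the log as usual
}
```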
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Hey folks, thanks for finding this bug and proposing a fix! Looking at the code, I can see it is indeed a bug. Restarting Filebeat should bring it back into a consistent state; while not perfect, that is at least a workaround.
The fix (cherry-picked from commit 217f5a6) was merged in #39392 and backported in #39842 and #39795, resolving a conflict in libbeat/statestore/backend/memlog/diskstore.go. Co-authored-by: emmanueltouzery <etouzery@gmail.com>, Tiago Queiroz <tiago.queiroz@elastic.co>, Pierre HILBERT <pierre.hilbert@elastic.co>.
Hi! I tried to ask on discuss.elastic.co but got no answer.
The problem is very high IO after a sudden termination of Filebeat. The cause is a checkpoint action on every log operation: the logInvalid flag is set to true after the initial log read fails. After an abnormal termination of Filebeat, the log may be in an inconsistent state, and reading such a log can produce an error like:
Incomplete or corrupted log file in /usr/share/filebeat/data/registry/filebeat. Continue with last known complete and consistent state. Reason: invalid character '\\x00' looking for beginning of value
After that, Filebeat clears the log file, but still does not try to write to it, and just makes checkpoint after checkpoint.
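The registry log holds JSON-encoded entries, and a zero-filled tail (common after an unclean shutdown) starts with NUL bytes, which the JSON decoder rejects with the error quoted above. Here is a minimal Go reproduction of that decoder error using only the standard encoding/json package; this is not Filebeat code, just an illustration of where the message comes from.

```go
// Minimal reproduction of the quoted parse error: decoding data that begins
// with NUL bytes fails with "invalid character '\x00' ...".
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	corrupted := []byte("\x00\x00\x00\x00") // stands in for a zero-filled log entry
	var v interface{}
	if err := json.Unmarshal(corrupted, &v); err != nil {
		fmt.Println("decode error:", err)
		// decode error: invalid character '\x00' looking for beginning of value
	}
}
```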