
Properly update offset in case of unparsable line #22685

Merged · 8 commits · Apr 1, 2021

Conversation

@ChrsMark (Member) commented Nov 20, 2020

What does this PR do?

This PR fixes the case where the docker reader meets an unparsable line and skips it (introduced in #12268). In such cases we should properly update the offset by adding the skipped bytes, so that it points at the right byte.

Why is it important?

Having a wrong offset in the registry will make the harvester start from the wrong offset when Filebeat restarts or files are reopened, which will lead to another ErrLineUnparsable. The offset will never heal from then on.

Testing notes

  1. make python-env
  2. source ./build/python-env/bin/activate
  3. make filebeat.test
  4. pytest tests/system/test_container.py

@ChrsMark added the bug, Team:Platforms (Integrations - Platforms team), and v7.11.0 labels Nov 20, 2020
@ChrsMark ChrsMark self-assigned this Nov 20, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-platforms (Team:Platforms)

@elasticmachine (Collaborator) commented Nov 20, 2020

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #22685 updated

  • Start Time: 2021-03-31T21:40:08.143+0000

  • Duration: 56 min 16 sec

  • Commit: e59afad

Test stats 🧪

  • Failed: 0

  • Passed: 46514

  • Skipped: 5132

  • Total: 51646

Trends 🧪: charts of build times and test counts (images).

💚 Flaky test report

Tests succeeded.


@jsoriano (Member) left a comment:

Good catch. This will need to be backported to 7.10 and 6.8 too.

@@ -333,7 +333,8 @@ func (h *Harvester) Run() error {
 			logp.Info("File is inactive: %s. Closing because close_inactive of %v reached.", h.state.Source, h.config.CloseInactive)
 		case reader.ErrLineUnparsable:
 			logp.Info("Skipping unparsable line in file: %v", h.state.Source)
-			//line unparsable, go to next line
+			//line unparsable, update offset and go to next line
+			h.state.Offset += int64(message.Bytes)
Review comment (Member):

Should we also update the read-offset metric, as done some lines below?

// Update metrics of harvester as event was sent
h.metrics.readOffset.Set(state.Offset)
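
A hedged sketch of what keeping the two in sync could look like in the ErrLineUnparsable branch (a fragment building on the diff above, not the merged code; whether readOffset should reflect skipped bytes is exactly the question being asked):

case reader.ErrLineUnparsable:
	logp.Info("Skipping unparsable line in file: %v", h.state.Source)
	// line unparsable, update offset and go to next line
	h.state.Offset += int64(message.Bytes)
	// Keep the read-offset metric in sync with the registry offset,
	// mirroring what the code does after publishing an event.
	h.metrics.readOffset.Set(h.state.Offset)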

Reply (Contributor):

@urso it would be good to have your input here

Reply:

Yes, these two should always be kept in sync.

@exekias (Contributor) commented Nov 20, 2020

Great work @ChrsMark! Would it be possible to add some unit tests in the harvester code?

@ChrsMark (Member, Author) commented Nov 23, 2020

> Great work @ChrsMark! Would it be possible to add some unit tests in the harvester code?

Harvester's tests so far are not really detailed (https://github.com/exekias/beats/blob/7007d97c6aadc4621d58d9d3122dd8e0c5115bb5/filebeat/input/log/harvester_test.go#L127), so it would take some effort to add unit tests from scratch to cover this case. I think we can rely on system tests. WDYT?

@urso commented Nov 23, 2020

> Having a wrong offset in the registry will make the harvester start from the wrong offset when Filebeat restarts or files are reopened, which will lead to another ErrLineUnparsable. The offset will never heal from then on.

Why doesn't Filebeat recover here? Do you have a link to an issue?

(Not asking to clean this up now, but wondering if we have some debt to open an issue for:)
The harvester's purpose is to set up the reader on restart, and to receive and publish events from the reader. Would the processing be more stable if the reader handled the "unparsable" case instead (drop the event and continue parsing)? In that case the offset would still be old and the reader would have to skip the unparsable contents on restart again, but the registry might be in a more predictable state, right?

@ChrsMark (Member, Author) commented:

@urso Filebeat will recover and will be able to parse lines after the first failure; however, the offset will remain inconsistent, so every time harvesting restarts, the first parse will fail, since it starts from the wrong offset and what it reads is neither a valid CRI nor a valid docker-json line.

We don't have an issue for this; I can open one if you want. We just found it with @exekias while investigating an SDH (see link above).

@exekias (Contributor) commented Nov 24, 2020

The issue happens like this:

For some reason we end up at the wrong offset; we still need to investigate why that happens (maybe a truncate on a symlink?).

Problem is that once you reach this state, the current harvester code won't heal from it. When the input reads the next line, it cannot parse it because it is at the wrong offset, but message.Bytes is still updated. When we handle this error we don't update the offset with the new Bytes value, so every further offset update is wrong, as it missed the update from the broken line.

The problem happens again as soon as you restart the input.
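
To make the failure mode concrete, here is an illustrative walkthrough of the pre-fix bookkeeping with made-up line sizes (not taken from the actual case):

package main

import "fmt"

func main() {
	// Each entry is one line in the log file: its size on disk and
	// whether the reader can parse it.
	lines := []struct {
		bytes    int64
		parsable bool
	}{
		{100, true}, // valid line
		{50, false}, // unparsable line (e.g. after a truncate)
		{80, true},  // valid line again
	}
	var registryOffset, filePosition int64
	for _, l := range lines {
		filePosition += l.bytes
		if !l.parsable {
			continue // the bug: skipped bytes never reach the registry offset
		}
		registryOffset += l.bytes
	}
	fmt.Printf("registry offset=%d, file position=%d\n", registryOffset, filePosition)
	// Prints: registry offset=180, file position=230. On restart the
	// harvester seeks to 180, the middle of the last line, so the first
	// read is unparsable again and the drift never heals.
}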

@urso commented Nov 26, 2020

> Problem is that once you reach this state, the current harvester code won't heal from it. When the input reads the next line, it cannot parse it because it is at the wrong offset, but message.Bytes is still updated. When we handle this error we don't update the offset with the new Bytes value, so every further offset update is wrong, as it missed the update from the broken line.

I see. Actually, the reader is required to only return valid messages. If we need to update the offset based on skipped contents, we may have to reconsider the return type of the Reader interface so that it also reports the number of bytes consumed to produce the message. Actually, message.Bytes was supposed to be this: the number of bytes consumed from the file. For example, after converting from JIS to UTF-8, the length of the contents and the value of message.Bytes differ (there are other cases where they differ as well).
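
For context, the reader's message type looks roughly like this (paraphrased from libbeat's reader package; the exact field set and comments may differ between versions):

import (
	"time"

	"github.com/elastic/beats/v7/libbeat/common"
)

// Message is a paraphrase of libbeat/reader.Message, not a verbatim copy.
type Message struct {
	Ts      time.Time     // timestamp of the message
	Content []byte        // decoded content, e.g. after conversion to UTF-8
	Bytes   int           // raw bytes consumed from the file for this message
	Fields  common.MapStr // optional fields attached by readers
}

The harvester must advance its offset by Bytes rather than len(Content), because decoding can change the content length, which is exactly the distinction urso describes.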

The harvester cannot tell which offset is correct or not, as the harvester does not care about the contents. The issue does look like truncation. Here the reader should make an attempt to check whether that is the case (very likely if this comes from a container) and send an ErrTruncate, which would force the harvester to consider the file to be new. If the reader thinks the file is just broken, the reader should make an attempt to find a safe point in the file to continue reading from, without having to notify the harvester at all.
All errors but ErrLineUnparsable are about detecting underlying changes that invalidate the reader, basically making us consider the file to be 'done' as is. It rather makes sense to remove ErrLineUnparsable and ensure that readers always report state that is continuable upon restart.

@exekias (Contributor) commented Mar 26, 2021

I think we should get back to this issue; we have seen a few cases related to it. In general I agree this looks like a truncate, but the fact that we stop sending logs afterwards creates a really bad experience.

I would rather get this in with the proposed changes while we keep investigating issues related to the different log rotation mechanisms.

@ChrsMark (Member, Author) commented:

So if I understand correctly, the preferred way to fix this is to change DockerJSONReader so that it finds the next valid line by skipping the error at

return message, reader.ErrLineUnparsable

(continuing the for loop instead of returning the error)?

@exekias (Contributor) commented Mar 29, 2021

That could be an option to improve things. I liked what @urso said about ErrTruncate; I didn't know about it, do we have it? Perhaps this would be a better way to handle the situation. It would make the input start over from the beginning, so we would miss fewer logs.

The only problem I see with trying to detect truncate situations is that permanent errors in the format (or bugs on our side) would end up in an infinite loop of input restarts, which could be a bad idea. I guess we can counter this with backoffs.

@urso commented Mar 29, 2021

Truncation is the most likely cause, but not necessarily the only one. If the input is configured with tail_files: true, we have another potential cause: whenever the input is restarted, the next offset might be in the middle of some event, as the current file size is taken as the next start offset. That case should be handled by the container input.

In general I would prefer to fix the docker JSON parser by 'ignoring' the error. The parser reports how many bytes it has consumed in order to generate the current message. The parser can ignore the failure by resetting its internal buffer while keeping the current byte count. The new filestream input can detect file truncation asynchronously and restart the harvester in time, even if the harvester is blocked in the output and can't detect the truncation for that reason. This reduces the chance of missing logs when the file is truncated. Fixing the parser will also allow us to reuse it in the filestream input in the future, and to handle errors due to truncation, or due to having the wrong offset for other unknown causes.
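
A minimal sketch of that direction, assuming a reader whose per-line parse step can fail; the receiver and method names approximate DockerJSONReader, but the body and the parseLine signature are illustrative, not the actual beats code:

// Sketch: swallow the parse error inside the reader, but keep counting
// the consumed bytes so the returned message accounts for them.
func (p *DockerJSONReader) Next() (reader.Message, error) {
	var skippedBytes int
	for {
		message, err := p.reader.Next()
		if err != nil {
			return message, err
		}
		if parseErr := p.parseLine(&message); parseErr != nil {
			// Drop the broken content, remember how much of the file it
			// consumed, and try the next line.
			skippedBytes += message.Bytes
			continue
		}
		// Let the harvester advance the registry offset past the junk
		// that was skipped before this valid message.
		message.Bytes += skippedBytes
		return message, nil
	}
}

This keeps the harvester oblivious: it only ever sees valid messages whose Bytes cover everything consumed, so the registry offset stays continuable across restarts.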


@ChrsMark (Member, Author) commented:

@urso @exekias I tried to move this in the direction of 'ignoring' the error in the reader for now, based on @urso's latest comment. Let me know what you think.

Content: []byte{},
Bytes: 0,
}, io.EOF
}
Review comment (Contributor):

Can you add a test where the first line is broken (from a truncate) but the next one is valid? We should get the content from the valid line only, but Bytes should account for both.
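
A rough sketch of such a test; New and newTestReader stand in for the package's real constructor and test helper, whose names and signatures are assumptions:

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestSkippedBrokenLineCountsBytes(t *testing.T) {
	broken := "this is not docker json\n" // e.g. the tail left by a truncate
	valid := `{"log":"hello\n","stream":"stdout","time":"2021-04-01T00:00:00Z"}` + "\n"

	// Hypothetical constructor call; the real one may take different arguments.
	r := New(newTestReader(broken+valid), "all", false, false)

	msg, err := r.Next()
	require.NoError(t, err)

	// Content should come from the valid line only...
	assert.Equal(t, []byte("hello\n"), msg.Content)
	// ...while Bytes accounts for the broken line plus the valid one.
	assert.Equal(t, len(broken)+len(valid), msg.Bytes)
}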

@exekias (Contributor) left a comment:

This came out really simple in the end! I left a comment about testing.

@ChrsMark (Member, Author) commented:

> This came out really simple in the end! I left a comment about testing.

Added

@ChrsMark ChrsMark merged commit 655984e into elastic:master Apr 1, 2021
ChrsMark added a commit to ChrsMark/beats that referenced this pull request Apr 1, 2021
ChrsMark added a commit to ChrsMark/beats that referenced this pull request Apr 1, 2021
-		case reader.ErrLineUnparsable:
-			logp.Info("Skipping unparsable line in file: %v", h.state.Source)
-			//line unparsable, go to next line
-			continue
Review comment (Member):

I love when issues get solved by removing code 🙂

ChrsMark added a commit that referenced this pull request Apr 1, 2021
ChrsMark added a commit that referenced this pull request Apr 1, 2021
ChrsMark added a commit that referenced this pull request Apr 2, 2021
v1v added a commit to v1v/beats that referenced this pull request Apr 7, 2021
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
Labels: bug · Team:Platforms (Integrations - Platforms team) · v6.8.13 · v7.12.1 · v7.13.0

5 participants