Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Journal: A "bad message" error results in an infinite hot-loop #4053

Closed
href opened this issue Jul 27, 2021 · 8 comments
Closed

Journal: A "bad message" error results in an infinite hot-loop #4053

href opened this issue Jul 27, 2021 · 8 comments
Labels
component/promtail type/bug Somehing is not working as expected

Comments

@href
Copy link

href commented Jul 27, 2021

Description
The systemd journal library may at times follow logs that it cannot process. When that happens, promtail enters an infinite loop that will produce the following error messages as fast as it can (with newlines for claritry):

Jul 23 08:24:35 host promtail[425]:
level=error
ts=2021-07-23T06:24:35.747004119Z
caller=journaltarget.go:185
msg="received error during sdjournal follow"
err="failed to iterate journal: bad message"

This is likely related to #2928, which introduced a loop over the journal follow logic.

Environment
Systemd 241 and 245

promtail, version 2.2.1 (branch: HEAD, revision: babea82e)
build user:       root@e2d295b84e26
build date:       2021-04-06T00:52:41Z
go version:       go1.15.3
platform:         linux/amd64

Theory
The intention of #2928 is to skip log entries when "bad message" occurs, but from my study of the source code involved, this is likely not what's happening:

The error clearly bubbles out of Follow in the following snippet:

err := t.r.Follow(until, ioutil.Discard)
if err != nil {
if err == sdjournal.ErrExpired || err == io.EOF {
return
}
level.Error(t.logger).Log("msg", "received error during sdjournal follow", "err", err.Error())
}

The error is logged and Follow is called again. For it to produce a differing result, the current line would have to be skipped at some point, or we are likely reading the same result over and over again (which is what we are observing).

Inside Follow the Read function is called, which is the function that advances the current cursor here:

func (r *JournalReader) Read(b []byte) (int, error) {
if r.msgReader == nil {
// Advance the journal cursor. It has to be called at least one time
// before reading
c, err := r.journal.Next()
// An unexpected error
if err != nil {
return 0, err
}

Internally, this calls sd_journal_next, which in turn calls real_journal_next:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L816

Towards the end of that function, journal_file_move_to_object is called. This is where EBADMSG comes from! Unfortunately, reading a bad message does not advance the cursor, as an error forces to function to exit before that:
https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L861-L867

Expected Behavior
Ideally we would like to skip bad messages, but I'm not sure this is possible, as the skip functions systemd offers are really just "next" calls with a counter. It might make sense to treat these errors like end-of-file errors, since this seems to be what journald is doing internally:

https://unix.stackexchange.com/a/87862/413524

@cyriltovena
Copy link
Contributor

cyriltovena commented Jul 27, 2021

May be one of those two could help https://github.com/coreos/go-systemd/blob/main/sdjournal/journal.go#L572-L606 ? what do you think ? Although it's not available in the reader/follow interface so we might need to fork it or contribute upstream to it to give access to the underlying journal.

@href
Copy link
Author

href commented Jul 27, 2021

My impression is that it won't, since Next is already called every time Follow is called. It seems to be unable to move beyond the bad message. And internally, next skip is just implemented as a number of next calls:

https://github.com/systemd/systemd/blob/91d0750dbf65e1ffa627fa880c50673a27758cf6/src/libsystemd/sd-journal/sd-journal.c#L897-L907

Interestingly, I can run journalctl --verify without issue, so the journal files should be okay. I'm a bit limited in what I can check at the moment though. Hopefully I can follow-up with something reproducible.

@dannykopping
Copy link
Contributor

@href please see #4066 and give us your thoughts.
This won't really solve the problem of stepping over bad messages (@cyriltovena and I paired up for an hour on this today and could not reproduce), but it does prevent the hot-loop at the very least.

@href
Copy link
Author

href commented Jul 28, 2021

I haven't been able to reproduce this in the past 24 hours either (before it repeatedly happened on multiple hosts over the course of a few days). What those hosts share is a recent hard reboot, so we suspect that this somehow introduced corrupt journal files. Though we were not able to confirm this theory with systemctl --verify.

In any case, I had a look at your PR and to me it looks great. This would certainly defuse the situation and would allow us to go live with promtail. We feared that it would repeat in prod, causing our log partitions to quickly fill up. I don't see that happening again with your change. To keep trying every 100ms for other errors seems alright to me.

Thanks a lot for fixing this so quickly! I'm impressed ❤️

@dannykopping
Copy link
Contributor

Awesome, thanks for reporting this @href and happy Promtailing 🎉

@href
Copy link
Author

href commented Sep 8, 2021

I was wondering, when will this be included in a release? I didn't see any mention of this in the 2.3.0 change log, so I'm unsure if this made it into it or not.

@dannykopping
Copy link
Contributor

dannykopping commented Sep 8, 2021

@href hhmm, doesn't seem like it was included in v2.3.0, for some reason.
I'll make sure it's included in the v2.3.1 patch release coming soon

@href
Copy link
Author

href commented Sep 8, 2021

Awesome, thank you!

@chaudum chaudum added the type/bug Somehing is not working as expected label Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/promtail type/bug Somehing is not working as expected
Projects
None yet
Development

No branches or pull requests

4 participants