Filebeat stops shipping journald logs when it encounters a "failed to read message field: bad message" error #32782

Closed
Tracked by #37086
snjoetw opened this issue Aug 23, 2022 · 16 comments · Fixed by #40558
Labels: bug, Team:Elastic-Agent-Data-Plane

Comments


snjoetw commented Aug 23, 2022

Filebeat seems to stop sending journald logs when it encounters a "failed to read message field: bad message" error.

journald input:

- type: journald
  id: appdaemon_logs
  paths:
    - /var/log/journal
  include_matches.match:
    - syslog.priority=6

Logs:

{
	"log.level": "error",
	"@timestamp": "2022-08-23T00:31:06.768-0700",
	"log.logger": "input.journald",
	"log.origin": {
		"file.name": "compat/compat.go",
		"file.line": 124
	},
	"message": "Input 'journald' failed with: input.go:130: input appdaemon_logs failed (id=appdaemon_logs)\n\tfailed to read message field: bad message",
	"service.name": "filebeat",
	"id": "appdaemon_logs",
	"ecs.version": "1.6.0"
}

Filebeat 8.3.3

A similar bug was also reported in Loki; I'm not sure whether the fix is applicable here:
Bug Report: grafana/loki#2812
PR: grafana/loki#2928

@botelastic bot added the needs_team label on Aug 23, 2022
@jsoriano (Member)

Could be related to the issues discussed in #23627

@jsoriano added the Team:Elastic-Agent-Data-Plane label on Aug 25, 2022
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic bot removed the needs_team label on Aug 25, 2022

trauta commented Jan 18, 2023

I'm having the same problem with filebeat version 7.17.8 (amd64) on Ubuntu 20.04.5 LTS.

Here is the journald input config:

- type: journald
  id: syslog
  seek: cursor
  fields_under_root: false
  tags:
    - journald
    - syslog

From the filebeat log:

2023-01-11T12:19:13.844+0100	ERROR	[input.journald]	compat/compat.go:124	Input 'journald' failed with: input.go:130: input syslog failed (id=syslog)
	failed to iterate journal: bad message	{"id": "syslog"}

After this error message the journald input completely stops; no new journald events are transmitted.
When filebeat is restarted via systemd, the missing journal events are transmitted without any problems.

This is very irritating behavior.

Is it possible to implement a fix so that, in case of such an error, the bad message is skipped and the input continues to parse the other messages?


trauta commented Jun 30, 2023

Hi, any updates on this?

cmacknz (Member) commented Jul 4, 2023

Sorry, we haven't been able to prioritize this issue yet.

@nicklausbrown

Any chance this will be prioritized soon, or is there a timeline for promoting journald out of tech preview? Thanks!


georgivalentinov commented Aug 9, 2023

It seems the underlying issue is with systemd: they appear to have introduced a regression and are having trouble finding and fixing it.
That being said, this is a major issue, as people are moving away from log files and the use of journald is prevalent these days.

Any chance such records (or entire journal files) can be skipped by Filebeat so it just continues with the rest? As it is today, it blocks and silently stops processing any log entries upon stumbling into a problematic spot.


groegeorg commented Apr 18, 2024

I'm also desperately hoping for a fix on this issue. I second georgivalentinov's statement: "Any chance such records (or entire journald files) can be skipped by Filebeat and just continue with the rest?"

cmacknz (Member) commented Apr 18, 2024

We have made taking journald out of tech preview a priority (see #37086), and this needs to be fixed as part of that work.

belimawr (Contributor) commented May 1, 2024

I've been investigating this crash. It is reproducible like #34077; however, it also happens with the following version of journald from Ubuntu 24.04 LTS:

systemd 255 (255.4-1ubuntu8)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

It is coming from:

entry, err := r.journal.GetEntry()
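
For context, here is a minimal sketch of the read loop around that call, assuming the github.com/coreos/go-systemd/v22/sdjournal bindings the input used at the time; this is illustrative, not Filebeat's actual code:

package main

import (
    "fmt"
    "log"

    "github.com/coreos/go-systemd/v22/sdjournal"
)

func main() {
    // Open the local system journal (requires libsystemd/cgo).
    j, err := sdjournal.NewJournal()
    if err != nil {
        log.Fatalf("opening journal: %v", err)
    }
    defer j.Close()

    for {
        // Advance the read pointer; a failure here surfaces as
        // "failed to iterate journal".
        n, err := j.Next()
        if err != nil {
            log.Fatalf("iterating journal: %v", err)
        }
        if n == 0 {
            break // reached the end of the journal
        }

        // Read all fields of the current entry; a corrupted entry
        // surfaces here as "failed to read message field: bad message".
        entry, err := j.GetEntry()
        if err != nil {
            // Logging and continuing here, instead of returning the
            // error, is what would let the input skip a bad entry.
            log.Printf("skipping unreadable entry: %v", err)
            continue
        }
        fmt.Println(entry.Fields["MESSAGE"])
    }
}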

belimawr (Contributor) commented Aug 9, 2024

Even after merging #40061, I can still reproduce this "bad message" error.

Here is how the new error log looks:

{
  "log.level": "error",
  "@timestamp": "2024-08-08T20:33:26.947Z",
  "log.logger": "input.journald",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalctl.(*Reader).Close",
    "file.name": "journalctl/reader.go",
    "file.line": 256
  },
  "message": "Journalctl wrote to stderr: Failed to iterate through journal: Bad message\n",
  "service.name": "filebeat",
  "id": "PR-testig",
  "input_source": "LOCAL_SYSTEM_JOURNAL",
  "path": "LOCAL_SYSTEM_JOURNAL",
  "ecs.version": "1.6.0"
}

It seems that even journalctl is struggling to read some messages. What makes it hard to debug is that so far I've only managed to reproduce it when the system and the journal are under high load, and sometimes I get a memory error first (see #39352).

The solution here seems to be the same: make the journald input restart journalctl. However, we need to understand what happens to the message/cursor. Is the message lost? Will we get stuck on this message? Can we skip it in code and log an error/warning for the user?
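
To make that concrete, here is a rough sketch of such a restart loop, assuming we run journalctl as a child process with the flags shown later in this thread and track the standard __CURSOR field from its JSON output; the structure is illustrative, not the actual implementation from #40558:

package main

import (
    "bufio"
    "encoding/json"
    "log"
    "os/exec"
)

func main() {
    cursor := "" // last cursor we successfully processed

    for {
        args := []string{"--utc", "--output=json", "--follow"}
        if cursor != "" {
            // Resume right after the last good entry instead of
            // re-reading the whole journal.
            args = append(args, "--after-cursor", cursor)
        }

        cmd := exec.Command("journalctl", args...)
        stdout, err := cmd.StdoutPipe()
        if err != nil {
            log.Fatalf("creating stdout pipe: %v", err)
        }
        if err := cmd.Start(); err != nil {
            log.Fatalf("starting journalctl: %v", err)
        }

        scanner := bufio.NewScanner(stdout)
        // Journal entries can be large; raise the default 64KiB limit.
        scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
        for scanner.Scan() {
            var entry map[string]any
            if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
                log.Printf("skipping malformed line: %v", err)
                continue
            }
            if c, ok := entry["__CURSOR"].(string); ok {
                cursor = c // remember how far we got
            }
            // ... publish the entry to the pipeline here ...
        }

        // journalctl exited, e.g. with "Failed to iterate through
        // journal: Bad message"; restart from the last known cursor
        // instead of stopping the input for good.
        err = cmd.Wait()
        log.Printf("journalctl exited: %v; restarting from cursor %q", err, cursor)
    }
}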

belimawr (Contributor) commented Aug 9, 2024

@pierrehilbert I changed the status to 'need technical definition' because we need to decide how to handle this error. At the moment the best option seems to be making the journald input resilient to journalctl crashes and then validating whether this is still an issue that needs to be addressed directly.

cmacknz (Member) commented Aug 12, 2024

> at the moment the best option seems to be making the journald input resilient to journalctl crashes

Even if this bug didn't exist we should be doing this and automatically recovering from as many problem situations as we can.

@belimawr (Contributor)

> Even if this bug didn't exist we should be doing this and automatically recovering from as many problem situations as we can.

I agree, but we need to be careful about how we implement this to avoid getting stuck on a "bad message". So far I have only managed to reproduce this when putting the system under stress to test/investigate the journal-rotation-related crashes.

We already have an issue to track this improvement: #39355.
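
Purely as an illustration of the "getting stuck" concern, a hypothetical guard could stop restarting journalctl when it keeps dying without the cursor advancing; the threshold, backoff, and names below are made up:

package main

import (
    "log"
    "time"
)

// restartGuard tracks how often journalctl died at the same cursor, so a
// single unreadable entry cannot keep us in an endless restart loop.
type restartGuard struct {
    lastCursor string
    crashes    int
}

func (g *restartGuard) shouldRestart(cursor string) bool {
    if cursor == g.lastCursor {
        g.crashes++
    } else {
        g.lastCursor, g.crashes = cursor, 0
    }
    if g.crashes >= 5 {
        log.Printf("journalctl died %d times at cursor %q, giving up", g.crashes, cursor)
        return false
    }
    time.Sleep(time.Duration(g.crashes) * time.Second) // linear backoff
    return true
}

func main() {
    g := &restartGuard{}
    // Simulate journalctl dying repeatedly at the same cursor.
    for g.shouldRestart("example-cursor") {
        log.Println("restarting journalctl")
    }
}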

@belimawr (Contributor)

This is not fixed by #40558

@belimawr reopened this Sep 13, 2024
@belimawr (Contributor)

I've just managed to reproduce this issue by calling journalctl directly:

[root@aws-test-journald ~]# journalctl --utc --output=json --follow --after-cursor 's=922eeded44734fd9b2fe892ceb4ec2df;i=25a960;b=5d68f7c3ebf040879a4ab3ccbb2d965b;m=1a474efb16;t=621f2d4293733;x=c148d181bdad7f5b' > ./journal2.ndjson
Failed to iterate through journal: Bad message

Both Filebeat and journalctl were able to continue reading the journal using the last known cursor, and I did not see any indication that we'd get stuck in a "Bad message" crash loop, so I'm closing this issue as solved by #40558.
