Filebeat stops shipping journald logs when it encounters a "failed to read message field: bad message" error #32782

Closed
Tracked by #37086
snjoetw opened this issue Aug 23, 2022 · 16 comments · Fixed by #40558
Assignees
Labels
bug, Team:Elastic-Agent-Data-Plane

Comments

@snjoetw

snjoetw commented Aug 23, 2022

Filebeat seems to stop sending journald logs when it encounters a "failed to read message field: bad message" error.

journald input:

- type: journald
  id: appdaemon_logs
  paths:
    - /var/log/journal
  include_matches.match:
    - syslog.priority=6

Logs:

{
	"log.level": "error",
	"@timestamp": "2022-08-23T00:31:06.768-0700",
	"log.logger": "input.journald",
	"log.origin": {
		"file.name": "compat/compat.go",
		"file.line": 124
	},
	"message": "Input 'journald' failed with: input.go:130: input appdaemon_logs failed (id=appdaemon_logs)\n\tfailed to read message field: bad message",
	"service.name": "filebeat",
	"id": "appdaemon_logs",
	"ecs.version": "1.6.0"
}

Filebeat 8.3.3

A similar bug has also been reported in Loki; not sure if the fix is similar or not:
Bug Report: grafana/loki#2812
PR: grafana/loki#2928

@botelastic botelastic bot added the needs_team label Aug 23, 2022
@jsoriano
Member

Could be related to the issues discussed in #23627

@jsoriano jsoriano added the Team:Elastic-Agent-Data-Plane label Aug 25, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team label Aug 25, 2022
@trauta

trauta commented Jan 18, 2023

I'm having the same problem with filebeat version 7.17.8 (amd64) on Ubuntu 20.04.5 LTS.

Here is the journald input config:

- type: journald
  id: syslog
  seek: cursor
  fields_under_root: false
  tags:
    - journald
    - syslog

From the filebeat log:

2023-01-11T12:19:13.844+0100	ERROR	[input.journald]	compat/compat.go:124	Input 'journald' failed with: input.go:130: input syslog failed (id=syslog)
	failed to iterate journal: bad message	{"id": "syslog"}

After this error message the journald input completely stops; no new journald events are transmitted.
When restarting filebeat via systemd, the missing journal events get transmitted without any problems.

This is a very irritating behavior.

Is it possible to implement a fix so that, in case of such an error, the 'bad message' is skipped and the input continues to parse the other messages?
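
For illustration, a minimal sketch of what such skip-and-continue handling could look like on top of the go-systemd sdjournal bindings (the library the journald input used at the time). This is not the actual Filebeat code: `isBadMessage` is a hypothetical helper that only matches on the error text, and the sketch only covers the case where reading a single entry fails.

```go
// Sketch: read the local journal and skip entries that cannot be read
// instead of stopping the whole input.
package main

import (
	"log"
	"strings"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

// isBadMessage is a hypothetical classifier; sdjournal surfaces the libsystemd
// error (EBADMSG) as text, so this simply matches on "bad message".
func isBadMessage(err error) bool {
	return err != nil && strings.Contains(strings.ToLower(err.Error()), "bad message")
}

func main() {
	j, err := sdjournal.NewJournal()
	if err != nil {
		log.Fatalf("opening journal: %v", err)
	}
	defer j.Close()

	for {
		n, err := j.Next() // advance the read pointer to the next entry
		if err != nil {
			log.Fatalf("iterating journal: %v", err)
		}
		if n == 0 {
			j.Wait(sdjournal.IndefiniteWait) // tail mode: block until new entries arrive
			continue
		}
		entry, err := j.GetEntry()
		if isBadMessage(err) {
			// Drop the unreadable entry: the next Next() call advances past it,
			// so log a warning and keep reading instead of aborting the input.
			log.Printf("skipping unreadable journal entry: %v", err)
			continue
		}
		if err != nil {
			log.Fatalf("reading journal entry: %v", err)
		}
		log.Println(entry.Fields["MESSAGE"])
	}
}
```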

@trauta

trauta commented Jun 30, 2023

Hi, any updates on this?

@cmacknz
Member

cmacknz commented Jul 4, 2023

Sorry, we haven't been able to prioritize this issue yet.

@nicklausbrown

Any chance this will be prioritized soon, or is there a timeline for promoting journald out of tech preview? Thanks!

@georgivalentinov

georgivalentinov commented Aug 9, 2023

It seems like the underlying issue is with systemd: they appear to have introduced a regression and are having trouble finding and fixing it.
That being said, this is a major issue, as people are moving away from log files and the use of journald is prevalent these days.

Any chance such records (or entire journald files) can be skipped by Filebeat and just continue with the rest? As it is today, it blocks and silently stops processing any log entries upon stumbling into a problematic spot.

@groegeorg

groegeorg commented Apr 18, 2024

I'm also desperately hoping for a fix for this issue. I second georgivalentinov's statement: "Any chance such records (or entire journald files) can be skipped by Filebeat and just continue with the rest?"

@cmacknz
Member

cmacknz commented Apr 18, 2024

Taking journald out of tech preview is a priority for us (see #37086), and this needs to be fixed as part of that work.

@belimawr
Contributor

belimawr commented May 1, 2024

I've been investigating this crash. It is reproducible like #34077; however, it also happens with the following version of journald from Ubuntu 24.04 LTS:

systemd 255 (255.4-1ubuntu8)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

It is coming from

entry, err := r.journal.GetEntry()
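
As a debugging aid (not part of the shipped code), here is a hedged sketch of wrapping that call so a failure is annotated with the journal cursor, which can then be inspected with `journalctl --after-cursor <cursor>`. `getEntryWithCursor` is a hypothetical helper built on the same sdjournal API:

```go
// Sketch: annotate GetEntry failures with the journal cursor so the corrupted
// region can be inspected later with `journalctl --after-cursor <cursor>`.
package main

import (
	"fmt"
	"log"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

// getEntryWithCursor is a hypothetical wrapper around sdjournal's GetEntry.
func getEntryWithCursor(j *sdjournal.Journal) (*sdjournal.JournalEntry, error) {
	entry, err := j.GetEntry()
	if err == nil {
		return entry, nil
	}
	// GetCursor may itself fail on a corrupted entry, so treat the cursor
	// as best-effort debugging information only.
	cursor, cerr := j.GetCursor()
	if cerr != nil {
		cursor = "<unavailable>"
	}
	return nil, fmt.Errorf("failed to read message field at cursor %q: %w", cursor, err)
}

func main() {
	j, err := sdjournal.NewJournal()
	if err != nil {
		log.Fatalf("opening journal: %v", err)
	}
	defer j.Close()

	if n, err := j.Next(); err != nil || n == 0 {
		log.Fatalf("no readable entry: n=%d err=%v", n, err)
	}
	entry, err := getEntryWithCursor(j)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(entry.Fields["MESSAGE"])
}
```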

@belimawr
Contributor

belimawr commented Aug 9, 2024

Even after merging #40061, I can still reproduce this "cannot allocate memory" error.

Here is how the new error log looks:

{
  "log.level": "error",
  "@timestamp": "2024-08-08T20:33:26.947Z",
  "log.logger": "input.journald",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalctl.(*Reader).Close",
    "file.name": "journalctl/reader.go",
    "file.line": 256
  },
  "message": "Journalctl wrote to stderr: Failed to iterate through journal: Bad message\n",
  "service.name": "filebeat",
  "id": "PR-testig",
  "input_source": "LOCAL_SYSTEM_JOURNAL",
  "path": "LOCAL_SYSTEM_JOURNAL",
  "ecs.version": "1.6.0"
}

It seems that even journalctl is struggling to read some messages. What makes it hard to debug is that, so far, I've only managed to reproduce it when the system and the journal are under high load, and sometimes I get a memory error first (see #39352).

The solution here seems to be the same: make the journald input restart journalctl. However, we need to understand what happens to the message/cursor. Is the message lost? Will we get stuck on this message? Can we skip it in code and log an error/warning for the user?
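
To make the restart idea concrete, here is a hedged sketch (not the actual #40558 implementation): supervise a `journalctl --follow` child process and, whenever it exits, restart it from the last cursor that was successfully consumed. The `__CURSOR` field comes from journalctl's JSON output; the function and variable names are illustrative.

```go
// Sketch: keep journalctl alive and resume from the last consumed cursor.
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"os/exec"
	"time"
)

func run(handle func(map[string]any)) {
	var cursor string // last cursor we successfully handled
	for {
		args := []string{"--utc", "--output=json", "--follow"}
		if cursor != "" {
			args = append(args, "--after-cursor", cursor)
		}
		cmd := exec.Command("journalctl", args...)
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			log.Fatalf("creating stdout pipe: %v", err)
		}
		if err := cmd.Start(); err != nil {
			log.Fatalf("starting journalctl: %v", err)
		}

		scanner := bufio.NewScanner(stdout)
		scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // journal entries can be large
		for scanner.Scan() {
			var entry map[string]any
			if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
				continue // skip lines that are not valid JSON
			}
			handle(entry)
			if c, ok := entry["__CURSOR"].(string); ok {
				cursor = c // remember where to resume after a crash
			}
		}

		// journalctl exited (e.g. "Failed to iterate through journal: Bad message"):
		// log it, back off briefly, and restart from the last known-good cursor.
		err = cmd.Wait()
		log.Printf("journalctl exited: %v; restarting from cursor %q", err, cursor)
		time.Sleep(2 * time.Second)
	}
}

func main() {
	run(func(entry map[string]any) {
		if msg, ok := entry["MESSAGE"].(string); ok {
			log.Println(msg)
		}
	})
}
```

This only addresses keeping the input alive; whether the entry at the failing position is lost or retried still depends on how journalctl behaves around the corrupted region, which is exactly the open question above.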

@belimawr
Contributor

belimawr commented Aug 9, 2024

@pierrehilbert I changed the status to 'need technical definition' because we need to decide how to handle this error. At the moment the best option seems to be making the journald input resilient to journalctl crashes and then validating whether this is still an issue that needs to be addressed directly.

@cmacknz
Member

cmacknz commented Aug 12, 2024

At the moment the best option seems to be making the journald input resilient to journalctl crashes

Even if this bug didn't exist we should be doing this and automatically recovering from as many problem situations as we can.

@belimawr
Contributor

Even if this bug didn't exist we should be doing this and automatically recovering from as many problem situations as we can.

I agree, but we need to be careful about how we implement this to avoid getting stuck on a "bad message". So far I have only managed to reproduce this when putting the system under stress to test/investigate the journal-rotation-related crashes.

We already have an issue to track this improvement: #39355.
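
For the "getting stuck" concern above, a small hedged sketch of one possible guard (purely illustrative, not what was implemented): track whether the cursor moved between journalctl restarts and escalate once several restarts make no progress.

```go
// Sketch: guard against restarting forever on the same corrupted spot.
// If journalctl keeps dying without the cursor moving, stop retrying
// (or skip ahead) instead of spinning on the same "bad message".
package main

import "log"

type stuckDetector struct {
	lastCursor string
	sameCount  int
	maxSame    int // how many no-progress restarts to tolerate
}

// onExit records the cursor at which journalctl exited and reports whether
// the input appears stuck (no forward progress across several restarts).
func (d *stuckDetector) onExit(cursor string) bool {
	if cursor != "" && cursor == d.lastCursor {
		d.sameCount++
	} else {
		d.sameCount = 0
	}
	d.lastCursor = cursor
	return d.sameCount >= d.maxSame
}

func main() {
	d := &stuckDetector{maxSame: 3}
	for _, c := range []string{"c1", "c2", "c2", "c2", "c2"} {
		if d.onExit(c) {
			log.Printf("no progress after %d restarts at cursor %q, giving up", d.sameCount, c)
			return
		}
	}
}
```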

@belimawr
Contributor

This is not fixed by #40558

@belimawr belimawr reopened this Sep 13, 2024
@belimawr
Contributor

I've just managed to reproduce this issue by calling journalctl directly:

[root@aws-test-journald ~]# journalctl --utc --output=json --follow --after-cursor 's=922eeded44734fd9b2fe892ceb4ec2df;i=25a960;b=5d68f7c3ebf040879a4ab3ccbb2d965b;m=1a474efb16;t=621f2d4293733;x=c148d181bdad7f5b' > ./journal2.ndjson
Failed to iterate through journal: Bad message

Both Filebeat and journalctl were able to continue reading the journal using the last known cursor, and I did not see any indication that we'd get stuck in a "Bad message" crash loop, so I'm closing this issue as solved by #40558.

djaglowski pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Oct 28, 2024
#### Description
According to the community, there are bugs in systemd that could corrupt
the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen some issues reported in the Elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, the otelcol is not immune to these issues. When the
journalctl process exits for any reason, log consumption from
journald just stops. We've experienced this on some machines that have
a high log volume. Currently we monitor the journalctl processes started
by otelcol and restart the otelcol when some of them are missing. IMO,
the journald receiver itself should monitor the journalctl process it
starts and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits
unexpectedly. As long as the journalctl cmd can be started (via
`Cmd.Start()`) successfully, the journald_input will always try to
restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR,
`operator.Start` waits up to 1 sec to capture any immediate error
returned from journalctl. After the PR, the error won't be reported back
even if journalctl exits immediately after start; instead, the error
will be logged and the process will be restarted.

The fix is largely inspired by
elastic/beats#40558.

#### Testing
Add a simple bash script that prints a line every second, and load it into systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour. The
journalctl process will be restarted after the backoff period (hardcoded
to 2 sec):
```bash
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.108.0/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```

---------

Signed-off-by: Mengnan Gong <namco1992@gmail.com>
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this issue Oct 29, 2024
…-telemetry#35635)
