
[Journald] input crashes with "failed to read message field: cannot allocate memory" #39352

Closed · Tracked by #37086
belimawr opened this issue May 1, 2024 · 9 comments · Fixed by #40558
Labels: bug, Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)

@belimawr
Contributor

belimawr commented May 1, 2024

Filebeat: 8.13.2
Host OS: Amazon Linux 2
Systemd/Journald version: systemd 252 (252.16-1.amzn2023.0.2)

journalctl --version
systemd 252 (252.16-1.amzn2023.0.2)
+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 -BZIP2 -LZ4 +XZ -ZLIB -ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

## How to reproduce

  1. Flood journald with logs so it rotates logs every minute or so. Mostly follow [Filebeat] Journald causes Filebeat to crash #34077 (comment); a minimal flood sketch is also shown right after this list.
  2. Start Filebeat with the config from the above link
  3. Wait until Journald reaches its maximum number of files and starts deleting old entries
  4. Filebeat might crash due to [Filebeat] Journald causes Filebeat to crash #34077; that's OK, ignore it
  5. Let the logs flow for a while (I waited for hours)
  6. Start Filebeat again
  7. The journald input will fail with:
    {"log.level":"error","@timestamp":"2024-05-01T19:29:01.010Z","log.logger":"input.journald","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/v2/compat.(*runner).Start.func1","file.name":"compat/compat.go","file.line":132},"message":"Input 'journald' failed with: input.go:130: input journald-input failed (id=journald-input)\n\tfailed to read message field: cannot allocate memory","service.name":"filebeat","id":"journald-input","ecs.version":"1.6.0"}
    

Sometimes Filebeat might just crash again. I also saw it failing once or twice with the same message as in #32782.

Both seem to be related to Filebeat being too far behind when reading the journal, probably further behind than what journald has stored on disk.

In both cases the error comes from the journald library we use, github.com/coreos/go-systemd/v22.
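For reference, a minimal read loop against that library looks roughly like this (a sketch, not Filebeat's actual code; the "cannot allocate memory" (ENOMEM) error would surface as the error returned while reading an entry's fields):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

func main() {
	j, err := sdjournal.NewJournal() // open the local journal (needs libsystemd/cgo)
	if err != nil {
		log.Fatal(err)
	}
	defer j.Close()

	for {
		n, err := j.Next() // advance to the next entry
		if err != nil {
			log.Fatal(err)
		}
		if n == 0 {
			j.Wait(time.Second) // no new entries yet, block for a bit
			continue
		}
		// Reading the entry's fields is where an ENOMEM from libsystemd
		// ("cannot allocate memory") would be returned and wrapped by the caller.
		entry, err := j.GetEntry()
		if err != nil {
			log.Fatalf("failed to read entry: %v", err)
		}
		fmt.Println(entry.Fields["MESSAGE"])
	}
}
```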

@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr
Contributor Author

belimawr commented Aug 9, 2024

Even after merging #40061, I can still reproduce this "cannot allocate memory" error.

Here is how the new error log looks:

{
  "log.level": "error",
  "@timestamp": "2024-08-08T21:02:16.242Z",
  "log.logger": "input.journald",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/filebeat/input/journald/pkg/journalctl.(*Reader).Close",
    "file.name": "journalctl/reader.go",
    "file.line": 256
  },
  "message": "Journalctl wrote to stderr: Failed to get journal fields: Cannot allocate memory\n",
  "service.name": "filebeat",
  "id": "PR-testing",
  "input_source": "LOCAL_SYSTEM_JOURNAL",
  "path": "LOCAL_SYSTEM_JOURNAL",
  "ecs.version": "1.6.0"
}

This seems to be happening in journalctl itself; it's reproducible in the same way, but it takes longer to happen. I noticed it happening more often when the system is under pressure (a VM with all CPUs at 100% and 2 GB of RAM).

Interestingly enough, this situation is also reproducible when running the OTel journald receiver, which also calls journalctl and reads its JSON output.

I believe the easiest solution for this in Filebeat is to be more resilient to journalctl crashes, restarting journalctl instead of stopping the input.
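A minimal sketch of that idea (not the actual Filebeat implementation): run journalctl as a child process and restart it with a small backoff whenever it exits, instead of failing the input:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	backoff := 2 * time.Second
	for {
		// journalctl streams entries as JSON, one object per line.
		cmd := exec.Command("journalctl", "--utc", "--output=json", "--follow")
		stdout, err := cmd.StdoutPipe()
		if err != nil {
			log.Fatal(err)
		}
		if err := cmd.Start(); err != nil {
			log.Fatal(err)
		}

		scanner := bufio.NewScanner(stdout)
		scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // journal entries can be large
		for scanner.Scan() {
			fmt.Println(scanner.Text()) // hand each JSON entry to the pipeline
		}

		// journalctl exited (crash, ENOMEM, OOM-kill, ...): log and restart it
		// rather than stopping the whole input.
		err = cmd.Wait()
		log.Printf("journalctl exited (err=%v), restarting in %s", err, backoff)
		time.Sleep(backoff)
	}
}
```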

@belimawr
Contributor Author

belimawr commented Aug 9, 2024

@pierrehilbert I changed the status to 'need technical definition' because we need to decide how to handle this error. At the moment the best option seems to be making the journald input resilient to journalctl crashes and then validating whether this is still an issue that needs to be addressed directly.

@cmacknz
Member

cmacknz commented Aug 12, 2024

Same as #32782 essentially.

We need to automatically recover from this if we aren't already.

@cmacknz
Member

cmacknz commented Aug 12, 2024

This one seems more like system memory exhaustion though, how much memory is available on the host when it happens? Can other programs allocate memory?

@belimawr
Contributor Author

> This one seems more like system memory exhaustion though, how much memory is available on the host when it happens? Can other programs allocate memory?

IIRC increasing the VM's memory helped, but I didn't see memory usage reach 100% in any of my tests. The CPUs were at 100%. I didn't see the whole system crash, nor did the system become unresponsive.

Once I start working on those issues I'll properly collect system metrics, probably with Metricbeat.

@belimawr
Contributor Author

I'll start working on recovering when journalctl crashes; for this case simply restarting it should suffice and should not create any problems with offset tracking.
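For the offset-tracking part, the idea would be to remember the __CURSOR of the last published entry and pass it to journalctl's --after-cursor flag on every restart. A rough sketch, where lastCursor is a hypothetical value persisted by the input:

```go
package main

import "fmt"

// buildJournalctlArgs returns the journalctl arguments for a (re)start,
// resuming after the last published cursor when one is known.
func buildJournalctlArgs(lastCursor string) []string {
	args := []string{"--utc", "--output=json", "--follow"}
	if lastCursor == "" {
		return args // first start: no saved position yet
	}
	return append(args, "--after-cursor", lastCursor)
}

func main() {
	fmt.Println(buildJournalctlArgs("s=example-cursor;i=1;b=deadbeef"))
}
```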

@belimawr
Contributor Author

This is not fixed by #40558

@belimawr belimawr reopened this Sep 13, 2024
@belimawr
Contributor Author

I managed to reproduce this issue by just calling journalctl and the machine seems to have plenty of memory available:

[root@aws-test-journald ~]# journalctl --utc --output=json --follow --after-cursor "s=922eeded44734fd9b2fe892ceb4ec2df;i=2e21e0;b=5d68f7c3ebf040879a4ab3ccbb2d965b;m=2e2762073e;t=62206b43c435e;x=46e90e97a06137ad">journal4.ndjson
Failed to read journal: Cannot allocate memory
[root@aws-test-journald ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       1.0Gi       7.9Gi       120Mi       6.7Gi        14Gi
Swap:             0B          0B          0B
[root@aws-test-journald ~]# 

CPU usage is not low, with some cores hitting 100%, but not all at the same time.

I believe the best we can do here is to make sure the input does not get stuck, which is done by #40558.

I'll close this issue as solved.

djaglowski pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this issue Oct 28, 2024
#### Description
According to the community, there are bugs in systemd that could corrupt
the journal files or crash the log receiver:
systemd/systemd#24320
systemd/systemd#24150

We've seen some issues reported to the elastic/beats project:
elastic/beats#39352
elastic/beats#32782
elastic/beats#34077

Unfortunately, the otelcol is not immune to these issues. When the
journalctl process exits for any reason, log consumption from journald
just stops. We've experienced this on some machines that have a high
log volume. Currently we monitor the journalctl processes started by
otelcol and restart the otelcol when some of them are missing. IMO, the
journald receiver itself should monitor the journalctl process it
starts, and do its best to keep it alive.

In this PR, we try to restart the journalctl process when it exits
unexpectedly. As long as the journalctl cmd can be started (via
`Cmd.Start()`) successfully, the journald_input will always try to
restart the journalctl process if it exits.

The error reporting behaviour changes a bit in this PR. Before the PR,
`operator.Start` waits up to 1 sec to capture any immediate error
returned from journalctl. After the PR, the error won't be reported back
even if journalctl exits immediately after start; instead, the error
will be logged and the process will be restarted.

The fix is largely inspired by
elastic/beats#40558.

#### Testing
Add a simple bash script that prints a line every second, and load it
into systemd.

`log_every_second.sh`:
```bash
#!/bin/bash
while true; do
    echo "Log message: $(date)"
    sleep 1
done
```

`log.service`:
```
[Unit]
Description=Print logs to journald every second
After=network.target

[Service]
ExecStart=/usr/local/bin/log_every_second.sh
Restart=always
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Start the otelcol with the following config:
```yaml
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    logs:
      receivers: [journald]
      processors: []
      exporters: [debug]

receivers:
  journald:

exporters:
  debug:
    verbosity: basic
    sampling_initial: 1
    sampling_thereafter: 1
```

Kill the journalctl process and observe the otelcol's behaviour. The
journalctl process will be restarted after the backoff period (hardcoded
to 2 sec):
```bash
2024-10-06T14:32:33.755Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 1}
2024-10-06T14:32:34.709Z	error	journald/input.go:98	journalctl command exited	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input", "error": "signal: terminated"}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/journald.(*Input).run
	github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.108.0/operator/input/journald/input.go:98
2024-10-06T14:32:36.712Z	debug	journald/input.go:94	Starting the journalctl command	{"kind": "receiver", "name": "journald", "data_type": "logs", "operator_id": "journald_input", "operator_type": "journald_input"}
2024-10-06T14:32:36.756Z	info	LogsExporter	{"kind": "exporter", "data_type": "logs", "name": "debug", "resource logs": 1, "log records": 10}
```


---------

Signed-off-by: Mengnan Gong <namco1992@gmail.com>
jpbarto pushed a commit to jpbarto/opentelemetry-collector-contrib that referenced this issue Oct 29, 2024