Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Auditbeat] Recover from errors in audit monitoring routine #22673

Merged
merged 2 commits into from
Nov 24, 2020

Conversation

adriansr
Copy link
Contributor

@adriansr adriansr commented Nov 19, 2020

The auditd module spawns a monitoring goroutine that fetches auditd status every 15s. Due to this routine using a single audit client, if an update fails (because a netlink message is late or other causes), the audit client can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition leads to a lot of [audit_send_repl] (2.6.x) / [audit_send_reply] (3.x+) kernel threads being created. (Reproduced in 2.6.32, no other versions tested).

The following error will appear every 15s:

ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)

ps -ef will show a lot of audit_send_repl threads:

[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]

This patch updates the error-handling logic to create a new audit client when a status update fails, allowing to recover and preventing the proliferation of audit_send_repl kernel threads.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

It's easy to reproduce this issue by modifying the code at

if ms.kernelLost.enabled {
client, err := libaudit.NewAuditClient(nil)
if err != nil {
reporter.Error(err)
ms.log.Errorw("Failure creating audit monitoring client", "error", err)
}
go func() {
defer client.Close()
timer := time.NewTicker(lostEventsUpdateInterval)
defer timer.Stop()
for {
select {
case <-reporter.Done():
return
case <-timer.C:
if status, err := client.GetStatus(); err == nil {
ms.updateKernelLostMetric(status.Lost)
} else {
ms.log.Error("get status request failed:", err)
}
}
}
}()
}

  • Set a lower update interval (milliseconds)
  • call client.GetStatusAsync(false) outside of the polling loop.

Similar can be used to validate this fix. Ideally sending an async getstatus every few iterations of the loop.

The auditd module spawns a monitoring goroutine that fetches auditd status
every 15s. Due to this routine using a single audit client, if an update
fails (because a netlink message is late or other causes), the audit client
can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition
leads to a lot of `[audit_send_repl]` (2.6.x) / `[audit_send_reply]` (3.x+)
kernel threads being created.

```
ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)
```

```
$ ps -ef
[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]
```

This patch updates the error-handling logic to create a new audit client
when a status update fails, allowing to recover and preventing the
proliferation of `audit_send_repl` kernel threads.
@adriansr adriansr added bug review needs_backport PR is waiting to be backported to other branches. Team:Security-External Integrations labels Nov 19, 2020
@adriansr adriansr requested a review from a team as a code owner November 19, 2020 13:00
@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 19, 2020
@adriansr adriansr requested review from a team and removed request for a team November 19, 2020 13:14
@elasticmachine
Copy link
Collaborator

elasticmachine commented Nov 19, 2020

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #22673 updated]

  • Start Time: 2020-11-19T15:58:29.778+0000

  • Duration: 23 min 50 sec

Test stats 🧪

Test Results
Failed 0
Passed 455
Skipped 61
Total 516

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 455
Skipped 61
Total 516

}
client, err = libaudit.NewAuditClient(nil)
if err != nil {
ms.log.Errorw("Failure creating audit monitoring client", "error", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this one fails then I think client == nil which will cause problems on subsequent iterations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh! Good catch

@adriansr adriansr merged commit ca9550f into elastic:master Nov 24, 2020
@adriansr adriansr deleted the ab_audit_client_recover branch November 24, 2020 09:23
@adriansr adriansr added v7.11.0 and removed needs_backport PR is waiting to be backported to other branches. labels Nov 24, 2020
adriansr added a commit to adriansr/beats that referenced this pull request Nov 24, 2020
…22673)

The auditd module spawns a monitoring goroutine that fetches auditd status
every 15s. Due to this routine using a single audit client, if an update
fails (because a netlink message is late or other causes), the audit client
can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition
leads to a lot of `[audit_send_repl]` (2.6.x) / `[audit_send_reply]` (3.x+)
kernel threads being created.

```
ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)
```

```
$ ps -ef
[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]
```

This patch updates the error-handling logic to create a new audit client
when a status update fails, allowing to recover and preventing the
proliferation of `audit_send_repl` kernel threads.

(cherry picked from commit ca9550f)
adriansr added a commit to adriansr/beats that referenced this pull request Nov 24, 2020
…22673)

The auditd module spawns a monitoring goroutine that fetches auditd status
every 15s. Due to this routine using a single audit client, if an update
fails (because a netlink message is late or other causes), the audit client
can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition
leads to a lot of `[audit_send_repl]` (2.6.x) / `[audit_send_reply]` (3.x+)
kernel threads being created.

```
ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)
```

```
$ ps -ef
[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]
```

This patch updates the error-handling logic to create a new audit client
when a status update fails, allowing to recover and preventing the
proliferation of `audit_send_repl` kernel threads.

(cherry picked from commit ca9550f)
adriansr added a commit that referenced this pull request Nov 24, 2020
…22725)

The auditd module spawns a monitoring goroutine that fetches auditd status
every 15s. Due to this routine using a single audit client, if an update
fails (because a netlink message is late or other causes), the audit client
can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition
leads to a lot of `[audit_send_repl]` (2.6.x) / `[audit_send_reply]` (3.x+)
kernel threads being created.

```
ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)
```

```
$ ps -ef
[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]
```

This patch updates the error-handling logic to create a new audit client
when a status update fails, allowing to recover and preventing the
proliferation of `audit_send_repl` kernel threads.

(cherry picked from commit ca9550f)
adriansr added a commit that referenced this pull request Nov 24, 2020
…22724)

The auditd module spawns a monitoring goroutine that fetches auditd status
every 15s. Due to this routine using a single audit client, if an update
fails (because a netlink message is late or other causes), the audit client
can get out of sync with the stream, failing in all subsequent requests.

For reasons that aren't 100% clear to me at the moment, this error condition
leads to a lot of `[audit_send_repl]` (2.6.x) / `[audit_send_reply]` (3.x+)
kernel threads being created.

```
ERROR [auditd] auditd/audit_linux.go:183 get status request failed:failed to get audit status ack: unexpected sequence number for reply (expected 6286 but got 6285)
```

```
$ ps -ef
[...]
root     27790     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27791     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27792     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27793     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27794     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27795     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27796     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27797     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27798     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27799     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27800     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27801     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27802     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27803     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27804     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27805     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27806     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27807     2  0 12:52 ?        00:00:00 [audit_send_repl]
root     27808     2  0 12:52 ?        00:00:00 [audit_send_repl]
[...]
```

This patch updates the error-handling logic to create a new audit client
when a status update fails, allowing to recover and preventing the
proliferation of `audit_send_repl` kernel threads.

(cherry picked from commit ca9550f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants