Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Auditbeat] Flaky test_metricsets.Test.test_metricset_package #10633

Closed
ruflin opened this issue Feb 7, 2019 · 5 comments
Closed

[Auditbeat] Flaky test_metricsets.Test.test_metricset_package #10633

ruflin opened this issue Feb 7, 2019 · 5 comments
Assignees
Labels
Auditbeat flaky-test Unstable or unreliable test cases.

Comments

@ruflin
Copy link
Member

ruflin commented Feb 7, 2019

Flaky Test

Saw this failing once in master. Artifacts are attached

test_metricsets.Test.test_metricset_package.zip

@ruflin ruflin added Auditbeat flaky-test Unstable or unreliable test cases. labels Feb 7, 2019
ruflin added a commit to ruflin/beats that referenced this issue Feb 7, 2019
@cwurm cwurm self-assigned this Feb 7, 2019
@cwurm
Copy link
Contributor

cwurm commented Feb 7, 2019

The Auditbeat log file (auditbeat.log) suddenly cuts off. The system test log shows the exit code was -1, implying SIGHUP. Beats only handles SIGINT and SIGTERM:

signal.Notify(sigc, syscall.SIGINT, syscall.SIGTERM)

I tried sending SIGHUP locally - the log file looks exactly the same.

Question now is, why is it getting SIGHUP?

In any case, maybe we should treat SIGHUP the same as SIGINT and SIGTERM so Beats shuts down gracefully?

@cwurm
Copy link
Contributor

cwurm commented Feb 8, 2019

Forget what I said above. The return code is 1, not -1. I think I know what is happening though, and it does not look pretty.

I can reproduce the error locally on CentOS 7. It currently happens every time for me, I'm surprised it didn't fail the PR build.

What seems to be happening is that librpm installs signal traps for various UNIX signals, including SIGINT and SIGTERM (here). This overrides the existing ones in Beats. When Auditbeat is terminated (the system test sends SIGTERM), librpm cleans up its open RPM transaction and calls exit(1) (that's why the exit code is 1, here). The process exits, Beats never gets a chance to run its shutdown code.

This has been a problem for other applications before, e.g. there is this bug report from gdb. Following that, librpm added a way to disable its signal traps altogether (here). Unfortunately, that is not yet available in the default librpm version on CentOS 7 (or 6, for that matter).

What we can do (and what gdb ended up doing) is disable the signal traps after they are set. I have this patch that does this and eliminates the test failures on my local system.

It's not ideal, for two reasons I think:

  1. While we would disable the signal traps right after they are set, there is a very short moment where they are set. We're only talking maybe a few milliseconds, and it's only relevant if one of those signals is received (i.e. Beats is shutting down).
  2. Disabling the clean up code of a library is not a good idea in general. There is a reason librpm has this. We can do our own clean up (and we do, deferring rpmdbFreeIterator()), but it's not ideal and potentially error-prone. I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

@tsg @andrewkroh - what do you think about this?

@andrewkroh
Copy link
Member

Would it possible to use rpmsqSetInterruptSafety if it's available then fall-back to the manually removing the signal handler?

@tsg
Copy link
Contributor

tsg commented Feb 11, 2019

I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

That sounds scary, but if that would be a possibility, it could also happen if you killed -9 the rpm process while running, right? Given how many users rpm has, I'd say the risk of it is overly small. So I'm +1 on your patch, thanks for chasing this down!

@cwurm
Copy link
Contributor

cwurm commented Feb 12, 2019

I'm not sure at the moment what would happen if we ever failed to clean up. Would we prevent the package manager from running?

That sounds scary, but if that would be a possibility, it could also happen if you killed -9 the rpm process while running, right? Given how many users rpm has, I'd say the risk of it is overly small.

I've tried to simulate this by inserting a long time.Sleep() after acquiring all RPM data structures, but before freeing any of them up. Even in that time, yum was happy installing and removing packages. So I think we're ok. PR coming.

cwurm pushed a commit that referenced this issue Feb 18, 2019
Disable librpm signal handlers.

Resolves #10633.
cwurm pushed a commit to cwurm/beats that referenced this issue Feb 18, 2019
Disable librpm signal handlers.

Resolves elastic#10633.

(cherry picked from commit 6fbcbff)
cwurm pushed a commit to cwurm/beats that referenced this issue Feb 18, 2019
Disable librpm signal handlers.

Resolves elastic#10633.

(cherry picked from commit 6fbcbff)
cwurm pushed a commit to cwurm/beats that referenced this issue Feb 18, 2019
Disable librpm signal handlers.

Resolves elastic#10633.

(cherry picked from commit 6fbcbff)
cwurm pushed a commit that referenced this issue Feb 19, 2019
… handlers (#10797)

Cherry-pick of PR #10694 to 7.x branch. Original message: 

Disable librpm signal handlers.

Resolves #10633.
cwurm pushed a commit that referenced this issue Feb 19, 2019
… handlers (#10798)

* [Auditbeat] Package: Disable librpm signal handlers (#10694)

Disable librpm signal handlers.

Resolves #10633.

(cherry picked from commit 6fbcbff)
cwurm pushed a commit that referenced this issue Feb 19, 2019
… handlers (#10799)

Cherry-pick of PR #10694 to 6.7 branch. Original message: 

Disable librpm signal handlers.

Resolves #10633.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Auditbeat flaky-test Unstable or unreliable test cases.
Projects
None yet
Development

No branches or pull requests

4 participants