Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/osquery/runtime improvements, largely around improving test flakiness #1798

Merged
merged 9 commits into from
Jul 25, 2024

Conversation

RebeccaMahany
Copy link
Contributor

@RebeccaMahany RebeccaMahany commented Jul 25, 2024

I've noticed that the runtime tests have been particularly flaky on Windows lately and it's been holding up CI significantly because we often have to wait for the 10-min test timeout to hit before test failure. See e.g. https://github.com/kolide/launcher/actions/runs/10094889355/job/27918355827.

Issues found

  • runner.Shutdown was getting stuck; tests would panic after 10 minutes due to timeout. (This really holds up CI, plus the panic means we don't really get any useful information about what part of the test got stuck.)
  • waitHealthy reported that the osquery instance was healthy before we'd actually finished launching the instance. I believe that this was what caused runner.Shutdown to get stuck.
  • I saw in the logs that sometimes the PID file wasn't getting removed. Not really a big deal, but I fixed it anyway.

Fixes in this PR

Test improvements

  • Tests fail faster: add a one-minute timeout to calling runner.Shutdown, so that we can quickly fail and get useful troubleshooting information when Shutdown is stuck -- instead of waiting 10 minutes to hit test timeout + panic, which gives us no useful information about the issue
  • Confirm osquery instance is fully set up: instead of only validating that runner.Healthy() returns no error, also confirm that the launch function has finished all critical work, and add a little bit of a sleep buffer before proceeding
  • Test failures include more useful troubleshooting information: include error message from calling runner.Healthy when that fails; include runner logs in failure messages

Code improvements

  • Retry PID file removal
  • A log message was erroneously tagged as an errgroup exiting when it was not -- updated appropriately
  • Mark instance as connected after all extension manager servers are set up, not just kolide_grpc

@RebeccaMahany RebeccaMahany force-pushed the becca/osq-flags-changed-test branch from 12497d4 to b417a05 Compare July 25, 2024 15:39
@RebeccaMahany RebeccaMahany force-pushed the becca/osq-flags-changed-test branch from b417a05 to d3e05db Compare July 25, 2024 15:48
@RebeccaMahany RebeccaMahany changed the title pkg/osquery/runtime TestFlagsChanged pkg/osquery/runtime tests Jul 25, 2024
@RebeccaMahany RebeccaMahany changed the title pkg/osquery/runtime tests pkg/osquery/runtime test improvements Jul 25, 2024
@RebeccaMahany RebeccaMahany changed the title pkg/osquery/runtime test improvements pkg/osquery/runtime improvements, largely around improving test flakiness Jul 25, 2024
@RebeccaMahany RebeccaMahany marked this pull request as ready for review July 25, 2024 17:29
Copy link
Contributor

@James-Pickett James-Pickett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@RebeccaMahany RebeccaMahany added this pull request to the merge queue Jul 25, 2024
Merged via the queue into kolide:main with commit 38af721 Jul 25, 2024
29 checks passed
@RebeccaMahany RebeccaMahany deleted the becca/osq-flags-changed-test branch July 25, 2024 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants