Queries may be slow to run on Windows in the first ~5-10 minutes after launcher startup #1784

Open
RebeccaMahany opened this issue Jul 17, 2024 · 4 comments

@RebeccaMahany
Contributor

RebeccaMahany commented Jul 17, 2024

The automated tests are currently flaky on Windows, specifically because launcher does not receive and process a live query within 5 minutes of osquery starting at launcher startup. I've seen this pretty consistently in the tests, and I have now also seen a report of what seems to be a similar issue, so I think this is worth investigating.

@directionless
Contributor

Maybe related to #1442

@RebeccaMahany
Contributor Author

Ooh, interesting. I also noticed that this last report coincided with kolide_wmi logs, and I know we've flagged those as potentially not performant before.

RebeccaMahany self-assigned this Aug 12, 2024
@RebeccaMahany
Contributor Author

RebeccaMahany commented Aug 12, 2024

Findings thus far:

I added logging for slow-running queries in #1823. What I found aligned with what seph saw when looking at Honeycomb -- that it doesn't seem to be an issue with the queries themselves. Even very simple queries could be very slow. Also, typically the wall_time was high, but the system/user time was negligible.

We theorized that the test machines themselves might be struggling, so I increased the size of the test VMs. However, this had no effect -- the test VMs still took an incredibly long time to move through their distributed queue.

I looked at the related traces this morning, for code.function:github.com/kolide/launcher/pkg/osquery.(*Extension).GetQueries and code.function:github.com/kolide/launcher/pkg/osquery.(*Extension).WriteResults. Whenever these traces took a while, the time was spent almost entirely in the portion where they communicate with K2. However, that still wasn't consistently slow enough to explain the overall issue.

@RebeccaMahany
Contributor Author

I tested whether distributed query results are slow to be processed by K2, since we're also seeing an issue where the live queries created by the tests never have their results processed by K2. The distributed query results may be slow to be processed, but they do get processed, and we aren't seeing queue delays on the K2 side. This doesn't seem to be the explanation we're looking for.
