Updating `osquery-perf` to enable load testing with Spot Instances #14782

jahzielv · 2023-10-27T18:18:30Z

Problem

We currently have a load testing process for features that might affect large-scale users. This process has been very effective at finding bugs and is a critical part of ensuring quality releases, but it has some downsides. One of them is cost: our load testing cloud environment can be quite expensive to run, as it is very close to a production deployment with 100k+ hosts.

Potential solutions

If we switch to using EC2 Spot Instances, we could save quite a bit of money. Spot Instances are EC2 instances that use leftover resources in AWS' system. This means that they are much cheaper than dedicated instances, but they can be brought offline by AWS at any moment. Currently, the osquery-perf process is stateful, but doesn't save its state anywhere. So when an instance crashes (or, with Spot Instances, is brought offline by AWS), we'd lose that state. If we can update osquery-perf to save and reuse state, we'd be able to deploy onto Spot Instances.

In general, we'd need:

- A way to generate large amounts of fake host configurations for testing
- A place to store these configs (probably S3)
- Updates to osquery-perf so it reads in configs and uses them to emulate those hosts
- Updates to the loadtesting Terraform, and potentially other AWS/infrastructure changes as well. We'd need @rfairburn's expertise on this, I'm sure.
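
For illustration, the first three items could be sketched roughly as below. This is not how osquery-perf works today: the `FakeHost` fields, the template names, and the helper functions are all hypothetical, a local JSON file stands in for S3, and Python is used only for brevity.

```python
import json
import uuid
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class FakeHost:
    # Hypothetical shape of one simulated host; the real agent state
    # would carry more fields than this.
    uuid: str
    hostname: str
    os_template: str
    node_key: str = ""  # would be filled in after the host enrolls


def generate_hosts(count, templates):
    """Build `count` hosts with deterministic UUIDs so reruns produce the same fleet."""
    hosts = []
    for i in range(count):
        hostname = f"perf-host-{i:06d}"
        hosts.append(
            FakeHost(
                uuid=str(uuid.uuid5(uuid.NAMESPACE_DNS, hostname)),
                hostname=hostname,
                os_template=templates[i % len(templates)],
            )
        )
    return hosts


def save_hosts(hosts, path):
    """Write to a temp file, then rename, so an interruption mid-write keeps the old file."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps([asdict(h) for h in hosts]))
    tmp.replace(path)


def load_hosts(path):
    """Read the saved configs back so a replacement instance can resume the same hosts."""
    return [FakeHost(**raw) for raw in json.loads(Path(path).read_text())]
```

Deterministic UUIDs mean a restarted instance would present the same host identities instead of enrolling a fresh set, which is the property that matters if instances can be reclaimed at any time.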

noahtalerman · 2023-10-27T20:26:00Z

@sharon-fdm can you please bring eng initiated stories to feature fest instead of adding them to the drafting board? We want to weigh this with other stories before we prioritize it.

sharon-fdm · 2023-10-27T20:27:03Z

My bad @noahtalerman.
Will do.

noahtalerman · 2023-11-02T19:09:22Z

Feature fest: What's the difference in cost? Let's figure that out first.

noahtalerman · 2023-11-02T21:32:10Z

@jahzielv heads up, this didn't get prioritized during feature fest.

jahzielv · 2023-11-03T14:29:37Z

@noahtalerman gotcha! For next steps, should I try to get a detailed estimate on cost savings?

noahtalerman · 2023-11-07T20:43:26Z

> should I try to get a detailed estimate on cost savings?

@jahzielv I don't think we need to jump on this now, but maybe this is a candidate for something to do during on-call...

@lukeheath @sharon-fdm what do you think?

lukeheath · 2023-11-07T21:07:24Z

Agreed this looks like a great on-call activity.

lukeheath · 2024-12-10T22:27:39Z

@rfairburn Would you please review and let me know if you think there's opportunity to reduce load test costs by using spot instances? Thanks!

rfairburn · 2024-12-10T22:38:52Z

If we had a way to persist the state that containers load, then fargate-spot for the osquery-perf containers would be ideal.
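
A minimal sketch of what persisting state on interruption might look like, assuming the container receives SIGTERM shortly before a spot task is reclaimed. `AgentState`, its `enrolled` field, and the file path are hypothetical stand-ins, not osquery-perf's actual state; in practice the snapshot would go to durable storage rather than local disk.

```python
import json
import signal
from pathlib import Path

STATE_PATH = Path("/tmp/osquery-perf-state.json")  # hypothetical location


class AgentState:
    """Hypothetical in-memory state: host UUID -> node key."""

    def __init__(self, enrolled=None):
        self.enrolled = dict(enrolled or {})


def persist(state, path=STATE_PATH):
    # Write to a temp file and rename so a partial write never replaces good state.
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({"enrolled": state.enrolled}))
    tmp.replace(path)


def load_state(path=STATE_PATH):
    # Start empty if no snapshot exists yet (first run).
    path = Path(path)
    if not path.exists():
        return AgentState()
    return AgentState(json.loads(path.read_text())["enrolled"])


def make_sigterm_handler(state, path=STATE_PATH):
    def handler(signum, frame):
        # Save what we have, then exit cleanly before the task is killed.
        persist(state, path)
        raise SystemExit(0)

    return handler


def install(state, path=STATE_PATH):
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, make_sigterm_handler(state, path))
```

On startup, a replacement container would call `load_state()` and then `install(state)`, so the next interruption saves whatever progress has been made since.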

rfairburn · 2024-12-10T22:40:38Z

While the ECS costs in any instance are not the biggest cost component, we could save up to 70% on the osquery-perf containers. I will look at the billing and see how much that would be.

rfairburn · 2024-12-12T23:43:14Z

@noahtalerman @lukeheath in the month of November, we spent roughly $10k on load testing. Of that, ECS/Fargate was around 11% of the total costs (between $1,000 and $1,100). My guess is that at best we save around 40-50% of that by switching the loadtesting containers to spot instances, assuming we want to keep the Fleet containers on-demand. Perhaps even a little less, depending on what the current demand puts spot instance costs at.

So my guess is that it would be roughly a net $400/month savings or so.
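
The arithmetic behind that estimate, using only the figures quoted above (the function name is just for illustration):

```python
def spot_savings_range(ecs_low, ecs_high, discount_low, discount_high):
    """Return the (low, high) monthly savings for a range of ECS spend and discounts."""
    return ecs_low * discount_low, ecs_high * discount_high


TOTAL_MONTHLY = 10_000  # roughly $10k on load testing in November

# ECS/Fargate was between $1,000 and $1,100; best case saves 40-50% of that.
low, high = spot_savings_range(1_000, 1_100, 0.40, 0.50)

print(f"Savings: ${low:.0f} to ${high:.0f} per month")
print(f"Share of total spend: {low / TOTAL_MONTHLY:.1%} to {high / TOTAL_MONTHLY:.1%}")
```

That puts the best case at about $400 to $550 per month, which is around 4% to 5.5% of the total load testing bill.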

Right now the amount of logging we do in CloudWatch accounts for around 50% of the costs. My guess is that much of that comes from the osquery-perf containers. Not sure if that can really be helped, assuming we want to have debug data when it comes time to troubleshoot.

Optionally we could default to routing osquery-perf logs to /dev/null until we decide we need them (and then re-apply with a config that has better logging). Not sure the extra time for the Dev/QA engineer to wait for a re-apply would be justified, however. Any loss of productivity on an already-slow process is probably far more costly overall than the logging costs in AWS.

There could be some kind of compromise in between we could look at.

lukeheath · 2024-12-13T20:56:36Z

@rfairburn Thanks for digging into it. Sounds like the cost savings aren't enough to pursue spot instances right now, but reducing logs might be worth a look.

@jahzielv Do we typically need the osquery-perf logs for load testing?

jahzielv · 2024-12-13T21:17:13Z

> Do we typically need the osquery-perf logs for load testing?

@lukeheath I believe so; they contain helpful information about osquery-perf's behavior (simulated failure rates for example). However, they do log quite often, so maybe we can make them less chatty/increase the time between logs?

That said, I haven't done a load test in a hot minute, so looping in some other folks who might know better from more recent experience: @mna or @getvictor, have y'all used those logs from osquery-perf much?
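
One way "less chatty" could work is to throttle repeated lines while always letting warnings and errors through. The sketch below uses Python's standard `logging` module purely to illustrate the idea; osquery-perf is a Go program, and `ThrottleFilter` is a hypothetical name, not something that exists in the codebase.

```python
import logging
import time


class ThrottleFilter(logging.Filter):
    """Pass at most one record per `interval` seconds for each (logger, message) pair."""

    def __init__(self, interval, clock=time.monotonic):
        super().__init__()
        self.interval = interval
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._last_emitted = {}

    def filter(self, record):
        # Never drop warnings or errors; those are what debugging needs most.
        if record.levelno >= logging.WARNING:
            return True
        key = (record.name, record.msg)
        now = self.clock()
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.interval:
            return False
        self._last_emitted[key] = now
        return True
```

Attaching it with `handler.addFilter(ThrottleFilter(60))` would cap each periodic stats line at one per minute, trading some granularity for lower log volume.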

getvictor · 2024-12-13T21:43:59Z

> > Do we typically need the osquery-perf logs for load testing?
>
> @lukeheath I believe so; they contain helpful information about osquery-perf's behavior (simulated failure rates for example). However, they do log quite often, so maybe we can make them less chatty/increase the time between logs?
>
> However, I haven't done a load test in a hot minute, so looping some other folks that might know better from more recent experience: @mna or @getvictor , have y'all used those logs from osquery-perf much?

Generally, we only need osquery-perf logs for debug. For example, recently I found an issue with osquery-perf crashing with the help of logs: #24381

lukeheath · 2024-12-13T21:53:04Z

Got it. Let's keep the logging in place for now since the spin up and down takes so long, but we're working on finding faster ways to handle that.

fleet-release · 2024-12-13T21:53:07Z

Testing in the cloud, high,
Spot Instances save the cost,
Fleet flies, bugs are lost.

jahzielv added the ~engineering-initiated label Oct 27, 2023

sharon-fdm added the story, #g-endpoint-ops, and :product labels Oct 27, 2023

noahtalerman added the ~feature fest label and removed the :product label Oct 27, 2023

noahtalerman removed the ~feature fest label Nov 2, 2023

lukeheath changed the title ~~Updating osquery-perf to enable loadtesting with Spot Instances~~ Updating osquery-perf to enable load testing with Spot Instances Dec 10, 2024

lukeheath closed this as completed Dec 13, 2024
