Updating osquery-perf to enable load testing with Spot Instances #14782

Closed
jahzielv opened this issue Oct 27, 2023 · 16 comments
Labels
  • ~engineering-initiated: Engineering-initiated story, such as a bug, refactor, or contributor experience improvement.
  • #g-endpoint-ops: Endpoint ops product group
  • story: A user story defining an entire feature

Comments

@jahzielv
Contributor

jahzielv commented Oct 27, 2023

Problem

We currently have a load testing process for features that might have an impact on large-scale users. This process has been very effective at finding bugs and is a critical part of ensuring quality releases, but it has some downsides. One of them is cost: our load testing cloud environment can be quite expensive to run, as it is very close to a production deployment with 100k+ hosts.

Potential solutions

If we switch to using EC2 Spot Instances, we could save quite a bit of money. Spot Instances are EC2 instances that use leftover resources in AWS' system. This means that they are much cheaper than dedicated instances, but they can be brought offline by AWS at any moment. Currently, the osquery-perf process is stateful, but doesn't save its state anywhere. So when an instance crashes (or, with Spot Instances, is brought offline by AWS), we'd lose that state. If we can update osquery-perf to save and reuse state, we'd be able to deploy onto Spot Instances.

In general, we'd need:

  • A way to generate large amounts of fake host configurations for testing
  • A place to store these configs (probably S3)
  • To update osquery-perf to read in configs and use them to emulate those hosts
  • To update the loadtesting Terraform, and potentially make other AWS/infrastructure changes as well (we'd need @rfairburn's expertise on this, I'm sure)
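
The first three items in the list above could fit together roughly as follows. This is a minimal sketch, not osquery-perf's actual code (osquery-perf is written in Go; Python is used here only to keep the example short). The `FakeHost` fields, the `generate_hosts`, `save_hosts`, and `load_hosts` helpers, and the JSON layout are all hypothetical, and a local file stands in for the S3 bucket suggested above.

```python
import json
import random
import uuid
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class FakeHost:
    # Hypothetical fields; the real per-host state in osquery-perf may differ.
    uuid: str
    hostname: str
    platform: str
    node_key: str = ""  # assumed to be filled in after enrollment


def generate_hosts(count: int, seed: int = 0) -> list[FakeHost]:
    """Generate `count` fake hosts deterministically from `seed`."""
    rng = random.Random(seed)
    platforms = ["darwin", "windows", "ubuntu"]
    hosts = []
    for i in range(count):
        host_uuid = str(uuid.UUID(int=rng.getrandbits(128), version=4))
        hosts.append(
            FakeHost(
                uuid=host_uuid,
                hostname=f"perf-host-{i}",
                platform=rng.choice(platforms),
            )
        )
    return hosts


def save_hosts(hosts: list[FakeHost], path: Path) -> None:
    """Write hosts as JSON via a temp file so a crash cannot leave a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps([asdict(h) for h in hosts]))
    tmp.replace(path)


def load_hosts(path: Path) -> list[FakeHost]:
    """Read back hosts previously written by save_hosts."""
    return [FakeHost(**raw) for raw in json.loads(path.read_text())]
```

Because generation is seeded, a replacement instance could in principle regenerate the same host identities instead of downloading them; only state that changes at runtime (such as the assumed `node_key`) would need to be stored.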
@jahzielv jahzielv added the ~engineering-initiated Engineering-initiated story, such as a bug, refactor, or contributor experience improvement. label Oct 27, 2023
@sharon-fdm sharon-fdm added story A user story defining an entire feature #g-endpoint-ops Endpoint ops product group :product Product Design department (shows up on 🦢 Drafting board) labels Oct 27, 2023
@noahtalerman noahtalerman added ~feature fest Will be reviewed at next Feature Fest and removed :product Product Design department (shows up on 🦢 Drafting board) labels Oct 27, 2023
@noahtalerman
Member

@sharon-fdm can you please bring eng initiated stories to feature fest instead of adding them to the drafting board? We want to weigh this with other stories before we prioritize it.

@sharon-fdm
Collaborator

My bad @noahtalerman.
Will do.

@noahtalerman
Member

Feature fest: What's the difference in cost? Let's figure that out first.

@noahtalerman noahtalerman removed the ~feature fest Will be reviewed at next Feature Fest label Nov 2, 2023
@noahtalerman
Member

@jahzielv heads up, this didn't get prioritized during feature fest.

@jahzielv
Contributor Author

jahzielv commented Nov 3, 2023

@noahtalerman gotcha! For next steps, should I try to get a detailed estimate on cost savings?

@noahtalerman
Member

should I try to get a detailed estimate on cost savings?

@jahzielv I don't think we need to jump on this now, but maybe this is a candidate for something to do during on-call...

@lukeheath @sharon-fdm what do you think?

@lukeheath
Member

Agreed this looks like a great on-call activity.

@lukeheath lukeheath changed the title from "Updating osquery-perf to enable loadtesting with Spot Instances" to "Updating osquery-perf to enable load testing with Spot Instances" Dec 10, 2024
@lukeheath
Member

@rfairburn Would you please review and let me know if you think there's opportunity to reduce load test costs by using spot instances? Thanks!

@rfairburn
Contributor

If we had a way to persist the state that containers load, then fargate-spot for the osquery-perf containers would be ideal.
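
One way to picture persisting state across a Spot interruption is sketched below. It relies on AWS's documented behavior that a Fargate Spot task receives a SIGTERM as part of a two-minute interruption warning. Everything else is an assumption for illustration: the `STATE_PATH` location, the shape of the state dict, and the `run` loop are hypothetical, and osquery-perf itself is written in Go rather than Python.

```python
import json
import signal
import threading
from pathlib import Path

# Hypothetical location; in practice this might be synced to S3 or a mounted volume.
STATE_PATH = Path("osquery-perf-state.json")

stop_requested = threading.Event()


def save_state(state: dict, path: Path = STATE_PATH) -> None:
    """Write state via a temp file so an interruption cannot leave a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def load_state(path: Path = STATE_PATH) -> dict:
    """Return previously saved state, or an empty state on first start."""
    if path.exists():
        return json.loads(path.read_text())
    return {"enrolled_hosts": {}}


def handle_sigterm(signum, frame) -> None:
    # Only set a flag in the handler; the main loop performs the save.
    stop_requested.set()


def run(state: dict, path: Path = STATE_PATH, tick_seconds: float = 1.0) -> None:
    """Simulate hosts until a stop is requested, then persist state once."""
    while not stop_requested.is_set():
        # ... simulated host check-ins would update `state` here ...
        stop_requested.wait(tick_seconds)
    save_state(state, path)


def main() -> None:
    # Register the handler, resume from any saved state, and run until SIGTERM.
    signal.signal(signal.SIGTERM, handle_sigterm)
    run(load_state())
```

On restart, `load_state` would let a replacement container resume with the same hosts instead of enrolling new ones.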

@rfairburn
Contributor

While ECS is not the biggest cost component in any case, we could save up to 70% on the osquery-perf containers. I will look at the billing and see how much that would be.

@rfairburn
Contributor

@noahtalerman @lukeheath in the month of November, we spent roughly $10k on load testing. Of that, ECS/Fargate was around 11% of the total costs (between $1,000 and $1,100). My guess is that at best we save around 40-50% of that by switching the loadtesting containers to spot instances, assuming we want to keep the Fleet containers on-demand. Perhaps even a little less, depending on what the current demand puts spot instance costs at.

So my guess is that it would be roughly a net $400/month savings or so.
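
The arithmetic behind that estimate can be written out using only the figures quoted above (roughly $10k total, $1,000 to $1,100 for ECS/Fargate, 40-50% savings on that portion). The function name is just for illustration.

```python
def spot_savings_range(ecs_low, ecs_high, pct_low, pct_high):
    """Return (low, high) monthly savings in dollars for the given cost and percent ranges."""
    return ecs_low * pct_low / 100, ecs_high * pct_high / 100


total_monthly_cost = 10_000  # "roughly $10k" for November
low, high = spot_savings_range(1000, 1100, 40, 50)

print(f"Estimated savings: ${low:.0f} to ${high:.0f} per month")
print(f"Share of total spend: {low / total_monthly_cost:.1%} to {high / total_monthly_cost:.1%}")
```

This gives a range of $400 to $550 per month, or about 4% to 5.5% of total load testing spend, so the $400 figure sits at the conservative end.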

Right now the amount of logging we do in CloudWatch accounts for around 50% of the costs. My guess is that much of that is coming from the osquery-perf containers. Not sure if that can really be helped, assuming we want to have debug data when it comes time to troubleshoot.

Optionally we could default to routing osquery-perf logs to /dev/null until we decide we need them (and then re-apply with a config that has better logging). Not sure the extra time for the Dev/QA engineer to wait for a re-apply would be justified, however. Any loss of productivity on an already-slow process is probably far more costly overall than the logging costs in AWS.

There could be some kind of compromise in between we could look at.

@lukeheath
Member

@rfairburn Thanks for digging into it. Sounds like the cost savings aren't enough to pursue spot instances right now. But maybe reducing logs.

@jahzielv Do we typically need the osquery-perf logs for load testing?

@jahzielv
Contributor Author

Do we typically need the osquery-perf logs for load testing?

@lukeheath I believe so; they contain helpful information about osquery-perf's behavior (simulated failure rates for example). However, they do log quite often, so maybe we can make them less chatty/increase the time between logs?
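
As an illustration of "increasing the time between logs", the sketch below aggregates events in memory and emits one summary line per interval instead of one line per event. It is not how osquery-perf logs today (and osquery-perf is written in Go); the `ThrottledStats` class, its parameters, and the event names are hypothetical.

```python
import time
from collections import Counter


class ThrottledStats:
    """Count events and emit at most one summary line per interval."""

    def __init__(self, interval_seconds, clock=time.monotonic, emit=print):
        self.interval = interval_seconds
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.emit = emit    # injectable sink; defaults to stdout
        self.counts = Counter()
        self.last_emit = clock()

    def record(self, event, n=1):
        """Record an event; flush a summary if the interval has elapsed."""
        self.counts[event] += n
        now = self.clock()
        if now - self.last_emit >= self.interval:
            self.flush(now)

    def flush(self, now=None):
        """Emit the accumulated counts as one line and reset them."""
        if self.counts:
            summary = " ".join(f"{k}={v}" for k, v in sorted(self.counts.items()))
            self.emit(summary)
            self.counts.clear()
        self.last_emit = self.clock() if now is None else now
```

With a 60-second interval, thousands of per-host events would collapse into one line per minute per container, while still preserving totals such as simulated failure counts.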

However, I haven't done a load test in a hot minute, so looping in some other folks that might know better from more recent experience: @mna or @getvictor, have y'all used those logs from osquery-perf much?

@getvictor
Member

Do we typically need the osquery-perf logs for load testing?

@lukeheath I believe so; they contain helpful information about osquery-perf's behavior (simulated failure rates for example). However, they do log quite often, so maybe we can make them less chatty/increase the time between logs?

However, I haven't done a load test in a hot minute, so looping in some other folks that might know better from more recent experience: @mna or @getvictor, have y'all used those logs from osquery-perf much?

Generally, we only need osquery-perf logs for debugging. For example, I recently found an issue with osquery-perf crashing with the help of logs: #24381

@lukeheath
Member

Got it. Let's keep it in place for now since the spin up and down takes so long, but we're working on finding faster ways to handle that.

@fleet-release
Contributor

Testing in the cloud, high,
Spot Instances save the cost,
Fleet flies, bugs are lost.

Development

No branches or pull requests

7 participants