Updating osquery-perf to enable load testing with Spot Instances #14782
Comments
@sharon-fdm can you please bring eng-initiated stories to feature fest instead of adding them to the drafting board? We want to weigh this against other stories before we prioritize it.
My bad @noahtalerman.
Feature fest: What's the difference in cost? Let's figure that out first.
@jahzielv heads up, this didn't get prioritized during feature fest.
@noahtalerman gotcha! For next steps, should I try to get a detailed estimate on cost savings?
@jahzielv I don't think we need to jump on this now, but maybe this is a candidate for something to do during on-call... @lukeheath @sharon-fdm what do you think?
Agreed, this looks like a great on-call activity.
@rfairburn Would you please review and let me know if you think there's opportunity to reduce load test costs by using spot instances? Thanks!
If we had a way to persist the state that containers load, then fargate-spot for the osquery-perf containers would be ideal.
While the ECS costs in any instance are not the biggest cost component, we could save up to 70% on the osquery-perf containers. I will look at the billing and see how much that would be.
@noahtalerman @lukeheath in the month of November, we spent roughly $10k on load testing. Of that, ECS/Fargate was around 11% of the total costs (between $1,000 and $1,100). My guess is that at best we save around 40-50% of that by switching the loadtesting containers to spot instances, assuming we want to keep the Fleet containers on-demand. Perhaps even a little less, depending on where current demand puts spot instance costs. So my guess is that it would be roughly a net $400/month savings or so.
Right now the amount of logging we do in Cloudwatch accounts for around 50% of the costs. My guess is that much of that comes from the osquery-perf containers. Not sure if that can really be helped, assuming we want to have debug data when it comes time to troubleshoot. Optionally, we could default to routing osquery-perf logs to /dev/null until we decide we need them (and then re-apply with a config that has better logging). Not sure the extra time for the Dev/QA engineer to wait for a re-apply would be justified, however. Any loss of productivity on an already-slow process is probably far more costly overall than the logging costs in AWS. There could be some kind of compromise in between that we could look at.
@rfairburn Thanks for digging into it. Sounds like the cost savings aren't enough to pursue spot instances right now, but reducing logs might be worth a look. @jahzielv Do we typically need the osquery-perf logs for load testing?
@lukeheath I believe so; they contain helpful information about osquery-perf's behavior (simulated failure rates, for example). However, they do log quite often, so maybe we can make them less chatty or increase the time between logs? That said, I haven't done a load test in a hot minute, so I'm looping in some other folks who might know better from more recent experience: @mna or @getvictor, have y'all used those logs from osquery-perf much?
Generally, we only need osquery-perf logs for debugging. For example, I recently found an issue with osquery-perf crashing with the help of logs: #24381
Got it. Let's keep it in place for now since the spin up and down takes so long, but we're working on finding faster ways to handle that.
Testing in the cloud, high, |
Problem
We currently have a loadtesting process we use to loadtest features that might have an impact on large-scale users. This process has been very effective at finding bugs and is a critical part of ensuring quality releases, but there are some downsides to it. One of these downsides is cost. Our loadtesting cloud env can be quite expensive to run, as it is very close to a production deployment with 100k+ hosts.
Potential solutions
If we switch to using EC2 Spot Instances, we could save quite a bit of money. Spot Instances are EC2 instances that use leftover resources in AWS' system. This means that they are much cheaper than On-Demand Instances, but they can be brought offline by AWS at any moment. Currently, the osquery-perf process is stateful, but doesn't save its state anywhere. So when an instance crashes (or, with Spot Instances, is brought offline by AWS), we'd lose that state. If we can update osquery-perf to save and reuse state, we'd be able to deploy onto Spot Instances.
In general, we'd need:
- osquery-perf to read in configs and use them to emulate those hosts