Load tests with provider-aws #576

Closed
ulucinar opened this issue Feb 28, 2023 · 16 comments

@ulucinar
Collaborator

ulucinar commented Feb 28, 2023

In the context of #325, we would like to perform some load tests to better understand the scaling characteristics of the provider. The most recent experiments related to provider performance are here, but they were for parameter optimization rather than load testing. These tests can also help us give the community sizing & scaling guidance.

We may run a set of experiments in which we gradually increase the number of MRs provisioned until we saturate the compute resources of upbound/provider-aws. I suggest we initially use a GCP regional cluster with e2-standard-32 worker nodes, the vanilla provider, and the default parameters (especially the default max-reconcile-rate of 10, as suggested here), so that we can better relate our results to those of the previous experiments, whose results were also used to choose the current default provider parameters.

We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:

  • The types and number of MRs provisioned during the test
  • Success rate for the Ready=True, Synced=True state within 10 min: during a 10-minute interval, how many of the MRs acquired these conditions and how many failed to do so?
  • Using the available Prometheus metrics from the provider, what was the peak & avg. memory/CPU utilization? You can install the Prometheus and Grafana stack using something like: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace from the prometheus-community Helm repository (helm repo add prometheus-community https://prometheus-community.github.io/helm-charts). We may include the Grafana dashboard screenshots like here.
  • kubectl get managed -o yaml output at the end of the experiment.
  • Time-to-readiness metrics as defined here. Histograms like the ones we have there would be great, but we can also derive them later.
  • go run github.com/upbound/uptest/cmd/ttr@fix-69 output (related to the above item)
  • ps -o pid,ppid,etime,comm,args output from the provider container. We can do this at the end of each experiment run or, better, report it throughout the experiment with something like: while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done and log the output to a file (see the sketch after this list). You can refer to our conversation with @mmclane here for more context on why we do this.
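
A minimal sketch of that monitoring loop, assuming kubectl access to the provider pod; the pod name, namespace, and 30-second sampling interval are placeholders rather than part of the protocol above:

```bash
#!/usr/bin/env bash
# Sketch: periodically capture the provider's process table during an experiment.
# POD and NAMESPACE are hypothetical placeholders; adjust them to the actual provider pod.
POD="provider-aws-xxxxx"
NAMESPACE="upbound-system"
OUT="ps-$(date +%Y%m%d-%H%M%S).log"

while true; do
  date >> "${OUT}"
  kubectl -n "${NAMESPACE}" exec "${POD}" -- ps -o pid,ppid,etime,comm,args >> "${OUT}"
  sleep 30  # avoid busy-looping; adjust the sampling interval as needed
done
```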

As long as we have not saturated the compute resources of the provider, we can iterate with a new experiment with more MRs, in increments of 5 or 10. I think we can initially start with 30, i.e., with a count that should give a 100% success rate: all provisioned MRs become ready within the allocated 10 minutes. A rough sketch of provisioning such a batch of MRs follows below.
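
To make the iteration concrete, here is a rough sketch of how a batch of N MRs could be provisioned; the ECR Repository group/version/kind, the region, and the providerConfigRef name are assumptions about the test setup, not prescriptions:

```bash
#!/usr/bin/env bash
# Sketch: provision N copies of a simple MR for a load-test run.
# The apiVersion/kind, region, and providerConfigRef are assumptions based on
# upbound/provider-aws v0.30.x; adjust them to the actual test setup.
N="${1:-30}"  # start with 30 MRs, as suggested above

for i in $(seq 1 "${N}"); do
  cat <<EOF | kubectl apply -f -
apiVersion: ecr.aws.upbound.io/v1beta1
kind: Repository
metadata:
  name: loadtest-repo-${i}
spec:
  forProvider:
    region: us-west-1
  providerConfigRef:
    name: default
EOF
done
```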

@sergenyalcin
Collaborator

sergenyalcin commented Mar 1, 2023

In the context of this issue, I wanted to provision Kubernetes clusters in GKE. I tried two configurations, zonal and regional, and will share my observations on both.

Zonal Cluster: us-central1-c | 3 worker nodes | e2-standard-2 | 2 vCPU, 8 GB memory | 24 GB memory total | K8s version is 1.25.6

  • Provisioned this cluster.
  • Started installation of the provider-aws v0.30.0 (latest)
  • Then I observed some TLS handshake errors in queries (at 14:30 TRT).
  • After a while, the cluster switched to the Repairing state. This was repeated several times.
  • It took more than two hours for the cluster to fully come out of this process.

Regional Cluster: us-west1 | 3 worker nodes | e2-standard-2 | 2 vCPU, 8 GB memory | 24 GB memory total | K8s version is 1.25.6

  • Provisioned this cluster.
  • Started installation of the provider-aws v0.30.0 (latest)
  • Then I observed some TLS handshake errors in queries (at 21:40 TRT).
  • After a while, the cluster switched to the Repairing state. This was repeated several times.
  • It took more than one hour for the cluster to fully come out of this process.

This is an important point for CRD scaling. From the user's perspective, with reasonably sized machines, it took on the order of one to two hours for the provider to be set up and become usable. Also, after the cluster became stable, the crossplane, crossplane-rbac, and AWS provider pods were restarted many times.

@sergenyalcin
Collaborator

sergenyalcin commented Mar 2, 2023

In the AWS (EKS) cluster, the situation is noticeably better: stabilization takes 5-7 minutes.

us-west1 | 2 worker nodes | t3.medium | 2 vCPU, 4 GB memory | 8 GB memory total | K8s version is 1.25

@sergenyalcin
Copy link
Collaborator

sergenyalcin commented Mar 2, 2023

Another observation: after deleting the finalizers of the resources, I observed that memory consumption increased:

[Screenshot: Performance Observation 1]

This graph shows the memory consumption of the provider-aws pod.
This test was done with only one ECR Repository MR.

@Piotr1215
Contributor

@sergenyalcin for GCP clusters, I remember Nic mentioning that it requires 11 nodes to become stable; see the relevant Slack thread.

@ulucinar
Collaborator Author

ulucinar commented Mar 3, 2023

Hi folks,
I've added the "go run github.com/upbound/uptest/cmd/ttr@fix-69 output" step to our test protocol (please see this issue's description) so that we keep a record of the TTR measurements of our managed resources during the experiments. The measurement tool's PR has not been merged yet, so we may run it from a feature branch for now.
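
For completeness, a sketch of how the end-of-experiment artifacts could be captured together; it assumes the current kubeconfig context points at the test cluster and that the ttr tool can be run directly from the feature branch as shown in the description:

```bash
# Sketch: capture end-of-experiment artifacts side by side.
TS="$(date +%Y%m%d-%H%M%S)"
kubectl get managed -o yaml > "managed-${TS}.yaml"
go run github.com/upbound/uptest/cmd/ttr@fix-69 | tee "ttr-${TS}.log"
```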

@sergenyalcin
Collaborator

I did some load tests with provider-aws v0.30.0 in an EKS cluster.

Test Environment: EKS Cluster - m5.2xlarge - 32 GB Memory - 8 vCPUs

Test Results:

[Screenshot: test results]

We observe that memory usage does not increase after the CPU is saturated. As expected, an increase in TTRs is recorded, but the provider continues to do its job: the TTR remains acceptable at 70 MRs, the provider did not stop working, and it provisioned all resources properly. So it can be said that the provider is working effectively.

On the other hand, no zombie processes were found in the tests. Log files for all tests are attached.

ps10.log ps20.log ps30.log ps50.log ps60.log ps70.log


Experiment: Deploying the stated number of MRs to the cluster and then deleting the related resources (i.e., removing the finalizers and deleting the physical resources).

Experiment Time: The time elapsed between deploying the stated number of MRs to the cluster and removing them.

TTR: The time elapsed between deploying an MR to the cluster and its Ready condition becoming True.
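
As a rough way to derive the TTR values defined above from cluster state, the creation timestamp and the Ready condition's lastTransitionTime can be compared per MR. This jq sketch assumes the standard Crossplane status.conditions layout and only prints the two timestamps for later processing:

```bash
# Sketch: for each MR that has become ready, print its name, creation time,
# and the lastTransitionTime of its Ready condition (tab-separated).
kubectl get managed -o json | jq -r '
  .items[]
  | . as $mr
  | ($mr.status.conditions[]? | select(.type == "Ready" and .status == "True")) as $ready
  | [$mr.metadata.name, $mr.metadata.creationTimestamp, $ready.lastTransitionTime]
  | @tsv'
```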

@ulucinar
Collaborator Author

ulucinar commented Mar 6, 2023

Hi @sergenyalcin,
Here are the packages that enable the shared server runtime on top of v0.30.0 for testing:

  • ulucinar/provider-aws-amd64:v0.30.0-f714e11d
  • ulucinar/provider-aws-arm64:v0.30.0-f714e11d

@sergenyalcin
Collaborator

sergenyalcin commented Mar 7, 2023

I did some load tests with provider-aws v0.30.0 in a bigger EKS cluster (bigger in terms of CPU). In the previous test, because the CPU was saturated, we did not observe any memory increase after a point, so we switched to a bigger machine.

Test Environment: EKS Cluster - m5.2xlarge - 32 GB Memory - 16 vCPUs

Test Results:

[Screenshot: test results]

We did not observe CPU saturation, but CPU usage stopped increasing after a point. We think this is related to the provider's parallelism parameter: once we hit this parallelism limit, we cannot saturate the CPU.
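
If we wanted to confirm that the parallelism limit is what prevents CPU saturation, the reconcile concurrency could be raised for a test run. A sketch using a ControllerConfig follows; the flag corresponds to the max-reconcile-rate parameter mentioned in the issue description, while the object name and the value 50 are arbitrary examples for such an experiment, not recommendations:

```bash
# Sketch: raise the provider's reconcile concurrency for a test run.
# The Provider object must reference this ControllerConfig via spec.controllerConfigRef.name.
cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: loadtest-config
spec:
  args:
    - --max-reconcile-rate=50
EOF
```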

The main observation is that, again, the provider continues to do its job: the TTR remains acceptable at 150 MRs, the provider did not stop working, and it provisioned all resources properly. So it can be said that the provider is working effectively.

On the other hand, I still did not observe any zombie processes. Log files for all tests are attached.

ps50.log ps75.log ps85.log ps100.log ps150.log

@sergenyalcin
Collaborator

I did some tests using the gRPC-server-enabled image prepared by @ulucinar.

Test Environment: EKS Cluster - m5.2xlarge - 32 GB Memory - 16 vCPUs

Test Results:

[Screenshot: test results]

The memory consumption is higher. I synced with @ulucinar offline, and he mentioned an upstream memory leak issue in the context of the gRPC-server-enabled setup.

On the other hand, there is a significant improvement in the TTR values, and we successfully created 1000 MRs.

I also ran the 1000-MR test with the v0.30.0 image. Results:

[Screenshot: test results]

There is a significant difference in provider performance (please see the TTR values).

So, based on these results, I think switching to the gRPC-server-enabled implementation, once the leak issue is resolved, will be the best option in terms of performance.

@ulucinar
Collaborator Author

ulucinar commented Mar 9, 2023

Hi @sergenyalcin,
Thank you for the tests; they are extremely insightful.

Here's the upstream issue that's probably related to the high memory consumption you observe with the shared server runtime:
hashicorp/terraform-provider-aws#26130

@negz
Member

negz commented Mar 16, 2023

> I did some tests by using the GRPC server enabled image that was prepared by @ulucinar

I want to make sure I'm understanding these results correctly. Am I correct that our most efficient (i.e. most optimized) build of upbound/provider-aws uses most of 16 vCPUs and 3GB of memory to reconcile 1,000 managed resources?

@sergenyalcin
Collaborator

sergenyalcin commented Mar 17, 2023

Test results from the latest image from @ulucinar (Shared Provider Scheduler). This image contains:

[Screenshot: included changes and test results]

According to the latest results:

  • For both metrics, Peak Memory Consumption and TTR, there are significant improvements.
  • In average memory consumption (these values are not in the tables), we also see an improvement.

With these results, we can say that the most successful image is this one. I am putting the image references here:

  • ulucinar/provider-aws-amd64:v0.31.0-08222dae
  • ulucinar/provider-aws-arm64:v0.31.0-08222dae

Please note that these images are based on v0.31.0 of provider-aws.

@sergenyalcin
Collaborator

Compared to the baseline, some of the improvement rates with the final image were:

  • TTR: 67%
  • Peak Memory: 11%
  • Average Memory: 27%
  • Peak CPU: 12%

@sergenyalcin
Collaborator

The test results for the Workspace Scheduler (for more context please see crossplane/upjet#178)

[Screenshot: test results]

When we compare the results with the baseline:

  • TTR: 61%
  • Peak Memory: no improvement (same results)
  • Average Memory: no improvement (same results)
  • Peak CPU: 6%
  • Average CPU: 8%

@sergenyalcin
Collaborator

sergenyalcin commented Mar 27, 2023

Test results from the latest image from crossplane/upjet#178
Image: ulucinar/provider-aws-amd64:v0.31.0-ca2f21ca

[Screenshot: test results]

After the latest changes, the shared scheduler implementation has the same performance results.

[Screenshot: test results]

@sergenyalcin
Collaborator

Many experiments have been done to determine the performance characteristics of provider-aws, both for the baseline and for the different schedulers. In addition, load tests were performed with a large number of MRs and the results were recorded. Therefore, I'm closing this issue.
