Load tests with provider-aws #576

Closed
ulucinar opened this issue Feb 28, 2023 · 16 comments

@ulucinar
Collaborator

ulucinar commented Feb 28, 2023

In the context of #325, we would like to perform some load tests to better understand the scaling characteristics of the provider. The most recent experiments related to provider performance are here, but they were for parameter optimization rather than load testing. These tests can also help us give the community sizing & scaling guidance.

We may run a set of experiments in which we gradually increase the number of MRs provisioned until we saturate the compute resources of upbound/provider-aws. I suggest we initially use a GCP regional cluster with e2-standard-32 worker nodes, the vanilla provider, and the default parameters (especially the default max-reconcile-rate of 10, as suggested here), so that we can better relate our results to those of the previous experiments, whose results were also used to choose the current default provider parameters.

We can also make use of the existing tooling from here & here to conduct these tests. We should collect & report at least the following for each experiment:

  • The types and number of MRs provisioned during the test
  • Success rate for the Ready=True, Synced=True state within 10 min: during a 10-minute interval, how many of the MRs acquired these conditions and how many failed to do so?
  • Using the available Prometheus metrics from the provider, what was the peak & avg. memory/CPU utilization? You can install the Prometheus and Grafana stack using something like: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n prometheus --set namespaceOverride=prometheus --set grafana.namespaceOverride=prometheus --set kube-state-metrics.namespaceOverride=prometheus --set prometheus-node-exporter.namespaceOverride=prometheus --create-namespace from the prometheus-community Helm repository (helm repo add prometheus-community https://prometheus-community.github.io/helm-charts). We may include the Grafana dashboard screenshots like here.
  • kubectl get managed -o yaml output at the end of the experiment.
  • Time-to-readiness metrics as defined here. Histograms like the ones we have there would be great, but we can also derive them later.
  • go run github.com/upbound/uptest/cmd/ttr@fix-69 output (related to the above item)
  • ps -o pid,ppid,etime,comm,args output from the provider container. We can do this at the end of each experiment run or, better, report it throughout the experiment with something like: while true; do date; k exec -it <provider pod> -- ps -o pid,ppid,etime,comm,args; done and log the output to a file (see the sketch after this list). You can refer to our conversation with @mmclane here for more context on why we do this.
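
A minimal sketch of that monitoring loop, assuming kubectl access to the provider pod; the pod name, namespace, and 30-second sampling interval are placeholders rather than part of the protocol above:

```bash
#!/usr/bin/env bash
# Sketch: periodically capture the provider's process table during an experiment.
# POD and NAMESPACE are hypothetical placeholders; adjust them to the actual provider pod.
POD="provider-aws-xxxxx"
NAMESPACE="upbound-system"
OUT="ps-$(date +%Y%m%d-%H%M%S).log"

while true; do
  date >> "${OUT}"
  kubectl -n "${NAMESPACE}" exec "${POD}" -- ps -o pid,ppid,etime,comm,args >> "${OUT}"
  sleep 30  # avoid busy-looping; adjust the sampling interval as needed
done
```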

As long as we have not saturated the compute resources of the provider, we can iterate with a new experiment with more MRs, in increments of 5 or 10. I think we can initially start with 30, i.e., with a count that should give a 100% success rate: all provisioned MRs become ready within the allocated 10 minutes. A rough sketch of provisioning such a batch of MRs follows below.
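
To make the iteration concrete, here is a rough sketch of how a batch of N MRs could be provisioned; the ECR Repository group/version/kind, the region, and the providerConfigRef name are assumptions about the test setup, not prescriptions:

```bash
#!/usr/bin/env bash
# Sketch: provision N copies of a simple MR for a load-test run.
# The apiVersion/kind, region, and providerConfigRef are assumptions based on
# upbound/provider-aws v0.30.x; adjust them to the actual test setup.
N="${1:-30}"  # start with 30 MRs, as suggested above

for i in $(seq 1 "${N}"); do
  cat <<EOF | kubectl apply -f -
apiVersion: ecr.aws.upbound.io/v1beta1
kind: Repository
metadata:
  name: loadtest-repo-${i}
spec:
  forProvider:
    region: us-west-1
  providerConfigRef:
    name: default
EOF
done
```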

@sergenyalcin
Collaborator

sergenyalcin commented Mar 1, 2023

In the context of this issue, I wanted to provision Kubernetes clusters in GKE. I tried two configurations, zonal and regional, and will share my observations on both.

Zonal Cluster: us-central1-c | 3 worker nodes | e2-standard-2 | 2 vCPU, 8 GB memory | 24 GB memory total | K8s version is 1.25.6

  • Provisioned this cluster.
  • Started installation of the provider-aws v0.30.0 (latest)
  • Then I observed some TLS handshake errors in queries (at 14:30 TRT).
  • After a while, the cluster switched to the Repairing state. This was repeated several times.
  • It took more than two hours for the cluster to fully come out of this process.

Regional Cluster: us-west1 | 3 worker nodes | e2-standard-2 | 2 vCPU, 8 GB memory | 24 GB memory total | K8s version is 1.25.6

  • Provisioned this cluster.
  • Started installation of the provider-aws v0.30.0 (latest)
  • Then I observed some TLS handshake errors in queries (at 21:40 TRT).
  • After a while, the cluster switched to the Repairing state. This was repeated several times.
  • It took more than one hour for the cluster to fully come out of this process.

This is an important point for CRD scaling. From the user's perspective, with reasonably sized machines, it took on the order of one to two hours for the provider to be set up and become usable. Also, after the cluster became stable, the crossplane, crossplane-rbac, and AWS provider pods were restarted many times.

@sergenyalcin
Collaborator

sergenyalcin commented Mar 2, 2023

In the AWS (EKS) cluster, the situation is noticeably better: stabilization takes 5-7 minutes.

us-west1 | 2 worker nodes | t3.medium | 2 vCPU, 4 GB memory | 8 GB memory total | K8s version is 1.25

@sergenyalcin
Copy link
Collaborator

sergenyalcin commented Mar 2, 2023

Another observation: after deleting the finalizers of the resources, I observed that memory consumption increased:

[Screenshot: Performance Observation 1]

This graph shows the memory consumption of the provider-aws pod.
This test was done with only one ECR Repository MR.

@Piotr1215
Contributor

@sergenyalcin for GCP clusters, I remember Nic mentioning that it requires 11 nodes to become stable; see the relevant Slack thread.

@ulucinar
Collaborator Author

ulucinar commented Mar 3, 2023

Hi folks,
I've added the "go run github.com/upbound/uptest/cmd/ttr@fix-69 output" step to our test protocol (please see this issue's description) so that we keep a record of the TTR measurements of our managed resources during the experiments. The measurement tool's PR has not been merged yet, so we may run it from a feature branch for now.
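
For completeness, a sketch of how the end-of-experiment artifacts could be captured together; it assumes the current kubeconfig context points at the test cluster and that the ttr tool can be run directly from the feature branch as shown in the description:

```bash
# Sketch: capture end-of-experiment artifacts side by side.
TS="$(date +%Y%m%d-%H%M%S)"
kubectl get managed -o yaml > "managed-${TS}.yaml"
go run github.com/upbound/uptest/cmd/ttr@fix-69 | tee "ttr-${TS}.log"
```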

@sergenyalcin
Collaborator

I did some load tests with provider-aws v0.30.0 in an EKS cluster.

Test Environment: EKS Cluster - m5.2xlarge - 32 GB Memory - 8 vCPUs

Test Results:

[Screenshot: test results]

We observe that memory usage does not increase after the CPU is saturated. As expected, an increase in TTRs is recorded, but the provider continues to do its job: the TTR remains acceptable at 70 MRs, the provider did not stop working, and it provisioned all resources properly. So it can be said that the provider is working effectively.

On the other hand, no zombie processes were found in the tests. Log files for all tests are attached.

ps10.log ps20.log ps30.log ps50.log ps60.log ps70.log


Experiment: Deploying the stated number of MRs to the cluster and then deleting the related resources (i.e., removing the finalizers and deleting the physical resources).

Experiment Time: The time elapsed between deploying the stated number of MRs to the cluster and removing them.

TTR: The time elapsed between deploying an MR to the cluster and its Ready condition becoming True.
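
As a rough way to derive the TTR values defined above from cluster state, the creation timestamp and the Ready condition's lastTransitionTime can be compared per MR. This jq sketch assumes the standard Crossplane status.conditions layout and only prints the two timestamps for later processing:

```bash
# Sketch: for each MR that has become ready, print its name, creation time,
# and the lastTransitionTime of its Ready condition (tab-separated).
kubectl get managed -o json | jq -r '
  .items[]
  | . as $mr
  | ($mr.status.conditions[]? | select(.type == "Ready" and .status == "True")) as $ready
  | [$mr.metadata.name, $mr.metadata.creationTimestamp, $ready.lastTransitionTime]
  | @tsv'
```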

@ulucinar
Collaborator Author

ulucinar commented Mar 6, 2023

Hi @sergenyalcin,
Here are the packages that enable the shared server runtime on top of v0.30.0 for testing:

  • ulucinar/provider-aws-amd64:v0.30.0-f714e11d
  • ulucinar/provider-aws-arm64:v0.30.0-f714e11d

@sergenyalcin
Collaborator

sergenyalcin commented Mar 7, 2023

I did some load tests with provider-aws v0.30.0 in a bigger EKS cluster (bigger in terms of CPU). In the previous test, because the CPU was saturated, we did not observe any memory increase after a point, so we switched to a bigger machine.

Test Environment: EKS Cluster - m5.2xlarge - 32 GB Memory - 16 vCPUs

Test Results:

[Screenshot: test results]

We did not observe CPU saturation, but CPU usage stopped increasing after a point. We think this is related to the provider's parallelism parameter: once we hit this parallelism limit, we cannot saturate the CPU.
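
If we wanted to confirm that the parallelism limit is what prevents CPU saturation, the reconcile concurrency could be raised for a test run. A sketch using a ControllerConfig follows; the flag corresponds to the max-reconcile-rate parameter mentioned in the issue description, while the object name and the value 50 are arbitrary examples for such an experiment, not recommendations:

```bash
# Sketch: raise the provider's reconcile concurrency for a test run.
# The Provider object must reference this ControllerConfig via spec.controllerConfigRef.name.
cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: loadtest-config
spec:
  args:
    - --max-reconcile-rate=50
EOF
```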

The main observation is that, again, the provider continues to do its job: the TTR remains acceptable at 150 MRs, the provider did not stop working, and it provisioned all resources properly. So it can be said that the provider is working effectively.

On the other hand, I still did not observe any zombie processes. Log files for all tests are attached.

ps50.log ps75.log ps85.log ps100.log ps150.log

@sergenyalcin
Collaborator

I did some tests using the gRPC-server-enabled image prepared by @ulucinar.

Test Environment: EKS Cluster - m5.2xlarge - 32 GB Memory - 16 vCPUs

Test Results:

[Screenshot: test results]

The memory consumption is higher. I synced with @ulucinar offline, and he mentioned an upstream memory leak issue in the context of the gRPC-server-enabled setup.

On the other hand, there is a significant improvement in the TTR values, and we successfully created 1000 MRs.

I also ran the 1000-MR test with the v0.30.0 image. Results:

[Screenshot: test results]

There is a significant difference in provider performance (please see the TTR values).

So, based on these results, I think switching to the gRPC-server-enabled implementation, once the leak issue is resolved, will be the best option in terms of performance.

@ulucinar
Collaborator Author

ulucinar commented Mar 9, 2023

Hi @sergenyalcin,
Thank you for the tests; they are extremely insightful.

Here's the upstream issue that's probably related to the high memory consumption you observe with the shared server runtime:
hashicorp/terraform-provider-aws#26130

@negz
Member

negz commented Mar 16, 2023

> I did some tests by using the GRPC server enabled image that was prepared by @ulucinar

I want to make sure I'm understanding these results correctly. Am I correct that our most efficient (i.e. most optimized) build of upbound/provider-aws uses most of 16 vCPUs and 3GB of memory to reconcile 1,000 managed resources?

@sergenyalcin
Collaborator

sergenyalcin commented Mar 17, 2023

Test results from the latest image from @ulucinar (Shared Provider Scheduler). This image contains:

[Screenshot: included changes and test results]

According to the latest results:

  • For both metrics, Peak Memory Consumption and TTR, there are significant improvements.
  • In average memory consumption (these values are not in the tables), we also see an improvement.

With these results, we can say that the most successful image is this one. I am putting the image references here:

  • ulucinar/provider-aws-amd64:v0.31.0-08222dae
  • ulucinar/provider-aws-arm64:v0.31.0-08222dae

Please note that these images are based on v0.31.0 of provider-aws.

@sergenyalcin
Collaborator

Compared to the baseline, some of the improvement rates with the final image were:

  • TTR: 67%
  • Peak Memory: 11%
  • Average Memory: 27%
  • Peak CPU: 12%

@sergenyalcin
Collaborator

The test results for the Workspace Scheduler (for more context please see crossplane/upjet#178)

[Screenshot: test results]

When we compare the results with the baseline:

  • TTR: 61%
  • Peak Memory: no improvement (same results)
  • Average Memory: no improvement (same results)
  • Peak CPU: 6%
  • Average CPU: 8%

@sergenyalcin
Collaborator

sergenyalcin commented Mar 27, 2023

Test results from the latest image from crossplane/upjet#178
Image: ulucinar/provider-aws-amd64:v0.31.0-ca2f21ca

[Screenshot: test results]

After the latest changes, the shared scheduler implementation has the same performance results.

[Screenshot: test results]

@sergenyalcin
Collaborator

Many experiments have been done to determine the performance characteristics of provider-aws, both for the baseline and for the different schedulers. In addition, load tests were performed with a large number of MRs and the results were recorded. Therefore, I'm closing this issue.
