CPU spike fault not injecting #82

jv-frechstest · 2021-10-11T22:02:35Z

Describe the bug
Mangle is deployed in a container on OpenShift. We are trying to inject a CPU spike fault into a service on a K8s cluster endpoint.
We are not able to spike the CPU to 90%. The CPU rises very little compared to the target we set.

To Reproduce
Steps to reproduce the behavior:

1. Go to 'Mangle UI'.
2. Click on 'Fault Exception'.
3. Click on 'CPU'.
4. Fill in the target cluster and JVM information, and set CPU to 100%.
5. Go to Requests and Response. You will see that the request executed successfully.

Expected behavior
CPU spikes to 100% on the target instance.

Configuration information:

Deployment Type: Container (target API on a k8s container)
Deployment Mode: Cluster
Client OS: OpenShift
Client Browser: Chrome
Version: 93

ashrimalivmware · 2021-10-12T11:05:54Z

How much of a CPU spike do you see after requesting 100%? How are you measuring the spike?

ashrimalivmware · 2021-12-17T11:00:09Z

Hi @jv-frechstest, can you please provide more details on the questions above: how much of a CPU spike do you see after requesting 100%, and how are you measuring it?

ashrimalivmware · 2021-12-22T06:52:28Z

Since there has been no update, I am closing the ticket. @jv-frechstest, please re-open it if required.

Anvesh42 · 2022-02-18T18:52:41Z

Hi. Our team has adopted Mangle in our organization. It's running as a Docker image on a Kubernetes cluster. We are having similar issues with CPU_SPIKE & MEMORY_SPIKE. When we inject 100% chaos with either the CPU or MEMORY spike, we only see about 60% injection.
We tried reaching out a few times to Mangle VMware support at mangle@vmware.com, but we haven't received any response.
Is this something that the Mangle team can help us with?

rpraveen-vmware · 2022-02-21T03:40:00Z

Hi @Anvesh42,
As mentioned in the earlier comment, can you please specify how you are measuring the spike?
Going by the details provided above, you are performing the Application level CPU/memory fault, right?
Do you observe a similar issue when you perform the Infrastructure level faults?

Anvesh42 · 2022-02-23T17:36:13Z

@rpraveen-vmware Thanks for the response. So let me sum up the challenges/issues our team is facing with mangle before getting into the execution details and further scenarios that were run.

Application Level CPU & MEMORY Faults
At the application level, these faults expect a JVM process ID as a mandatory input. We have a Kubernetes environment where a POD can land on any node in the cluster. Now, if there are 5 PODs of a given microservice, the JVM PID can be different on each POD, correct? In our case, we had 3 PODs with one JVM PID, let's say 15, and the other 2 with, let's say, 13. All 5 PODs belong to the same microservice.
So if we want to inject the CPU spike fault on all the PODs of the microservice foo using the label app=foo, how do we overcome this obstacle? In the above scenario, we can only choose 13 or 15 as the JVM process.

How do we address the injection of the CPU fault at the application level in this scenario? Our approach with all the faults has been to use labels instead of a specific container, so that all the PODs matching the label are in play.

Infrastructure Level CPU & MEMORY Faults
In contrast to the application level CPU fault, the infrastructure level CPU fault doesn't expect the JVM PID and a few other arguments, but it acts at the infrastructure level and impacts all the processes rather than just the microservice-specific process. I am not clear on whether infrastructure level faults can be used in place of application level faults for CPU & MEMORY.

I would like to hear your suggestions on this?

The spike is injected at the infra level with the infrastructure faults and at the specific JVM process with the application fault. It appears to me that the application level CPU & MEMORY faults are more specific to a particular JVM process, which is good. I mean, if there are, let's say, 5 different services running on a given node within the cluster, running an infrastructure fault will impact all of them.

Measuring CPU & MEMORY Faults
In response to your question, we are using a Grafana dashboard to measure the CPU spike. However, we also use the top command to observe the spike in a given container.

Your response is appreciated. Thanks!

rpraveen-vmware · 2022-02-24T06:05:50Z

@Anvesh42

For the Application level CPU and memory faults,
the mandatory JVM process parameter can be either the process ID or the JVM process descriptor name.
In case of multiple pods, since the process IDs will differ, you can go with the second option.
Use the "jps" command to list the Java processes with their descriptor names.
e.g.:
jps
1296 Jps
1 LintApplication

This name (e.g. LintApplication) will remain the same for this Java process across pods.
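
To illustrate why the descriptor name is the more stable handle, here is a minimal Python sketch that looks up PIDs by name in jps-style output. The sample outputs, the pod names, and the helper `pids_for_descriptor` are assumptions made up for illustration; this is not how Mangle resolves the process internally.

```python
from typing import List


def pids_for_descriptor(jps_output: str, descriptor: str) -> List[int]:
    """Return the PIDs whose jps descriptor name equals `descriptor`."""
    pids = []
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)
        # jps prints "<pid> <name>"; skip blank or malformed lines.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, name = parts
        if name.strip() == descriptor:
            pids.append(int(pid))
    return pids


# Hypothetical jps output captured from two pods of the same microservice.
pod_a = "1296 Jps\n1 LintApplication\n"
pod_b = "842 Jps\n15 LintApplication\n"

for pod_name, output in (("pod-a", pod_a), ("pod-b", pod_b)):
    # The PID differs per pod, while the descriptor name stays the same.
    print(pod_name, pids_for_descriptor(output, "LintApplication"))
```

The same name resolves to a different PID in each pod, which is why a single PID cannot address every replica selected by a label.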
1. It depends on your use case and what specifically you are testing.
  The Application level CPU/memory fault targets a specific JVM process. Hence, it simulates your running Java process causing the CPU/heap memory spikes.
In case of the Infrastructure CPU/memory fault, it increases resource usage on the machine as a whole.
So, you can test how your application hosted on that machine behaves when resource spikes happen (a simulation of spikes caused by external factors on the machine).
1. You can use
  kubectl top pod POD_NAME --containers # Show metrics for a given pod and its containers
  kubectl top pod POD_NAME --sort-by=cpu # Show metrics for a given pod and sort it by 'cpu' or 'memory'
  to monitor resource usage for pods.
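
Since kubectl top reports CPU in millicores while the fault is requested as a percentage, a small conversion helps when comparing the two. The following is a minimal Python sketch, assuming the usual four-column output of `kubectl top pod POD_NAME --containers` (POD, NAME, CPU(cores), MEMORY(bytes)). The sample text, the pod and container names, and the helper names are hypothetical.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '500m' or '1' to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def usage_percent_of_limit(top_output: str, container: str, cpu_limit: str) -> float:
    """Return a container's CPU usage as a percentage of its CPU limit."""
    limit_m = cpu_to_millicores(cpu_limit)
    for line in top_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[1] == container:
            return 100.0 * cpu_to_millicores(fields[2]) / limit_m
    raise ValueError(f"container {container!r} not found in output")


# Hypothetical output; the values are illustrative, not measurements from this issue.
sample = """POD NAME CPU(cores) MEMORY(bytes)
foo-7d9c foo 250m 610Mi
"""

print(usage_percent_of_limit(sample, "foo", "500m"))  # 50.0
```

Note that this expresses usage relative to the container's CPU limit, which may differ from the percentage shown by top inside the container.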

Anvesh42 · 2022-02-25T15:12:19Z

@rpraveen-vmware Thanks for your inputs, Praveen. I am working on running these scenarios based on the above pointers.

Meanwhile, we are looking to upgrade Mangle from 3.0 to 3.5. I have been told that version 3.5 has the Log4J issue remediation (an issue that surfaced very recently, a few weeks ago).

https://hub.docker.com/layers/mangleuser/mangle/3.5.0/images/sha256-cc8d7c4542a86a942c046e118602db093efa7d7ba529f61845d761a75c1b6f9c?context=explore

I did find the image but I do not see info on the changelog i.e. what has been changed from 3.0 to 3.5. I am not sure if Log4J remediation changes have been added in this version.

Mind throwing some light? Or is there some other place where I could find the changelog?

Thanks
Anvesh

ashrimalivmware · 2022-02-28T04:05:39Z

Hi @Anvesh42 ,

Mangle 3.5 has the following changes:

Integrate Dynatrace as a Metric Provider to Mangle.
Enhance Network faults to have a varied latency for the entire timeout.
Option to show all the resources of a K8S cluster and select the required resource for fault injection.
Add a new Fault for K8S Drain Nodes.
Log4j Vulnerability fix.
A much improved real-time polling.

Thanks,
-Avinash

Anvesh42 · 2022-02-28T20:08:43Z

@ashrimalivmware @rpraveen-vmware Thanks Avinash.

I ran the tests based on @rpraveen-vmware's suggestion to use the jps descriptor name instead of the PID, using the current version of Mangle. While it helped to some extent and addressed that concern, here are the findings:

I had 4 PODs of a microservice running in a namespace. I injected the CPU SPIKE chaos at the application level using the jps argument.

The CPU fault injected the spike across only 3 of the 4 PODs.
Why was the 4th POD missed in this execution? It has the same label and the same jps value. (All 4 are replicas.)

The injected percentage is still less than the user-defined value. This was the primary issue of this thread.

The defined value was 80% and the injected value was 50%. Please find the attached snippet of the configuration.
I re-ran the execution with 60% and 70% and still saw only 50% injection.
The spike was measured using the top command in the container.

It would be nice to connect so we can work together and address/improve these issues.

P.S. We use Microsoft Teams in our environment. So we can connect there depending on your availability.

Appreciate your response!

Anvesh

ashrimalivmware · 2022-03-01T06:29:27Z

@Anvesh42 Yes, it would be better if we connect, and MS Teams is fine with us. Please feel free to schedule a call. Preferred timings are after 8:30 AM IST and before 9:30 PM IST.

Anvesh42 · 2022-03-15T15:15:58Z

@ashrimalivmware @rpraveen-vmware Please find the attached OpenShift DC objects for sample namespaces, DEV03 & DEV70, that we used during the working session to test the CPU_FAULT spike scenarios.

DEV03 image properties:- RHEL:7.7-openjdk:1.8.0.232

DEV70 image properties:- RHEL:7.9-openjdk:1.8.0.292

We tested the following scenarios by modifying the resources section in each DC (Deployment Config) object.

NOTE:

Scenarios 1 & 2 are of the same kind, i.e. the request is less than the limit, but with different values (millicore vs. core).
Scenarios 3 & 4 are of the same kind, i.e. the request is equal to the limit, but with different values (millicore vs. core).

Scenario 1: CPU request is less than the CPU limit

 - resources:
       limits:
           cpu: '500m'
           memory: 2Gi
       requests:
           cpu: '100m'
           memory: 512Mi

Scenario 2: CPU request is less than the CPU limit

 - resources:
       limits:
           cpu: '1'
           memory: 2Gi
       requests:
           cpu: '200m'
           memory: 512Mi

Scenario 3: CPU request is equal to the CPU limit (CPU request & limit both are equal to 1 core)

 - resources:
       limits:
           cpu: '1'
           memory: 2Gi
       requests:
           cpu: '1'
           memory: 512Mi

Scenario 4: CPU request is equal to the CPU limit (CPU request & limit both are equal to 500 millicore)

 - resources:
       limits:
           cpu: '500m'
           memory: 2Gi
       requests:
           cpu: '500m'
           memory: 512Mi

Command used to measure the CPU spike in the microservice container:- kubectl top pod <POD_NAME> --containers

Observations made:-

With the configuration depicted in scenario 1, the user injected a CPU spike of 80% on both the DEV70 and DEV03 PODs.

DEV03: The injected spike was always less than the user-defined target value. In most cases, the spike did not cross 50%.
DEV70: The injected spike was always less than the user-defined target value. In most cases, the spike did not cross 50%.

With the configuration depicted in scenario 2, the user injected a CPU spike of 80% on both the DEV70 and DEV03 PODs.

DEV03: The injected spike was always equal to the user-defined target value. Successful scenario.
DEV70: The injected spike was always equal to the user-defined target value. Successful scenario.

With the configuration depicted in scenario 3, the user injected a CPU spike of 80% on both the DEV70 and DEV03 PODs.

DEV03: The injected spike was always equal to the user-defined target value. Successful scenario.
DEV70: The injected spike was always equal to the user-defined target value. Successful scenario.

With the configuration depicted in scenario 4, the user injected a CPU spike of 80% on both the DEV70 and DEV03 PODs.

DEV03: The injected spike was always less than the user-defined target value. In most cases, the spike did not cross 50%.
DEV70: The injected spike was always less than the user-defined target value. In most cases, the spike did not cross 50%.

Scenarios were successful in all cases where the CPU limit was 1 core, irrespective of whether the request was equal to or less than the limit (scenarios 2 & 3).
Though scenarios 3 & 4 are of the same kind, i.e. the CPU request and limit are equal, the fault only works when the values are in cores (scenario 3) and doesn't work when the values are in millicores (scenario 4).
Though scenarios 1 & 2 are of the same kind, i.e. the CPU request is less than the limit, the fault only works when the limit is in cores (scenario 2) and doesn't work when the limit is in millicores (scenario 1).
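
To make the four scenarios easier to compare, the Python sketch below works out what an 80% target means in millicores under each CPU limit, and how high a container with that limit can go when usage is expressed as a percentage of one full core. Whether Mangle or the measuring tool interprets the percentage against the limit or against a core is not established in this thread, so both readings are shown as assumptions. The helper names are hypothetical.

```python
def to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '500m' or '1' to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def summarize(target_percent: int, cpu_limit: str) -> dict:
    """Express a spike target under two assumed readings of 'percent'."""
    limit_m = to_millicores(cpu_limit)
    return {
        "limit_millicores": limit_m,
        # Reading A (assumption): the target is a percentage of the CPU limit.
        "target_of_limit_millicores": limit_m * target_percent // 100,
        # Reading B (assumption): usage is shown as a percentage of one core,
        # so the limit itself caps the highest value that can be observed.
        "ceiling_percent_of_one_core": limit_m / 10,
    }


# (request, limit) pairs taken from the four scenarios in this comment.
SCENARIOS = {
    1: ("100m", "500m"),
    2: ("200m", "1"),
    3: ("1", "1"),
    4: ("500m", "500m"),
}

for number, (_request, limit) in SCENARIOS.items():
    print(f"scenario {number}: limit={limit} -> {summarize(80, limit)}")
```

Under reading B, a container limited to 500m cannot exceed 50% of one core, which happens to line up with scenarios 1 and 4. This is only one possible explanation and has not been confirmed in this thread.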

Please note that any fixes or enhancements required for the CPU fault will most likely apply to the MEMORY fault as well.

DEV03-DC.txt
DEV70-DC.txt

Anvesh42 · 2022-04-18T18:25:13Z

@rpraveen-vmware @ashrimalivmware Has there been any update on this? I hope your team was able to replicate the scenarios that we went over during our meeting a few weeks ago and that are described in detail above.

rpraveen-vmware · 2022-04-20T14:40:18Z

Hi @Anvesh42, @ashrimalivmware
We tried to simulate the above scenarios in our k8s environment.
Similar to scenario 1, we deployed a pod with the following CPU/memory configuration and tried an Application CPU spike fault of 80% on the pod:

Limits:
  cpu: 1200m
  memory: 4000Mi
Requests:
  cpu: 900m
  memory: 3800Mi

  We did see the CPU spike crossing 80 percent when checking through kubectl top pod.
  However, we see that you have the pods deployed on OpenShift.
  We would need to troubleshoot this further to see if this behaviour is environment specific.

aswathy-ramabhadran added a commit that referenced this issue Nov 26, 2021

GitBook: [#82] Making a sample Doc change for Mangle 3.5

09d3c16

ashrimalivmware self-assigned this Dec 17, 2021

ashrimalivmware closed this as completed Dec 22, 2021

ashrimalivmware reopened this Mar 7, 2022
