KB content - TAS and a link in the Troubleshooting missing data
Margarita-Staneva committed Oct 5, 2023
1 parent 938f623 commit 5395a06
Showing 2 changed files with 26 additions and 2 deletions.
pages/doc/tas_to_troubleshoot.md (25 additions, 1 deletion)

@@ -216,4 +216,28 @@ If you see an app called `tas2to-sli-test-app` in the results of `cf apps` or a

```
cf delete-route example.com --hostname tas2to-sli-test-app
```

## Symptom: The Percentage in the Application CPU % Chart Is Over 100%

The **Application CPU %** chart in the **TAS: Workload Monitoring** dashboard lists the application instances ranked by the highest utilization of their CPU entitlement. Sometimes, the chart might show CPU usage of more than 100% for some applications. This happens when an application container uses more than its share of CPU and the Diego Cell hosting the application has spare capacity.
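As a rough illustration of how such a percentage is derived, the following is a hedged WQL sketch, not the dashboard's actual query. The metric names `tas.rep.absolute_usage` and `tas.rep.absolute_entitlement`, and the use of `rate()` over these counters, are assumptions; check the metrics browser for the exact names in your environment.

```
100 * rate(ts(tas.rep.absolute_usage)) / rate(ts(tas.rep.absolute_entitlement))
```

A value above 100 means that, over the interval, the container consumed more CPU than its entitlement, which is possible only while the hosting Diego Cell has spare capacity.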

Although CPU usage over 100% might seem unexpected or alarming, it can also occur normally as a result of intentional planning.

If the host has spare CPU, it does not throttle the CPU usage of applications. But if demand on the host's CPU is high, the host throttles all containers fairly, based on their entitlement. A throttled application shows degraded performance. If the application functions normally with occasional CPU spikes, you can leave the memory and CPU entitlement unchanged.

Because throttling only happens on Diego Cells that are heavily utilized, the CPU utilization of the Diego Cells themselves is more important to monitor than the container CPU utilization. Application-level performance metrics, such as RED metrics, are also often more important than container CPU.

If an application container consistently uses more than its CPU entitlement, the solution is to scale up the memory requested by the app, which also increases the container's CPU share. See [App manifest attribute reference](https://docs.cloudfoundry.org/devguide/deploy-apps/manifest-attributes.html#memory).

Note that increasing the app memory might mean that Diego must move the application to a Cell with more available CPU and memory. If Diego cannot find a Cell that can run the larger application container, the solution is to scale up the Diego Cells, either horizontally or vertically. Adding Diego Cells or increasing their size increases your infrastructure costs.
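For example, here is a minimal cf CLI sketch; the app name `my-app` and the 2G target are hypothetical:

```
# Raise the memory allocation of the hypothetical app "my-app" to 2G.
# Because a container's CPU entitlement is proportional to its memory,
# this also raises the app's CPU share. The app restarts during the scale.
cf scale my-app -m 2G
```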
Considering your own infrastructure costs and performance goals, you might want to size the memory and CPU entitlements of certain apps for their **typical** CPU utilization rather than their **peak** utilization. If you provision all your apps for their **peak** needs, you need more Diego capacity to schedule them, and much of that capacity sits unused most of the time. If you provision most apps for their **typical** utilization, you will see occasional spikes above their entitlement, but you will use your infrastructure more efficiently and at lower cost. The optimal choice is to provision some apps for **peak** utilization and others for **typical** utilization, based on business priority and performance sensitivity.

## Symptom: The Percentage in the CPU Usage Chart Is Over 100%

The **CPU Usage** chart in the **TAS: BOSH Director Health** dashboard might show CPU usage higher than 100%. On multi-core processors, CPU instrumentation reports a maximum of the number of cores multiplied by 100, so a 4-core BOSH Director VM can report up to 400% CPU usage. Because modern machines have multiple cores, where single-core processors were once the norm, utilization greater than 100% is expected behavior rather than an error.

If you observe a high value of `system_cpu_core_sys` reported for the BOSH Director over the displayed time interval, investigate the cause of the spike. If the cause is a normal increase in workload, simply increase the CPU allocation for the BOSH Director.
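As a starting point, here is a hedged WQL sketch that charts the metric; the source filter `source="*director*"` is an assumption that depends on how your BOSH Director VM is named in your environment:

```
ts(system_cpu_core_sys, source="*director*")
```

Dividing the result by the VM's core count normalizes the series to a conventional 0-100% scale.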
pages/doc/wavefront_api.md (1 addition, 1 deletion)

@@ -172,7 +172,7 @@ Many customers automate the creation, addition, and deletion of alerts, dashboar
* **Problem**
- Some WQL querie requite quotes for the function to operate correctly. If you use a query that omits required quotes, for example, in an alert condition, a `400` error results.
+ Some WQL queries require quotes for the function to operate correctly. If you use a query that omits required quotes, for example, in an alert condition, a `400` error results.
For example, assume you use the following fragment to create an alert:
