KB content #970

Merged
merged 1 commit into from
Oct 5, 2023
26 changes: 25 additions & 1 deletion pages/doc/tas_to_troubleshoot.md
@@ -216,4 +216,28 @@ If you see an app called `tas2to-sli-test-app` in the results of `cf apps` or a

```
cf delete-route example.com --hostname tas2to-sli-test-app
```

## Symptom: The Percentage in the Application CPU % Chart Is Over 100%

The **Application CPU %** chart in the **TAS: Workload Monitoring** dashboard lists the application instances ranked by the highest utilization of their CPU entitlement. Sometimes, the chart might show CPU usage above 100% for some applications. This happens when an application container uses more than its share of CPU and the Diego Cell hosting the application has spare capacity.
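
As a rough illustration of why the value can exceed 100%, the following sketch expresses CPU usage as a percentage of a container's entitlement. The function name `entitlement_pct`, the units, and the sample numbers are assumptions made for this example; they are not taken from the dashboard's actual query.

```shell
# Hypothetical helper: CPU usage as a percentage of a container's entitlement.
# Both arguments use the same unit (here, hundredths of a core). The sample
# values are illustrative only.
entitlement_pct() {
  local used="$1" entitled="$2"
  # Integer arithmetic: multiply first so small ratios do not truncate to 0.
  echo $(( used * 100 / entitled ))
}

entitlement_pct 25 50    # prints 50: the container is within its entitlement
entitlement_pct 120 50   # prints 240: the container is using spare Cell capacity
```

A value above 100 in this sketch means only that the container used more than its entitled share, which is possible as long as the Cell has idle CPU.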

Although CPU usage over 100% might seem unexpected or alarming, it can also occur normally as a result of intentional planning.

If the host has spare CPU, it does not throttle the CPU usage of applications. However, if demand on the host's CPU increases, the host tries to throttle all the containers fairly, based on their entitlement. If an application is throttled, you can observe degraded performance from that application. If the application is functioning normally with occasional CPU spikes, you can choose to leave the memory and CPU entitlement unchanged.

Because throttling only happens on Diego Cells that are heavily utilized, the CPU utilization of the Diego Cells themselves is more important to monitor than the container CPU utilization. Application-level performance metrics, such as RED metrics, are also often more important than container CPU.

If an application container is consistently using more than its CPU entitlement, the solution is to scale up the memory requested by the app, which also increases the container's CPU share. See [App manifest attribute reference](https://docs.cloudfoundry.org/devguide/deploy-apps/manifest-attributes.html#memory).
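
One way to raise the memory request is the `cf scale` command with the `-m` option. The sketch below wraps it in a small function. The function name `scale_app_memory` is hypothetical, and the app name and memory size you pass to it are placeholders for your own values.

```shell
# Hypothetical wrapper around `cf scale`. The app name and memory size are
# supplied by the caller and are placeholders in the usage example.
scale_app_memory() {
  local app="$1" memory="$2"
  # -m sets the memory limit for the app (for example, 2G). Changing it
  # restarts the app, so cf asks for confirmation before proceeding.
  cf scale "$app" -m "$memory"
}
```

For example, `scale_app_memory my-app 2G` runs `cf scale my-app -m 2G`. Setting `memory` in the app manifest, as described in the linked reference, has the same effect on the next push.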

Note that increasing the app memory might mean that Diego must move the application to a Cell with more available CPU and memory entitlement. If Diego is unable to find a Cell to run the larger application container, the solution is to scale up the Diego Cells, either horizontally or vertically. Adding Diego Cells or increasing the size of your Diego Cells will increase your infrastructure costs.

Considering your own infrastructure costs and performance goals, you may want to keep the memory and CPU entitlements for certain apps at their **typical** CPU utilization, rather than at their **peak** utilization. If you provision all your apps with memory and CPU for their **peak** needs, you will need more Diego capacity to schedule those apps, and much of this capacity will be unused most of the time. If you provision most apps for their **typical** utilization, you will see spikes above their entitlement, but you will be using your infrastructure more efficiently and will see lower infrastructure costs. The optimal choice is to provision some apps for **peak** utilization and others for **typical** utilization, based on business priority or performance sensitivity.

## Symptom: The Percentage in the CPU Usage Chart Is Over 100%

The **CPU Usage** chart in the **TAS: BOSH Director Health** dashboard might show CPU usage higher than 100%. This is because, on multi-core processors, CPU instrumentation reports a maximum usage equal to the number of cores multiplied by 100.

Most modern computers have multiple cores, whereas earlier machines were predominantly single-core, so CPU instrumentation can show CPU utilization greater than 100%.
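
The arithmetic can be sketched as follows. The helper names and the sample reading are assumptions for illustration, and dividing by the core count is only one way to rescale the value; it is not something the dashboard does for you.

```shell
# Hypothetical helpers illustrating multi-core CPU percentages.

# Maximum reportable usage: number of cores multiplied by 100.
max_cpu_pct() {
  local cores="$1"
  echo $(( cores * 100 ))
}

# Rescale a summed multi-core reading to a 0-100 range (integer division).
normalize_cpu_pct() {
  local reported="$1" cores="$2"
  echo $(( reported / cores ))
}

max_cpu_pct 4             # prints 400
normalize_cpu_pct 260 4   # prints 65
```

In this example, a reading of 260% on a 4-core VM corresponds to about 65% of the total CPU capacity.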

If you observe a high value of `system_cpu_core_sys` reported for the BOSH Director during the displayed time interval, you can investigate the cause of the spike. If the cause is a normal workload increase, increase the CPU allocation for the BOSH Director.
2 changes: 1 addition & 1 deletion pages/doc/wavefront_api.md
@@ -172,7 +172,7 @@ Many customers automate the creation, addition, and deletion of alerts, dashboar

* **Problem**

Some WQL queries require quotes for the function to operate correctly. If you use a query that omits required quotes, for example, in an alert condition, a `400` error results.

For example, assume you use the following fragment to create an alert:
