diff --git a/SUMMARY.md b/SUMMARY.md index 03f888459..de00400a1 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -108,6 +108,8 @@ * [Spot Checklist](using-kubecost/navigating-the-kubecost-ui/savings/spot-checklist.md) * [Spot Commander](using-kubecost/navigating-the-kubecost-ui/savings/spot-commander.md) * [Persistent Volume Right-Sizing Recommendations](using-kubecost/navigating-the-kubecost-ui/savings/pv-right-sizing-rec.md) + * [GPU Optimization](using-kubecost/navigating-the-kubecost-ui/savings/gpu-optimization.md) + * [Turbonomic Actions](using-kubecost/navigating-the-kubecost-ui/savings/turbonomic-actions.md) * [Budgets](using-kubecost/navigating-the-kubecost-ui/budgets.md) * [Audits](using-kubecost/navigating-the-kubecost-ui/audits.md) * [Anomaly Detection](using-kubecost/navigating-the-kubecost-ui/anomaly-detection.md) @@ -161,6 +163,7 @@ * [Container Request Right Sizing Recommendation API (V2)](apis/savings-apis/api-request-right-sizing-v2.md) * [Container Request Recommendation Apply/Plan APIs](apis/savings-apis/api-request-recommendation-apply.md) * [Abandoned Workloads API](apis/savings-apis/api-abandoned-workloads.md) + * [Turbonomic Actions APIs](apis/savings-apis/api-turbonomic-actions.md) * [Filter Parameters (v2)](apis/filters-api.md) ## Architecture @@ -186,6 +189,7 @@ * [Importing Kubecost Data into Microsoft Power BI](integrations/import-kubecost-data-into-microsoft-power-bi.md) * [Integrating Kubecost with Datadog](integrations/integrating-kubecost-with-datadog.md) * [Using Custom Webhook to Create a Kubecost Stage in Spinnaker](integrations/spinnaker-custom-webhook.md) +* [Kubecost Turbonomic Integration](integrations/turbonomic-integration.md) ## Troubleshooting diff --git a/apis/savings-apis/api-turbonomic-actions.md b/apis/savings-apis/api-turbonomic-actions.md new file mode 100644 index 000000000..65590e3f0 --- /dev/null +++ b/apis/savings-apis/api-turbonomic-actions.md @@ -0,0 +1,174 @@ +# Turbonomic Actions + +{% swagger method="get" 
path="turbonomic/resizeWorkloadControllers" baseUrl="http:///model/savings/" summary="Turbonomic Actions: Resize Workload Controllers" %} +{% swagger-description %} +The Resize Workload Controllers API returns workloads for which request resizing has been recommended by Turbonomic. The list of results returned should align with those in the Turbonomic Actions Center. +{% endswagger-description %} + +{% swagger-parameter in="path" name="filter" type="string" required="false" %} +Filter your results by cluster, namespace and/or controller. +{% endswagger-parameter %} + +{% swagger-response status="200: OK" description="" %} +```json +{ + "code": 200, + "data": { + "numResults": 1, + "totalSavings": 2.00, + "actions": [ + { + "action": { + "cluster": "standard-cluster-1", + "namespace": "kubecost", + "controller": "kubecost-cost-analyzer", + "replicaCount": 1, + "compoundActions": { + "cost-model": [ + { + "target": "VCPURequest", + "unit": "mCores", + "oldValue": 200, + "newValue": 100 + } + ] + }, + "available": true, + "targetId": "11111111111111" + }, + "currentMonthlyRate": 4.00, + "predictedMonthlyRate": 2.00, + "predictedSavings": 2.00 + } + ] + } +} +``` +{% endswagger-response %} +{% endswagger %} + +{% swagger method="get" path="turbonomic/suspendContainerPods" baseUrl="http:///model/savings/" summary="Turbonomic Actions: Suspend Container Pods" %} +{% swagger-description %} +The Suspend Container Pods API returns pods that Turbonomic recommends for suspension. The list of results returned should align with those in the Turbonomic Actions Center. +{% endswagger-description %} + +{% swagger-parameter in="path" name="filter" type="string" required="false" %} +Filter your results by cluster, namespace, controller and/or pod. 
+{% endswagger-parameter %} + +{% swagger-response status="200: OK" description="" %} +```json +{ + "code": 200, + "data": { + "numResults": 1, + "totalSavings": 12.37, + "actions": [ + { + "action": { + "cluster": "standard-cluster-1", + "namespace": "infra-cost", + "controller": "infra-cost-agent", + "pod": "infra-cost-agent-xdj34", + "available": true, + "targetId": "11111111111111" + }, + "currentMonthlyRate": 12.37, + "predictedMonthlyRate": 0, + "predictedSavings": 12.37 + } + ] + } +} +``` +{% endswagger-response %} +{% endswagger %} + +{% swagger method="get" path="turbonomic/suspendVirtualMachines" baseUrl="http:///model/savings/" summary="Turbonomic Actions: Suspend Virtual Machines" %} +{% swagger-description %} +The Suspend Virtual Machines API returns virtual machines that Turbonomic recommends for suspension. The list of results returned should align with those in the Turbonomic Actions Center. +{% endswagger-description %} + +{% swagger-parameter in="path" name="filter" type="string" required="false" %} +Filter your results by cluster. +{% endswagger-parameter %} + +{% swagger-response status="200: OK" description="" %} +```json +{ + "code": 200, + "data": { + "numResults": 1, + "totalSavings": 9.03, + "actions": [ + { + "action": { + "cluster": "standard-cluster-1", + "node": "gke-standard-cluster-1-spotpool-b4a02c44-1001", + "available": true, + "targetId": "11111111111111" + }, + "currentMonthlyRate": 9.03, + "predictedMonthlyRate": 0, + "predictedSavings": 9.03 + } + ] + } +} +``` +{% endswagger-response %} +{% endswagger %} + +{% swagger method="get" path="turbonomic/moveContainerPods" baseUrl="http:///model/savings/" summary="Turbonomic Actions: Move Container Pods" %} +{% swagger-description %} +The Move Container Pods API returns pods that Turbonomic recommends to be moved from one node to another. The list of results returned should align with those in the Turbonomic Actions Center. 
+{% endswagger-description %} + +{% swagger-parameter in="path" name="filter" type="string" required="false" %} +Filter your results by cluster, namespace, controller and/or pod. +{% endswagger-parameter %} + +{% swagger-response status="200: OK" description="" %} +```json +{ + "code": 200, + "data": { + "numResults": 2, + "totalSavings": 30.0, + "actions": [ + { + "action": { + "cluster": "standard-cluster-1", + "namespace": "turbo-server", + "controller": "db", + "pod": "db-ffbdfb97b-aroxf", + "originNode": "gke-standard-cluster-1-pool-1-b4a02c44-1001", + "destinationNode": "gke-standard-cluster-1-pool-2-91dc432d-1002", + "available": true, + "targetId": "11111111111111" + }, + "currentMonthlyRate": 27.90, + "predictedMonthlyRate": 0, + "predictedSavings": 27.90 + }, + { + "action": { + "cluster": "standard-cluster-1", + "namespace": "infra-kubecost", + "controller": "infra-kubecost-cost-analyzer", + "pod": "infra-kubecost-cost-analyzer-566b488b69-1001a", + "originNode": "gke-standard-cluster-1-pool-2-91dc432d-1002", + "destinationNode": "gke-standard-cluster-1-pool-3-57364626-1003", + "available": true, + "targetId": "11111111111112" + }, + "currentMonthlyRate": 2.10, + "predictedMonthlyRate": 0, + "predictedSavings": 2.10 + } + ] + } +} +``` +{% endswagger-response %} +{% endswagger %} \ No newline at end of file diff --git a/images/gpu-savings-optimize-dashboard.png b/images/gpu-savings-optimize-dashboard.png new file mode 100644 index 000000000..39c57ff5b Binary files /dev/null and b/images/gpu-savings-optimize-dashboard.png differ diff --git a/images/gpu-savings-optimize-modal.png b/images/gpu-savings-optimize-modal.png new file mode 100644 index 000000000..3b928f0e5 Binary files /dev/null and b/images/gpu-savings-optimize-modal.png differ diff --git a/images/savings-turbo-actions-mcp.png b/images/savings-turbo-actions-mcp.png new file mode 100644 index 000000000..2bb0bfb10 Binary files /dev/null and b/images/savings-turbo-actions-mcp.png differ diff --git 
a/images/savings-turbo-actions-rwc.png b/images/savings-turbo-actions-rwc.png new file mode 100644 index 000000000..3d6eba3ca Binary files /dev/null and b/images/savings-turbo-actions-rwc.png differ diff --git a/images/savings-turbo-actions-scp.png b/images/savings-turbo-actions-scp.png new file mode 100644 index 000000000..5ccdef77d Binary files /dev/null and b/images/savings-turbo-actions-scp.png differ diff --git a/images/savings-turbo-actions-svm.png b/images/savings-turbo-actions-svm.png new file mode 100644 index 000000000..a38778981 Binary files /dev/null and b/images/savings-turbo-actions-svm.png differ diff --git a/images/savings-turbo-actions.png b/images/savings-turbo-actions.png new file mode 100644 index 000000000..91aa1c2d4 Binary files /dev/null and b/images/savings-turbo-actions.png differ diff --git a/install-and-configure/advanced-configuration/gpu.md b/install-and-configure/advanced-configuration/gpu.md index 8691de5fc..292ca8c20 100644 --- a/install-and-configure/advanced-configuration/gpu.md +++ b/install-and-configure/advanced-configuration/gpu.md @@ -348,3 +348,38 @@ kubectl -n kubecost port-forward svc/kubecost-prometheus-server 8080:80 Open the Prometheus web interface in your browser by navigating to `http://localhost:8080`. In the search box, begin typing the prefix for a metric, for example `DCGM_FI_DEV_POWER_USAGE`. Click Execute to view the returned query and verify that there is data present. An example is shown below. ![Prometheus query showing DCGM Exporter metric](/images/gpu-prometheus-query.png) + +## Shared GPU Support + +Kubecost supports NVIDIA GPU sharing using either the CUDA [time-slicing](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html) or [Multi-Process Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) methods. MIG is currently unsupported but is being evaluated for a future release. 
When employing either time-slicing or MPS, you must use the `renameByDefault=true` option in the [NVIDIA device plugin's](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#shared-access-to-gpus) configuration stanza. This parameter instructs the device plugin to advertise the resource `nvidia.com/gpu.shared` on nodes where GPU sharing is enabled. Without this configuration option, the device plugin will instead advertise `nvidia.com/gpu` which will mean Kubecost is unable to disambiguate an "exclusive" GPU access request from a shared GPU access request. As a result, Kubecost's cost information will be inaccurate. + +{% hint style="warning" %} +Prior to enabling GPU sharing in your cluster, view the [Limitations](#limitations) section to determine if this is right for you. +{% endhint %} + +The following is an example of a time-slicing configuration which sets the `renameByDefault` parameter. + +```yaml +version: v1 +sharing: + timeSlicing: + renameByDefault: true + failRequestsGreaterThanOne: true + resources: + - name: nvidia.com/gpu + replicas: 4 +``` + +With this configuration saved and applied to nodes, they will begin to advertise the `nvidia.com/gpu.shared` device with a quantity equal to the replica count, defined in the configuration, multiplied by the number of physical GPUs inside the node. For example, a node with four (4) physical NVIDIA GPUs which uses this configuration will advertise sixteen (16) shared GPU devices. + +```sh +$ kubectl describe node mynodename +... +Capacity: + nvidia.com/gpu.shared: 16 +... +``` + +### Limitations + +There are limitations of which to be aware when using NVIDIA GPU sharing with either time-slicing or MPS. Because [NVIDIA does not support](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html#limitations) providing utilization metrics via DCGM Exporter for containers using shared GPUs, Kubecost will display a GPU cost of zero for these workloads. 
However, the [GPU Savings Optimization](/using-kubecost/navigating-the-kubecost-ui/savings/gpu-optimization.md) card (Kubecost Enterprise) will be able to indicate in the utilization table which containers are configured for GPU sharing, providing some visibility. diff --git a/integrations/turbonomic-integration.md b/integrations/turbonomic-integration.md new file mode 100644 index 000000000..f9e4a4deb --- /dev/null +++ b/integrations/turbonomic-integration.md @@ -0,0 +1,56 @@ +# Kubecost Turbonomic Integration + +{% hint style="info" %} +This integration is currently in beta. Please read the documentation carefully. +{% endhint %} + +The Turbonomic Integration feature enables users to obtain supplemental cost information on actions recommended by Turbonomic. This integration is required to display the [Turbonomic Actions Savings Cards](../using-kubecost/navigating-the-kubecost-ui/savings/turbonomic-actions.md). + +## Usage + +Prerequisites: + +- A running Turbonomic client + +Kubecost requires network access to your Turbonomic installation via an OAuth 2.0 client. The OAuth client must have the following settings: +- Role: `ADVISOR` +- ClientAuthenticationMethods: `client_secret_post` + +Please see the [IBM Turbonomic documentation](https://www.ibm.com/docs/en/tarm/8.14.3?topic=cookbook-authenticating-oauth-20-clients-api#cookbook_administration_oauth_authentication__title__4) for instructions on how to create an OAuth 2.0 client. + +### Step 1: Configure Helm values + +The YAML below is an example of how to configure the Turbonomic integration in your Helm values file. + +```yaml +global: + integrations: + turbonomic: + enabled: true + clientId: "" # REQUIRED. OAuth 2.0 client ID + clientSecret: "" # REQUIRED. OAuth 2.0 client secret + role: "ADVISOR" # REQUIRED. OAuth 2.0 client role + host: "" # REQUIRED. URL to the Turbonomic API (e.g. "https://turbonomic.example.com") + insecureClient: false # Whether to verify certificate or not. Default false. 
+``` + +### Step 2: Apply and validate your changes + +If deploying changes via Helm, you will be able to run a command similar to: + +```sh +helm upgrade -i kubecost cost-analyzer \ + --repo https://kubecost.github.io/cost-analyzer/ \ + --namespace kubecost \ + -f values.yaml +``` + +Once you've applied your changes, validate that the integration is successful by checking the Aggregator pod logs. You should see logs similar to the following: + +```sh +kubectl logs statefulset/kubecost-aggregator -n kubecost | grep -i "Turbonomic" +``` + +```txt +DBG Turbonomic: Ingestor: completed run with 32 turbonomic actions ingested +``` \ No newline at end of file diff --git a/using-kubecost/navigating-the-kubecost-ui/savings/gpu-optimization.md b/using-kubecost/navigating-the-kubecost-ui/savings/gpu-optimization.md new file mode 100644 index 000000000..cd48d7d2d --- /dev/null +++ b/using-kubecost/navigating-the-kubecost-ui/savings/gpu-optimization.md @@ -0,0 +1,44 @@ +# GPU Optimization + +The GPU optimization page, a Kubecost Enterprise feature, shows you details on your workloads (containers and their relatives) which are using GPUs and proactively identifies ways in which you can save money on them. Kubecost collects and processes [GPU utilization metrics](/install-and-configure/advanced-configuration/gpu.md) to power the contents of this page. The page is broken down into two main sections: a workload utilization table and recommendation cards. + +{% hint style="info" %} +If the GPU Optimization savings card appears to be greyed out, click the meatballs menu in the upper right and select "Unarchive". +{% endhint %} + +![GPU Optimization dashboard](/images/gpu-savings-optimize-dashboard.png) + +## Utilization Table + +The utilization table displays the GPU-related workloads in your Kubecost environment and provides many details which can be helpful to understand what is going on. 
Unlike other pages in Kubecost which display all workloads, the utilization table on this page is constrained to only workloads which are requesting some amount of GPU. It is not an extraction of the Allocation page, for example. Aggregations which do not feature a GPU in some way will be intentionally absent from this table. For example, if your Kubecost estate has three (3) clusters but only one (1) of them has GPUs, only the cluster with GPUs will display content on this table. + +Depending on the aggregation, there will be information presented specific to that aggregation that may not be found on others. For example, aggregating by cluster shows the number of nodes containing at least one GPU as well as the number of containers requesting at least one GPU during the given time window. The container and pod aggregations show, among other columns, the node on which the workload ran or is running, whether it is using [shared GPUs](/install-and-configure/advanced-configuration/gpu.md#shared-gpu-support), and its average and max utilization of those GPUs. + +A utilization threshold slider is provided at the top of this table allowing you to constrain the returned results to a value of the GPU utilization, either maximum or average, up to and inclusive of that number. This is to allow easier identification of GPU-related workloads across your estate. For example, suppose you wish to view workloads with a maximum GPU utilization of up to 80%. Set the slider to 80% and Kubecost filters from view any workloads above this number. + +## Recommendations + +The bottom half of the page presents recommendations on where and how to save money on GPU workloads. Depending on the time window defined at the top of the page, Kubecost locates and displays one card per container where it has identified a possible savings opportunity. Each recommendation is presented as a separate card. 
+ +Kubecost provides proactive recommendations on how to save money on GPU workloads in three different categories: Optimize, Remove, and Share. + +- **Optimize**: Containers which request more than one GPU but are not using at least one of those GPUs will trigger the Optimize recommendation. In this card, Kubecost shows the container which can be optimized by reconfiguring it to remove the number of unused GPUs observed during the time window selected. This can be useful, for example, in cases where the application in the container was either not written to make use of multiple GPUs or where use of multiple GPUs is not achieved due to the nature of the workload. The possible savings displayed on this tile is the cost of only the unused GPUs over the course of a month. +- **Remove**: Containers which request a single GPU but are found to not use it are flagged for removal. In this card, Kubecost shows the container which can be removed from the cluster thereby freeing up its GPU. You may see this card if, for example, a workload has been created which requests a GPU but never uses it due to a misconfiguration, or where a workload did use a GPU for a period of time but that use has ended yet the container continues to run. Whatever your case, containers which request but do not use a GPU make it such that other workloads such as pending jobs cannot be scheduled due to "GPU squatting." The possible savings displayed on this tile is the cost of removing this container entirely from the cluster over the course of a month. +- **Share**: Containers which request a single GPU but are using somewhere between zero and 100% are identified as candidates for GPU sharing. In this card, Kubecost shows the container which is not fully utilizing a GPU and can potentially request access to a shared GPU instead. 
GPU sharing is a technique whereby multiple containers, each of which needs some GPU resources, all execute concurrently on a single GPU, thereby potentially reducing costs by requiring fewer total GPUs. See the section on GPU sharing [here](/install-and-configure/advanced-configuration/gpu.md#shared-gpu-support) for more details on how or if this is right for you. Because reconfiguring a workload to request access to a shared GPU is highly variable and depends on many factors, Kubecost does not show a possible savings number associated with this recommendation type. This does not mean, however, that no savings are likely to result from configuring your cluster and appropriate workloads for GPU sharing. + +Clicking on a recommendation tile displays a window with further details on the recommendation. These details are designed to help you identify exactly which workload Kubecost has flagged and why the recommendation was made, with the goal of helping you gain confidence in its accuracy. The window contains a utilization graph over the selected time window, details on the container and its location in the cluster, and an explanation with more details on the recommendation. + +![GPU Optimization savings modal](/images/gpu-savings-optimize-modal.png) + +## Known Limitations + +In the first version of the GPU Optimization Savings Insights card there are a few limitations of which to be aware. + +- Multiple containers with the same name and running on the same cluster, node, and namespace combination (i.e., "identical" containers) might result in the following effects: + - The savings number provided on Optimize and Remove cards may be an implicit sum of the total cost of these containers. + - Recommendations will only be provided for one of them. + - The utilization table may not show these identical containers. 
+- A GPU node must be running or have run at least one container utilizing a GPU for it to be represented on the utilization table in either the Cluster aggregation’s GPU nodes column or on the Node aggregation. +- Optimize may not be as accurate as possible in certain cases since Kubecost currently infers utilization about all GPUs from a single averaged utilization number. +- For upgrades from prior versions to 2.5.0, there may be cases where Max. GPU Utilization could be a smaller percentage than Avg. GPU Utilization. This will self-correct once the chosen window size is smaller than the time the 2.5.0 instance has been collecting the new maximum GPU utilization metric. +- The GPU Optimization card on the Savings Insights screen may initially appear greyed out. Click the meatballs icon in the upper right and choose "Unarchive" to make the card appear as the others. diff --git a/using-kubecost/navigating-the-kubecost-ui/savings/savings.md b/using-kubecost/navigating-the-kubecost-ui/savings/savings.md index f43d45c7e..8c10a6e35 100644 --- a/using-kubecost/navigating-the-kubecost-ui/savings/savings.md +++ b/using-kubecost/navigating-the-kubecost-ui/savings/savings.md @@ -20,12 +20,18 @@ The monthly savings values on this page are precomputed every hour for performan * [Manage underutilized nodes](underutilized-nodes.md) * [Right-size your persistent volumes](pv-right-sizing-rec.md) -### Cloud insights: +### Cloud insights * Reserve instances * [Manage orphaned resources](orphaned-resources.md) * [Spot Instances](spot-checklist.md) +### Turbonomic Actions insights +* [Resize Workload Controllers](turbonomic-actions.md) +* [Suspend Container Pods](turbonomic-actions.md) +* [Suspend Virtual Machines](turbonomic-actions.md) +* [Move Container Pods](turbonomic-actions.md) + ## Archiving Savings insights You can archive individual Savings insights if you feel they are not helpful, or you cannot perform those functions within your organization or team. 
Archived Savings insights will not add to your estimated monthly savings available. diff --git a/using-kubecost/navigating-the-kubecost-ui/savings/turbonomic-actions.md b/using-kubecost/navigating-the-kubecost-ui/savings/turbonomic-actions.md new file mode 100644 index 000000000..b215d9fd7 --- /dev/null +++ b/using-kubecost/navigating-the-kubecost-ui/savings/turbonomic-actions.md @@ -0,0 +1,40 @@ +# Turbonomic Actions + +{% hint style="warning" %} +This feature is in beta. Please read the documentation carefully. +{% endhint %} + +The [IBM Turbonomic Action Center](https://www.ibm.com/docs/en/tarm/8.14.3?topic=reference-turbonomic-actions) offers multiple types of actions designed to improve the overall performance of your cluster(s). The integration between Kubecost and Turbonomic allows you to view the estimated savings achieved by executing these actions. + +## Prerequisites +To be able to see the savings cards, you must first enable the [Turbonomic Integration](../../../integrations/turbonomic-integration.md). This is required for Kubecost to be able to pull action data from your Turbonomic client. + +## Actions +![Savings cards: Turbonomic Actions](../../../images/savings-turbo-actions.png) + +### Resize Workload Controllers +![Resize Workload Controllers](../../../images/savings-turbo-actions-rwc.png) + +The Resize Workload Controllers page shows workloads which would benefit from changes to their resource requests, as recommended by Turbonomic. +The Current and Predicted cost columns are calculated using the [Spec Cost Prediction API](../../../apis/governance-apis/spec-cost-prediction-api.md): the Current column is calculated by inferring the CPU and/or memory requests on all containers in the workload, while the Predicted column is calculated by using the new request values recommended by Turbonomic for each container. Please note that at the moment, this functionality is only available for Deployments and StatefulSets.
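
As a rough illustration of how the Current and Predicted figures relate to the savings number, the sketch below parses a payload shaped like the Resize Workload Controllers response example and recomputes the savings per controller. It assumes that savings equal the current monthly rate minus the predicted monthly rate, which matches the sample values but is not confirmed as Kubecost's internal calculation. The helper name `summarize_savings` is hypothetical.

```python
import json

# Sample payload trimmed from the Resize Workload Controllers response example.
# Only the fields used below are kept.
SAMPLE_RESPONSE = """
{
  "code": 200,
  "data": {
    "numResults": 1,
    "totalSavings": 2.00,
    "actions": [
      {
        "action": {
          "cluster": "standard-cluster-1",
          "namespace": "kubecost",
          "controller": "kubecost-cost-analyzer"
        },
        "currentMonthlyRate": 4.00,
        "predictedMonthlyRate": 2.00,
        "predictedSavings": 2.00
      }
    ]
  }
}
"""


def summarize_savings(payload: dict) -> list:
    """Return (controller, savings) pairs.

    Assumption: savings = currentMonthlyRate - predictedMonthlyRate.
    """
    rows = []
    for item in payload["data"]["actions"]:
        savings = item["currentMonthlyRate"] - item["predictedMonthlyRate"]
        rows.append((item["action"]["controller"], round(savings, 2)))
    return rows


payload = json.loads(SAMPLE_RESPONSE)
rows = summarize_savings(payload)
total = round(sum(savings for _, savings in rows), 2)

for controller, savings in rows:
    print(f"{controller}: {savings:.2f} per month")
print(f"total: {total:.2f} per month")
```

Comparing the recomputed total against `totalSavings` in the response is a quick sanity check when scripting against these endpoints.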
+ +### Suspend Container Pods +![Suspend Container Pods](../../../images/savings-turbo-actions-scp.png) + +The Suspend Container Pods page shows pods that Turbonomic recommends to be suspended. +The Current cost column represents the monthly rate for the Pod in question, queried over a period of `7d offset 48h` to account for reconciliation. +The Predicted cost column is zero for suspension actions. + +### Suspend Virtual Machines +![Suspend Virtual Machines](../../../images/savings-turbo-actions-svm.png) + +The Suspend Virtual Machines page shows virtual machines (nodes) that Turbonomic recommends to be suspended. +The Current cost column represents the monthly rate for the node in question, queried over a period of `7d offset 48h` to account for reconciliation. +The Predicted cost column is zero for suspension actions. + +### Move Container Pods +![Move Container Pods](../../../images/savings-turbo-actions-mcp.png) + +The Move Container Pods page shows pods that Turbonomic recommends to be moved from one node to another. +The Current cost column represents the monthly rate for the destination node, queried over a period of `7d offset 48h` to account for reconciliation. +The Efficiency column contains a hyperlink to the Efficiency page for the destination node, highlighting the infrastructure idle corresponding to it.
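
The `7d offset 48h` period used by the suspend and move cards can be sketched as follows. This is a minimal illustration that assumes the expression means a seven-day range ending 48 hours before the current time, and that a monthly rate is extrapolated using 730 hours per month. Both are assumptions made for this example, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a month is treated as 730 hours when extrapolating a rate.
HOURS_PER_MONTH = 730


def query_window(now, duration=timedelta(days=7), offset=timedelta(hours=48)):
    """Return (start, end) of a window of `duration` that ends `offset` before `now`."""
    end = now - offset
    start = end - duration
    return start, end


def monthly_rate(total_cost, start, end):
    """Extrapolate the cost observed over [start, end] to a monthly rate."""
    hours = (end - start).total_seconds() / 3600
    return total_cost / hours * HOURS_PER_MONTH


# Fixed timestamp so the example is reproducible.
now = datetime(2024, 11, 10, 0, 0, tzinfo=timezone.utc)
start, end = query_window(now)
print(start.isoformat(), "to", end.isoformat())

# Illustrative cost of 1.68 observed over the 168-hour window.
rate = monthly_rate(1.68, start, end)
print(f"monthly rate: {rate:.2f}")
```

Shifting the window back by 48 hours means the most recent two days, which may not yet be reconciled, are excluded from the rate.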