feat(nodeResources): add GPU support #1708

Merged: 4 commits, Jan 3, 2025

@DexterYan (Member) commented Dec 19, 2024

Description, Motivation and Context

  • Add resourceName and resourceAllocatable filters to the nodeResources analyzer so that nodes can be filtered by GPU resource
  • Add tests

ADR doc: https://docs.google.com/document/d/1LXuhzjzSsvuoOo4CnUXeq9SqfMYVPA0GpTNJj6PcX3g/edit?tab=t.0
sc-106618

Demo Yaml

apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - clusterResources: {}
  analyzers:
    - nodeResources:
        filters:
          resourceName: nvidia.com/gpu
        checkName: Must have at least 1 GPU-enabled node in the cluster
        outcomes:
          - pass:
              when: "count() >= 1"
              message: "This application requires at least 1 GPU-enabled node"
---
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - clusterResources: {}
  analyzers:
    - nodeResources:
        filters:
          resourceName: nvidia.com/gpu
        checkName: Must have at least 1 GPU-enabled node in the cluster
        outcomes:
          - pass:
              when: "min(resourceAllocatable) = 1"
              message: "This application requires at least 1 GPU-enabled node"
[Screenshot, 2024-12-20 11:33:07 AM]
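
As a rough illustration of how the two `when` expressions in the demo specs could be read, here is a minimal Go sketch. It is not the analyzer's actual implementation: the `node` type, `filterByResource`, and `minAllocatable` are hypothetical stand-ins, quantities are simplified to integers, and the sketch assumes the `resourceName` filter keeps only nodes that advertise that resource.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a Kubernetes node: it maps
// a resource name to its allocatable quantity (integers for simplicity).
type node struct {
	name        string
	allocatable map[string]int64
}

// filterByResource keeps only the nodes that advertise resourceName.
// Assumption: this mirrors what the resourceName filter is meant to do.
func filterByResource(nodes []node, resourceName string) []node {
	var matched []node
	for _, n := range nodes {
		if _, ok := n.allocatable[resourceName]; ok {
			matched = append(matched, n)
		}
	}
	return matched
}

// minAllocatable returns the smallest allocatable quantity of resourceName
// across nodes, and false when there are no nodes to compare.
func minAllocatable(nodes []node, resourceName string) (int64, bool) {
	if len(nodes) == 0 {
		return 0, false
	}
	lowest := nodes[0].allocatable[resourceName]
	for _, n := range nodes[1:] {
		if q := n.allocatable[resourceName]; q < lowest {
			lowest = q
		}
	}
	return lowest, true
}

func main() {
	nodes := []node{
		{name: "cpu-only", allocatable: map[string]int64{"cpu": 8}},
		{name: "gpu-a", allocatable: map[string]int64{"cpu": 8, "nvidia.com/gpu": 1}},
		{name: "gpu-b", allocatable: map[string]int64{"cpu": 16, "nvidia.com/gpu": 4}},
	}
	gpuNodes := filterByResource(nodes, "nvidia.com/gpu")
	lowest, _ := minAllocatable(gpuNodes, "nvidia.com/gpu")

	fmt.Println("count() >= 1:", len(gpuNodes) >= 1)
	fmt.Println("min(resourceAllocatable) = 1:", lowest == 1)
}
```

Note that under this reading the second expression only passes when the least-equipped GPU node has exactly one allocatable GPU.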

Fixes: #1162

Checklist

  • New and existing tests pass locally with introduced changes.
  • Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

@DexterYan DexterYan added the type::feature New feature or request label Dec 19, 2024
@DexterYan DexterYan requested a review from a team as a code owner December 19, 2024 05:26
@@ -417,6 +417,10 @@ spec:
type: string
podCapacity:
type: string
resourceAllocatable:
Review comment (Member):

resourceCapacity is missing

@@ -382,6 +394,26 @@ func nodeMatchesFilters(node corev1.Node, filters *troubleshootv1beta2.NodeResou
		return true, nil
	}

	if filters.ResourceName != "" {
		if filters.ResourceAllocatable != "" {
Review comment (Member):

- nodeResources:
   filters:
     resourceName: gpu.intel.com/i915

With the spec above, a node that does not expose gpu.intel.com/i915 is still matched by this code and counted as a node with an Intel GPU.

We need to add a check here that node.Status.Allocatable["gpu.intel.com/i915"] or node.Status.Capacity["gpu.intel.com/i915"] exists.
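
The check requested above could look roughly like the following sketch. It is an illustration only, not the code that was merged: `nodeStatus` is a hypothetical stand-in for `corev1.NodeStatus` (which stores `resource.Quantity` values rather than strings), and `nodeHasResource` is a made-up helper name.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, simplified stand-in for corev1.NodeStatus.
// Quantities are plain strings here to keep the example dependency-free.
type nodeStatus struct {
	Allocatable map[string]string
	Capacity    map[string]string
}

// nodeHasResource reports whether the node advertises resourceName in either
// its allocatable or its capacity list.
func nodeHasResource(status nodeStatus, resourceName string) bool {
	if _, ok := status.Allocatable[resourceName]; ok {
		return true
	}
	_, ok := status.Capacity[resourceName]
	return ok
}

func main() {
	cpuOnly := nodeStatus{
		Allocatable: map[string]string{"cpu": "8"},
		Capacity:    map[string]string{"cpu": "8"},
	}
	intelGPU := nodeStatus{
		Allocatable: map[string]string{"cpu": "8", "gpu.intel.com/i915": "1"},
		Capacity:    map[string]string{"cpu": "8", "gpu.intel.com/i915": "1"},
	}

	// A node without the resource should not match the filter.
	fmt.Println("cpu-only matches:", nodeHasResource(cpuOnly, "gpu.intel.com/i915"))
	fmt.Println("intel-gpu matches:", nodeHasResource(intelGPU, "gpu.intel.com/i915"))
}
```

Looking the key up with the two-value map form distinguishes "resource not present" from "resource present with a zero quantity", which a plain value comparison would not.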

Review comment (Member):

Please add a unit test for this use case

Comment on lines 196 to 198:
	if filters != nil && filters.ResourceName != "" {
		resourceName = filters.ResourceName
	}
Review comment (Member):

Suggested change
-	if filters != nil && filters.ResourceName != "" {
-		resourceName = filters.ResourceName
-	}
+	if filters != nil {
+		resourceName = filters.ResourceName
+	}
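
A small sketch of why the shorter form in this suggestion should behave the same. It rests on an assumption that is not visible in the diff: that `resourceName` holds the empty string before this block runs. The `filters` type and both function names are hypothetical.

```go
package main

import "fmt"

// filters is a hypothetical stand-in for the analyzer's filter struct.
type filters struct {
	ResourceName string
}

// original mirrors the code under review. Assumption: resourceName starts
// out as the empty string.
func original(f *filters) string {
	var resourceName string
	if f != nil && f.ResourceName != "" {
		resourceName = f.ResourceName
	}
	return resourceName
}

// suggested mirrors the proposed change: copying an empty ResourceName is
// harmless because resourceName is already empty.
func suggested(f *filters) string {
	var resourceName string
	if f != nil {
		resourceName = f.ResourceName
	}
	return resourceName
}

func main() {
	inputs := []*filters{nil, {}, {ResourceName: "nvidia.com/gpu"}}
	for _, f := range inputs {
		fmt.Println("same result:", original(f) == suggested(f))
	}
}
```

If `resourceName` had a non-empty default before this block, the two forms would differ for an empty `ResourceName`, so the equivalence holds only under the stated assumption.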

totalNodeCount: len(nodeData),
expected: true,
isError: false,
},
Review comment (Member):

nit: For completeness, I'd add a sum unit test as well
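
For illustration, a sum case could be exercised with a table of inputs along these lines. This is a sketch under assumptions, not the test that was added to the PR: `sumAllocatable` is a hypothetical helper, nodes are reduced to maps of integer quantities, and the loop in `main` stands in for a `testing`-based table test.

```go
package main

import "fmt"

// sumAllocatable is a hypothetical helper that adds up the allocatable
// quantity of resourceName over all nodes. Nodes that do not advertise the
// resource contribute zero.
func sumAllocatable(nodes []map[string]int64, resourceName string) int64 {
	var total int64
	for _, allocatable := range nodes {
		total += allocatable[resourceName]
	}
	return total
}

func main() {
	const gpu = "nvidia.com/gpu"

	// Table of cases in the style of a Go table-driven test.
	cases := []struct {
		name  string
		nodes []map[string]int64
		want  int64
	}{
		{name: "no nodes", nodes: nil, want: 0},
		{name: "cpu-only node", nodes: []map[string]int64{{"cpu": 8}}, want: 0},
		{name: "two gpu nodes", nodes: []map[string]int64{{gpu: 1}, {gpu: 4}}, want: 5},
	}

	for _, tc := range cases {
		got := sumAllocatable(tc.nodes, gpu)
		fmt.Printf("%s: got=%d want=%d ok=%t\n", tc.name, got, tc.want, got == tc.want)
	}
}
```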

@DexterYan (Member, Author):

Thank you @banjoh, those tests have been added and the code has been changed.

@DexterYan DexterYan requested a review from banjoh December 27, 2024 05:00
banjoh previously approved these changes Dec 31, 2024
Signed-off-by: Evans Mungai <evans@replicated.com>
@DexterYan DexterYan merged commit 64ee9e5 into main Jan 3, 2025
24 checks passed
@DexterYan DexterYan deleted the dx/sc-106618/add-gpu-support branch January 3, 2025 02:11
@banjoh banjoh mentioned this pull request Jan 13, 2025
Linked issue: Feature: add GPU capabilities to nodeResources analyzer
2 participants