Updating healthcheck for passthrough GPUs #105

Merged
merged 1 commit into from
May 29, 2024

visheshtanksale
Contributor

For a GPU configured for passthrough, the device plugin does not update the GPU count on the node when the GPU falls off the bus.

To reproduce, follow these steps:

Remove the GPU from the bus:
echo "1" > /sys/bus/pci/devices/<gpu_pci_id>/remove

Verify that the GPU is no longer visible from the host using lspci:
lspci -nnk -d 10de:

The number of GPUs exposed on the k8s node doesn't change.
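
The host-side verification step can also be done programmatically. The following is a minimal sketch, not code from this PR: it counts PCI devices whose sysfs `vendor` file reports the NVIDIA vendor ID `0x10de`, which is the same filter `lspci -d 10de:` applies. The function names `isNvidiaVendor` and `nvidiaDevices` are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// nvidiaVendorID is the PCI vendor ID that `lspci -d 10de:` filters on.
const nvidiaVendorID = "0x10de"

// isNvidiaVendor reports whether the contents of a sysfs "vendor" file
// identify an NVIDIA device. (Hypothetical helper for this sketch.)
func isNvidiaVendor(content string) bool {
	return strings.EqualFold(strings.TrimSpace(content), nvidiaVendorID)
}

// nvidiaDevices returns the PCI addresses under root whose vendor file
// matches the NVIDIA vendor ID. (Hypothetical helper for this sketch.)
func nvidiaDevices(root string) ([]string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(root, e.Name(), "vendor"))
		if err != nil {
			continue // the device vanished or the entry has no vendor file
		}
		if isNvidiaVendor(string(data)) {
			addrs = append(addrs, e.Name())
		}
	}
	return addrs, nil
}

func main() {
	addrs, err := nvidiaDevices("/sys/bus/pci/devices")
	if err != nil {
		fmt.Println("cannot read PCI devices:", err)
		return
	}
	fmt.Printf("%d NVIDIA PCI device(s) visible on the host: %v\n", len(addrs), addrs)
}
```

Running this before and after the `remove` write should show the device count drop by the number of functions the removed GPU exposed, while the count reported on the k8s node stays the same.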

Watching the IOMMU groups under /dev/vfio generates an fsnotify event when the GPU falls off the bus.
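
To illustrate the idea of detecting a vanished IOMMU group, here is a minimal sketch that is not the code in this PR. The PR relies on fsnotify events; to stay dependency-free, this sketch swaps in a simpler technique and compares two directory snapshots of /dev/vfio instead. The function names `listGroups` and `removedGroups` are hypothetical, and the sketch assumes that every entry other than the `vfio` control node is a group node.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// listGroups returns the sorted entry names in dir, skipping the "vfio"
// control node. Assumption: every other entry under /dev/vfio is a group node.
func listGroups(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var groups []string
	for _, e := range entries {
		if e.Name() == "vfio" {
			continue
		}
		groups = append(groups, e.Name())
	}
	sort.Strings(groups)
	return groups, nil
}

// removedGroups returns the groups present in before but missing from after.
func removedGroups(before, after []string) []string {
	seen := make(map[string]bool, len(after))
	for _, g := range after {
		seen[g] = true
	}
	var gone []string
	for _, g := range before {
		if !seen[g] {
			gone = append(gone, g)
		}
	}
	return gone
}

func main() {
	const vfioDir = "/dev/vfio"
	before, err := listGroups(vfioDir)
	if err != nil {
		fmt.Println("cannot read", vfioDir+":", err)
		return
	}
	fmt.Println("IOMMU groups currently exposed:", before)

	// A real health check would take the second snapshot later, or react to
	// a filesystem event, and mark devices in the removed groups unhealthy.
	after, err := listGroups(vfioDir)
	if err != nil {
		fmt.Println("cannot read", vfioDir+":", err)
		return
	}
	fmt.Println("groups removed since the first snapshot:", removedGroups(before, after))
}
```

An event-driven watcher avoids the polling interval that a snapshot comparison needs, which is presumably why the change reacts to filesystem notifications rather than rescanning the directory.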

Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
@visheshtanksale visheshtanksale marked this pull request as draft May 24, 2024 18:50
@visheshtanksale visheshtanksale marked this pull request as ready for review May 24, 2024 18:50
@visheshtanksale
Contributor Author

@rthallisey Please let me know how it looks

Contributor

@shivamerla shivamerla left a comment

lgtm

@rthallisey
Collaborator

Looks fine. Thanks

3 participants