-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add platform detection logic from nvidia-container-toolkit #28
Conversation
pkg/nvlib/info/info.go
Outdated
// UsesNVGPUModule checks whether the nvgpu module is used. | ||
// We use the device name to signal this, since devices that use the nvgpu module have their device | ||
// names as: | ||
// | ||
// GPU 0: Orin (nvgpu) (UUID: 54d0709b-558d-5a59-9c65-0c5fc14a21a4) | ||
// | ||
// This function returns true if ALL devices use the nvgpu module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is the nvgpu
module? Is this the name of the kernel module for iGPUs? Maybe mention that here somehwere if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is the kernel module for igpus. I can update the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this in the latest. Please take another look.
1d59c95
to
0cfc1ac
Compare
2202e44
to
9bbfd8e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good. I left one non-blocking question concerning the behavior when platform detection fails.
pkg/nvlib/info/resolver.go
Outdated
if (isTegra && !hasNVML) || usesNVGPUModule { | ||
return "csv" | ||
} | ||
return "nvml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of scope for this PR - would it be better to return ""
if an unknown platform is detected? I understand our components currently default to using "nvml" mode if platform detection fails, but would it be better for each client to decide how they want to handle this case where platform detection fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, now that I think about it, we definitely should not be returning nvml
by default here. For example, while reviewing NVIDIA/k8s-device-plugin#673, I notice we fail early when platform != info.PlatformNVML && platform != info.PlatformTegra
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. This was initially a mode
check, but I reworked it to be a platform
check intead. I will add a PlatformUnknown
return value instead.
9bbfd8e
to
428de2b
Compare
@@ -28,7 +29,7 @@ build: | |||
GOOS=$(GOOS) go build ./... | |||
|
|||
all: check build binary | |||
check: assert-fmt lint vet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't necessarily have an opinion here, just curious what made you decide to remove assert-fmt
and vet
from the checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all handled by golangci-lint
(could have been a different PR though).
"github.com/NVIDIA/go-nvml/pkg/dl" | ||
) | ||
|
||
type root string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this rootPath?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for a more generic name than *Path
is that this type is intended to be used to extract more information about the driver root. There are multiple implementations of it in various projects and I need to consolidate them at some point soon.
The idea is that this struct will at least be able to:
- Return the path to the driver libraries at this root.
- Return the path to the device nodes at this path (or
/
)
I don't know whether the Path
suffix conveys this.
I'm happy to make the change here now if you feel strongly about it since we don't have the other functionality consolidated yet.
pkg/nvlib/info/builder.go
Outdated
type builder struct { | ||
root root | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this options
since its the place to enumerate / gather the options passed in?
pkg/nvlib/info/builder.go
Outdated
func (b *builder) build() Interface { | ||
return &infolib{ | ||
root: root(b.root), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this extra function necessary? Do we call it anywhere other than in in New()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, your're right. The initial implementation was structured a bit differently, and I did not change this when I simplified things. Let me rework it.
pkg/nvlib/info/api.go
Outdated
@@ -23,9 +23,10 @@ type Interface interface { | |||
|
|||
// Properties provides a set of functions to query capabilities of the system. | |||
// | |||
//go:generate moq -stub -out properties_mock.go . Properties | |||
//go:generate moq -rm -stub -out properties_mock.go . Properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does the -rm
do? Remove before regenerating?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I added it here, because I always get errors and have to manually remove the file before generating the mocks.
pkg/nvlib/info/builder.go
Outdated
@@ -29,11 +37,21 @@ func New(opts ...Option) Interface { | |||
if b.root == "" { | |||
b.root = "/" | |||
} | |||
if b.nvmllib == nil { | |||
b.nvmllib = nvml.New( | |||
nvml.WithLibraryPath(b.root.tryResolveLibrary("libnvidia-ml.so.1")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have a constant for "libnvidia-ml.so.1"? Does it actually make sense to define the constant in go-nvml itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have:
defaultNvmlLibraryName = "libnvidia-ml.so.1"
in go-nvml. I suppose we can make it public.
pkg/nvlib/info/info.go
Outdated
// GPU 0: Orin (nvgpu) (UUID: 54d0709b-558d-5a59-9c65-0c5fc14a21a4) | ||
// | ||
// This function returns true if ALL devices use the nvgpu module. | ||
func (i *infolib) UsesNVGPUModule() (uses bool, reason string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call it OnlyUsesNVGPUModule
(or something else to indicate that it only returns true if all. GPUs are managed by this module).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's fair. Let me update.
pkg/nvlib/info/info.go
Outdated
// We ensure that this function never panics | ||
defer func() { | ||
if err := recover(); err != nil { | ||
uses = false | ||
reason = fmt.Sprintf("panic: %v", err) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What might trigger a panic here to necessitate this (and not necessitate it in the other calls)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to possibly panic in nvml.Shutdown
. This is no longer the case and as such it can probably be removed.
pkg/nvlib/info/info.go
Outdated
// | ||
// Devices that use the nvgpu module have their device names as: | ||
// | ||
// GPU 0: Orin (nvgpu) (UUID: 54d0709b-558d-5a59-9c65-0c5fc14a21a4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this naming likely to be stable? I guess we can keep adding cases to this function to adapt as things change, but it still feel "brittle" to rely solely on the name like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's definitely brittle. If you recall we were pushing for a more stable mechanism to check this when nvml
support was added to Orin-based systems. This was the suggested workaround. We should definitely revisit it.
pkg/nvlib/info/api.go
Outdated
// Resolver defines a function to resolve a mode. | ||
type Resolver interface { | ||
ResolvePlatform() Platform | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolve a mode
or resolve a platform
?
Also, as a caller, info.Resolver()
is not a very user-friendly name. maybe the more verbose PlatformResolver
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This was initially copying the concept of a mode
from the NVIDIA container toolkit, but it isn't flexible enought and needs to be a lower level to also support the device plugin use cases. The comment needs to be updated and I can update the type too.
pkg/nvlib/info/info.go
Outdated
root root | ||
nvmllib nvml.Interface | ||
devicelib device.Interface | ||
} | ||
|
||
var _ Properties = &info{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm starting to get lost in all the differences in naming and levels of indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a fair point. I made another pass at this that I think is a lot cleaner.
I now call the interface PropertyExtractor
and implement the Has*
or Uses*
functions there. The info
type has also been explicitly renamed to propertyExtractor
.
pkg/nvlib/info/info.go
Outdated
type infolib struct { | ||
logger basicLogger | ||
Properties | ||
Resolver | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So an infolib now IS a "Properties" and IS a "Resolver", and HAS a logger. Is there a better name for the Properties interface? I think that is where I am getting confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose Properties
because the interface implements a set of bool, string
functions that return information about the system. These system properties are then used to make decissions -- such as determining the platform.
I'm happy to discuss alternative names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The top-level logger is not needed and has been removed. I have also renamed Properties
to ProperyExtractor
in the lates revision.
Properties | ||
} | ||
|
||
func (p platformResolver) ResolvePlatform() Platform { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This abstraction seems nice and straightforward. Where will it be consumed exactly? Is it meant as a replacement for direct calls to HasNvml()
IsTegraSystem()
etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For most cases, yes, the ResolvePlatform()
call will replace the individual calls and replicating the conditional logic.
I have already updated the nvidia-container-toolkit
and device-plugin
to make use of this functionality:
- https://github.com/NVIDIA/nvidia-container-toolkit/pull/428/files#diff-61ad274e8f1d9eb6d3aa8c961e3bf217f50363406924b4b4b1e40f049a5bb390R49
- https://github.com/NVIDIA/nvidia-container-toolkit/pull/428/files#diff-e43f2243963e5310d841effc7b372f8ecb0f27b06a0a9fd374fd3196101d2410R195
- https://github.com/NVIDIA/k8s-device-plugin/pull/673/files#diff-a9bc30e8fa1d6bf81ad74b60218fd217abfb27090d4f72e187e7f2a19ee4e82fR100
- https://github.com/NVIDIA/k8s-device-plugin/pull/673/files#diff-66f1319e3c95c17402a1547b5ae54f94e114de92111f21648ebe1fd635d6880dR45
Note that the "Properties" themselves may still be required for edge cases, but we could probably make them internal at somet point.
659c752
to
5c049b1
Compare
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds a PropertyExtractor interface to encapsulate functions that query a system for certain capabilities. The IsTegraSystem has been renamed to HasTegraFiles function and marked as Deprecated. Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
A new API was added to github.com/NVIDIA/go-nvlib 0.4.0 to specify a logger to be used by a info.Interface [1]. Commit 649d02f already bumped the required go-nvlib version to 0.6.0, so take advantage of that. [1] github.com/NVIDIA/go-nvlib commit 21c8f035ca66b29d NVIDIA/go-nvlib@21c8f035ca66b29d NVIDIA/go-nvlib#28 containers#1541
A new API was added to github.com/NVIDIA/go-nvlib 0.4.0 to specify a logger to be used by a info.Interface [1]. Commit 649d02f already bumped the required go-nvlib version to 0.6.0, so take advantage of that. [1] github.com/NVIDIA/go-nvlib commit 21c8f035ca66b29d NVIDIA/go-nvlib@21c8f035ca66b29d NVIDIA/go-nvlib#28 containers#1541
This ports the function from
nvidia-container-toolkit
togo-nvlib
for more consistent reuse.See:
which incorporate these changes into the toolkit and device plugin.