
Add synchronization between the MPS control daemon and the device plugin #606

Draft
elezar wants to merge 9 commits into base: main
Conversation

@elezar (Member) commented Mar 19, 2024

This change adds an init-container to the device plugin to ensure that the MPS daemon with the correct config has been started before the device plugin starts.

This init container logs:

I0319 13:48:40.749279      28 wait.go:76] Starting OS watcher.
E0319 13:48:40.749531      28 wait.go:87] "MPS is not ready; retrying ..." err="mps manager is not ready: failed to load .ready config: failed to open .ready file: open /mps/.ready: no such file or directory"
I0319 13:49:10.871775      28 wait.go:84] MPS is ready

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar self-assigned this Mar 19, 2024
ReadyFilePath = "/mps/.ready"
)

// ReadyFile represents a file used to store readyness of the MPS daemon.
Contributor:

Suggested change:
- // ReadyFile represents a file used to store readyness of the MPS daemon.
+ // ReadyFile represents a file used to store readiness of the MPS daemon.

Comment on lines +78 to +79
// Matches checks whether the contents of the ready file matches the specified config.
func (f ReadyFile) Matches(config *spec.Config) (bool, error) {
Contributor:

Maybe use a different verb than Matches to future-proof this (since we don't plan on doing an exact match of the full config file in the future).

return nil
}

// Load loads the contents of th read file.
Contributor:

Suggested change:
- // Load loads the contents of th read file.
+ // Load loads the contents of the ready file.


// NewCommand constructs a mount command.
func NewCommand() *cli.Command {
// Create the 'generate-cdi' command
Contributor:

Suggested change:
- // Create the 'generate-cdi' command
+ // Create the 'wait' command

if err != nil {
return nil, fmt.Errorf("unable to validate flags: %v", err)
}
config.Flags.GFD = nil
Contributor:

Is it worth a comment why we nil out the GFD config?

Comment on lines +115 to +118
readyFile := mps.ReadyFile{}
if err := readyFile.Remove(); err != nil {
klog.Warningf("Failed to remove .ready file: %v", err)
}
Contributor:

What if MPS is turned off in the config? Will a previous ready file just be sitting around until MPS gets re-enabled at some point in the future?

Member Author:

Yes, with the current implementation. What options do we have to ensure that this is removed if the MPS daemonset is removed?

Comment on lines +77 to +111
klog.Info("Starting OS watcher.")
sigs := watch.Signals(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)

var retry <-chan time.Time

restart:
err = assertMpsIsReady(config)
if err == nil {
klog.Infof("MPS is ready")
return nil
}
klog.ErrorS(err, "MPS is not ready; retrying ...")
retry = time.After(30 * time.Second)

for {
select {
case <-retry:
goto restart
// Watch for any signals from the OS. On SIGHUP, restart this loop,
// restarting all of the plugins in the process. On all other
// signals, exit the loop and exit the program.
case s := <-sigs:
switch s {
case syscall.SIGHUP:
klog.Info("Received SIGHUP, restarting.")
goto restart
default:
klog.Infof("Received signal \"%v\", shutting down.", s)
goto exit
}
}

}
exit:
return nil
Contributor:

I'm not sure any of this restart logic is necessary. In fact, I'm not sure if this command should be called wait but rather just assert and then fail if it cannot assert that the MPS daemon is up and running. Since the intention is to run this as an init container, I actually don't want it to loop and instead just keep failing until it asserts that the daemon is up and running.

Comment on lines +229 to +253
if event.Name == mps.ReadyFilePath {
if config == nil || config.Sharing.SharingStrategy() != spec.SharingStrategyMPS {
klog.InfoS("Ignoring event", "event", event)
continue
}
switch {
case event.Op&fsnotify.Remove == fsnotify.Remove:
klog.Infof("%s removed; restarting", mps.ReadyFilePath)
goto restart
case event.Op&(fsnotify.Create|fsnotify.Write) != 0:
klog.Infof("%s created or modified; checking config", mps.ReadyFilePath)
readyFile := mps.ReadyFile{}

matches, err := readyFile.Matches(config)
if err != nil {
klog.ErrorS(err, "failed to check .ready file")
goto restart
}
if !matches {
klog.Info("mismatched MPS sharing config; restarting.")
goto restart
}
continue
}
}
Contributor:

I haven't read through this in detail, but at first glance it seems a bit convoluted. Can't we rely on the logic already in-place that handles config updates and hook into that somehow?

Member Author:

It may be possible to just rely on the config change restart logic. Let's make the following assumptions:

  1. The MPS daemonset owns the lifecycle of the .ready file and this file contains the current config used to start the MPS control daemon.
  2. If the config changes and both the MPS daemon and the device plugin are restarted by their respective config managers, we should be able to fail plugin startup until: 1. A .ready file exists with the expected config, and 2. The MPS pipe is ready from the perspective of the device plugin.

This should not require the additional error / watcher events here.

Note that since the config-manager sidecars are not always deployed, this seems to imply that there are configuration changes that we may miss in this case. For example, how do we capture the restart of the MPS daemon or some other configuration update that doesn't use the config manager? Do we expect the user to manually restart both pods in this case to ensure that we're in sync?

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@frittentheke

@elezar I suppose you don't want this change here to stale out?

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2024
@elezar (Member Author) commented Aug 12, 2024

@frittentheke you're right, I need to revisit this.
