Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for safe driver loading feature #600

Merged
merged 3 commits into from
Dec 5, 2023

Conversation

ykulazhenkov
Copy link
Collaborator

@ykulazhenkov ykulazhenkov commented Sep 20, 2023

On Node startup, the OFED container takes some time to compile and load the driver.
During that time, workloads might get scheduled on that Node.
When OFED is loaded, all existing PODs that use NVIDIA NICs will lose their network interfaces.
Some such PODs might silently fail or hang.
To avoid such a situation, before the OFED container is loaded,
the Node should get Cordoned and Drained to ensure all workloads are rescheduled.
The Node should be un-cordoned when the driver is ready on it.

The safe driver loading feature is implemented as a part of the upgrade flow,
meaning safe driver loading is a special scenario of the upgrade procedure,
where we upgrade from the inbox driver to the containerized OFED.

depends on: #685

@ykulazhenkov ykulazhenkov added the on hold This enhancement is currently on hold pending additional clarification and evaluation label Sep 20, 2023
@ykulazhenkov ykulazhenkov force-pushed the safe-driver-load branch 2 times, most recently from d72fe32 to 5054f10 Compare September 20, 2023 08:51
@ykulazhenkov
Copy link
Collaborator Author

/retest-all

@abdallahyas
Copy link
Contributor

@ykulazhenkov for the kind CI a rebase is needed to have the revert of the tighting patch applied, for the helm CI, the mofed container seem to have been delayed to be up, causing the 10 minute timeout of the CI to fail, Do you think we need to increase the timeout of the CI to wait for the nic cluster policy to become ready.

@ykulazhenkov ykulazhenkov force-pushed the safe-driver-load branch 2 times, most recently from a1b8ea0 to 05bcf3e Compare November 1, 2023 11:39
@ykulazhenkov ykulazhenkov changed the title Add support for safe driver loading feature feat: add support for safe driver loading feature Nov 1, 2023
@ykulazhenkov ykulazhenkov force-pushed the safe-driver-load branch 3 times, most recently from 37834e9 to 6cf4d7a Compare November 8, 2023 14:39
@ykulazhenkov ykulazhenkov marked this pull request as ready for review November 8, 2023 14:40
@ykulazhenkov
Copy link
Collaborator Author

/retest-all

@ykulazhenkov ykulazhenkov force-pushed the safe-driver-load branch 2 times, most recently from 2d368ce to 18d4400 Compare November 14, 2023 15:31
@ykulazhenkov ykulazhenkov force-pushed the safe-driver-load branch 2 times, most recently from 09dbdff to 598d765 Compare November 22, 2023 17:34
@ykulazhenkov ykulazhenkov marked this pull request as draft November 22, 2023 17:35
@ykulazhenkov
Copy link
Collaborator Author

ykulazhenkov commented Nov 22, 2023

I updated PR code to use network-operator-init-container 0.0.2

@ykulazhenkov ykulazhenkov marked this pull request as ready for review November 23, 2023 13:40
@ykulazhenkov
Copy link
Collaborator Author

/retest-nic_operator_helm

1 similar comment
@ykulazhenkov
Copy link
Collaborator Author

/retest-nic_operator_helm

@ykulazhenkov ykulazhenkov force-pushed the safe-driver-load branch 2 times, most recently from c7b7be3 to 3eca43a Compare November 27, 2023 11:06
@ykulazhenkov ykulazhenkov removed the on hold This enhancement is currently on hold pending additional clarification and evaluation label Nov 27, 2023
@ykulazhenkov
Copy link
Collaborator Author

/retest-nic_operator_helm

main.go Outdated Show resolved Hide resolved
pkg/state/state_ofed.go Outdated Show resolved Hide resolved
api/v1alpha1/nicclusterpolicy_types.go Outdated Show resolved Hide resolved
@ykulazhenkov ykulazhenkov force-pushed the safe-driver-load branch 3 times, most recently from f8a9925 to 76776e3 Compare November 28, 2023 14:34
@ykulazhenkov
Copy link
Collaborator Author

ykulazhenkov commented Nov 29, 2023

I added check to Admission controller to make sure that initContainer and autoUpgrade are enabled when a user tries to enable safeLoad feature.
Also, I removed ofed.initContainer.enable flag from the CR (I kept it in Helm) - this flag has no extra meaning.

@ykulazhenkov
Copy link
Collaborator Author

/retest-nic_operator_helm

1 similar comment
@ykulazhenkov
Copy link
Collaborator Author

/retest-nic_operator_helm

@ykulazhenkov
Copy link
Collaborator Author

ykulazhenkov commented Dec 1, 2023

@adrianchiris @rollandf @almaslennikov I did a big change in the PR after discussion with Adrian: now we use enviroment variable to configure image for the init container. Please, take a look again

Copy link
Member

@rollandf rollandf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit, otherwise LGTM

api/v1alpha1/nicclusterpolicy_webhook_test.go Outdated Show resolved Hide resolved
deployment/network-operator/README.md Show resolved Hide resolved
pkg/config/config.go Outdated Show resolved Hide resolved
OSName: nodeAttr[nodeinfo.AttrTypeOSName],
OSVer: nodeAttr[nodeinfo.AttrTypeOSVer],
MOFEDImageName: s.getMofedDriverImageName(cr, nodeAttr, reqLogger),
InitContainerConfig: s.getInitContainerConfig(cr, reqLogger, config.FromEnv().OFEDState.InitContainerImage),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe we would generally want to work with staticconfig provider[1] or just delete it and work with config.FromEnv()...

staticInfoProvider := staticconfig.NewProvider(staticconfig.StaticConfig{CniBinDirectory: cniBinDir})

same goes for the namespace we use in runtimespec (here and in other states)

thoughts ? (ill create a task to align so it will not affect this PR)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was thinking about using a static provider, but I didn't find any benefits from it.
Maybe we don't need the static provider at all, because it doesn't make sense to cache env variables (we already read them only once)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe will be useful for UT ? (one less source of informaiton)

that is a state gets the CR, some providers, renderer, and do the work.

Copy link
Collaborator

@adrianchiris adrianchiris Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created #698 please add your preference there.

Signed-off-by: Yury Kulazhenkov <ykulazhenkov@nvidia.com>
Signed-off-by: Yury Kulazhenkov <ykulazhenkov@nvidia.com>
Signed-off-by: Yury Kulazhenkov <ykulazhenkov@nvidia.com>
@ykulazhenkov
Copy link
Collaborator Author

/retest-nic_operator_helm

Copy link
Collaborator

@e0ne e0ne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@e0ne e0ne merged commit f64281f into Mellanox:master Dec 5, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants