Run mstfwreset when applying config changes #449

SalDaniele · 2023-05-30T13:26:24Z

As of the most recent update to the Bluefield2 firmware (v24.37.1300),
configuration changes are no longer applied on node reboot unless
mstfwreset is run.

This can leave a node stuck in a boot loop when a new configuration is
applied.

For example, applying a sriov-workload-node-policy that changes the total
number of VFs writes the new value via mstconfig, but the change is not
reflected in NUM_OF_VFS until mstfwreset is run. This resulted in
repeated reboot requests.

    I0523 20:59:44.059046    7496 mellanox_plugin.go:335] Changing TotalVfs 16 to 12, needs reboot
    I0523 20:59:44.059057    7496 mellanox_plugin.go:172] mellanox-plugin needDrain true needReboot true

Signed-off-by: Salvatore Daniele <sdaniele@redhat.com>
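
The loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `firmware` type, its fields, and the `reconcile` and `rebootsUntilStable` helpers are hypothetical names invented for illustration and are not the operator's actual code. It models a daemon that requests a reboot whenever the active VF count differs from the desired one, and a firmware that either does or does not apply the pending value on reboot.

```go
package main

import "fmt"

// firmware is a hypothetical model of the two values involved in the loop.
type firmware struct {
	pendingVFs int // value written by mstconfig, waiting to be applied
	activeVFs  int // value the running firmware reports as NUM_OF_VFS
	// appliesOnReboot is true for the older behaviour and false for the
	// behaviour reported in this PR for v24.37.1300.
	appliesOnReboot bool
}

// reboot simulates a node reboot without an explicit firmware reset.
func (f *firmware) reboot() {
	if f.appliesOnReboot {
		f.activeVFs = f.pendingVFs
	}
}

// reconcile mimics one daemon pass: if the active value differs from the
// desired one, it stores the new value and reports that a reboot is needed.
func reconcile(f *firmware, desired int) bool {
	if f.activeVFs == desired {
		return false
	}
	f.pendingVFs = desired
	return true
}

// rebootsUntilStable returns how many reboots were needed to converge,
// or -1 if the configuration never converged within maxReboots.
func rebootsUntilStable(f *firmware, desired, maxReboots int) int {
	for n := 0; n <= maxReboots; n++ {
		if !reconcile(f, desired) {
			return n
		}
		f.reboot()
	}
	return -1
}

func main() {
	oldFW := &firmware{activeVFs: 16, appliesOnReboot: true}
	newFW := &firmware{activeVFs: 16, appliesOnReboot: false}
	fmt.Println("old behaviour:", rebootsUntilStable(oldFW, 12, 5))
	fmt.Println("new behaviour:", rebootsUntilStable(newFW, 12, 5))
}
```

In this model the older behaviour converges after a single reboot, while the newer behaviour never converges, which matches the repeated "needs reboot" log lines above.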

github-actions · 2023-05-30T13:26:36Z

Thanks for your PR,
To run vendors CIs use one of:

/test-all: To run all tests for all vendors.
/test-e2e-all: To run all E2E tests for all vendors.
/test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

/skip-all: To skip all tests for all vendors.
/skip-e2e-all: To skip all E2E tests for all vendors.
/skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
Best regards.

SalDaniele · 2023-05-30T16:36:53Z

/cc @wizhaoredhat

@SchSeba @adrianchiris @zeeke Can anyone PTAL? Small update that is blocking our work

wizhaoredhat · 2023-05-30T17:33:51Z

LGTM

wizhaoredhat · 2023-05-30T17:34:48Z

Related issue: Mellanox/mstflint#785

wizhaoredhat · 2023-05-30T17:39:21Z

/cc @bn222

wizhaoredhat · 2023-05-30T18:57:57Z

@SalDaniele Could you also describe the bootloop with the config-daemon we were seeing with the latest BF2 firmware in the PR description?

adrianchiris · 2023-05-31T07:22:13Z

we are currently not calling mstfwreset in sriov-network-config-daemon, why is this change needed ?

wizhaoredhat · 2023-05-31T16:16:29Z

@adrianchiris Although we aren't using mstfwreset here, the mstflint package as a whole does have pciutils as a dependency, so we should add this dependency.

Without pciutils, we will get the following error message when running mstfwreset:

    Continue with reset?[y/N] y Failed -E- failed to run 'setpci -s 0000:c9:02.0 0x0.w'.

The setpci command is part of the pciutils package.

Signed-off-by: Salvatore Daniele <sdaniele@redhat.com>
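
One way to surface a missing dependency like this earlier is to check that the required binaries are on `PATH` before invoking them. The following is a minimal sketch, not code from this PR: `missingTools` is a hypothetical helper, and the only library call it relies on is `exec.LookPath` from the Go standard library.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the names in tools that cannot be found on PATH.
// It is a hypothetical helper used only for illustration.
func missingTools(tools ...string) []string {
	var missing []string
	for _, t := range tools {
		if _, err := exec.LookPath(t); err != nil {
			missing = append(missing, t)
		}
	}
	return missing
}

func main() {
	// mstfwreset shells out to setpci, which is shipped in pciutils.
	if m := missingTools("mstfwreset", "setpci"); len(m) > 0 {
		fmt.Printf("missing required tools: %v\n", m)
		return
	}
	fmt.Println("all required tools found")
}
```

A check like this would report the missing `setpci` binary by name instead of failing midway through a firmware reset.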

SalDaniele · 2023-06-06T21:41:47Z

@adrianchiris I just added an additional commit to address the crux of the issue we are encountering. As of the most recent Bluefield2 firmware update (5-18-23), config changes are no longer applied on the next boot as they were previously. This requires us to call mstfwreset manually. Without doing so, we can end up in a boot loop: the config daemon applies a change, the node reboots, the change is not reflected, and the cycle repeats.
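
The ordering described here (write the configuration, then reset the firmware so it takes effect) can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `runner` and `applyAndReset` are hypothetical names, the PCI address is a placeholder, and the command lines assume the usual `mstconfig -d <device> -y set KEY=VALUE` and `mstfwreset -d <device> -y reset` forms. The runner is injected so the sketch can be executed without hardware.

```go
package main

import "fmt"

// runner executes an external command. It is injected so the sequence can
// be exercised in a dry run (hypothetical type, for illustration only).
type runner func(name string, args ...string) error

// applyAndReset writes NUM_OF_VFS with mstconfig and then asks the
// firmware to reload with mstfwreset. The reset only runs if the
// configuration step succeeded.
func applyAndReset(run runner, pciAddr string, totalVfs int) error {
	vfs := fmt.Sprintf("NUM_OF_VFS=%d", totalVfs)
	if err := run("mstconfig", "-d", pciAddr, "-y", "set", vfs); err != nil {
		return fmt.Errorf("mstconfig failed: %w", err)
	}
	if err := run("mstfwreset", "-d", pciAddr, "-y", "reset"); err != nil {
		return fmt.Errorf("mstfwreset failed: %w", err)
	}
	return nil
}

func main() {
	// Dry run: print the commands instead of executing them.
	dryRun := func(name string, args ...string) error {
		fmt.Println(name, args)
		return nil
	}
	// "0000:c9:00.0" is a placeholder address, not taken from a real node.
	if err := applyAndReset(dryRun, "0000:c9:00.0", 12); err != nil {
		fmt.Println("error:", err)
	}
}
```

Passing a runner backed by `os/exec` instead of `dryRun` would execute the commands for real. The sketch deliberately leaves out any mode-specific flags.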

coveralls · 2023-06-06T21:45:16Z

Pull Request Test Coverage Report for Build 5200424111

0 of 19 (0.0%) changed or added relevant lines in 1 file are covered.
8 unchanged lines in 3 files lost coverage.
Overall coverage decreased (-0.2%) to 25.768%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/plugins/mellanox/mellanox_plugin.go	0	19	0.0%

Files with Coverage Reduction	New Missed Lines	%
controllers/sriovibnetwork_controller.go	2	64.15%
api/v1/helper.go	3	41.32%
pkg/daemon/daemon.go	3	42.91%

Totals
Change from base Build 5143456346:	-0.2%
Covered Lines:	1963
Relevant Lines:	7618

💛 - Coveralls

wizhaoredhat · 2023-06-07T16:18:03Z

LGTM
@adrianchiris PTAL

adrianchiris · 2023-06-13T15:10:28Z

pkg/plugins/mellanox/mellanox_plugin.go

+	if err := configFW(); err != nil {
+		return err
+	}
+	if err := resetFW(); err != nil {


I'm not too keen on always running mstfwreset; perhaps only for DPUs in Embedded mode?

I need to think about it a bit more.

Also, for a DPU you need to run with the --reset-sync=1 flag.

See:
https://docs.nvidia.com/networking/pages/viewpage.action?pageId=52009175

With the current implementation, does it work for you?

Aahh good to know. I was testing the current implementation on a cluster w/ the BF in NIC mode, which was working. The issue I was seeing was not specific to running in DPU embedded mode, the node would get stuck in a bootloop if I applied a policy to change the number of vfs, since after rebooting the updated configuration would not be applied, triggering another reboot.

So, you encountered the issue with DPU in NIC mode as well ?

My understanding since submitting this patch is that this is not intended behavior by Nvidia and a firmware patch will come in July to address this issue.

The issue is that the behavior of NIC mode changed too and is broken in a similar way to DPU mode. In DPU mode, --sync 1 causes the DPU to hang. I have code that adds --sync 1, but I want it to work correctly before pushing the changes.

> My understanding since submitting this patch is that this is not intended behavior by Nvidia and a firmware patch will come in July to address this issue.

That is correct, it's a bug and Nvidia should fix the firmware.

@SalDaniele in that case, I think this PR is no longer required and can be closed?

Yes, provided this fix is in the next firmware release, we can close this PR.

adrianchiris · 2023-07-05T13:34:24Z

PR is not needed, see discussion above

SalDaniele changed the title ~~Add pciultis dependency~~ Add pciutils dependency May 30, 2023

SalDaniele mentioned this pull request May 30, 2023

Dependency with pciutils Mellanox/mstflint#785

Open

github-actions bot requested a review from bn222 May 30, 2023 17:39

SalDaniele force-pushed the add_pciutil_dep branch from b253c36 to ec3ca85 Compare June 6, 2023 21:37

SalDaniele changed the title ~~Add pciutils dependency~~ Run mstfwreset when applying config changes Jun 6, 2023

SalDaniele force-pushed the add_pciutil_dep branch from ec3ca85 to 3847d7c Compare June 6, 2023 21:49

wizhaoredhat reviewed Jun 6, 2023

SalDaniele force-pushed the add_pciutil_dep branch from 3847d7c to 2d8391d Compare June 7, 2023 13:12

adrianchiris reviewed Jun 13, 2023

adrianchiris closed this Jul 5, 2023
