Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Remove m3m ownerRef from bmh #1008

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

shibaPuppy
Copy link
Contributor

What this PR does / why we need it:
don't control bmh power state if it goes to provisioned state
provisioned was completed, we should be able to power off the node on a case-by-case.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@metal3-io-bot metal3-io-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 31, 2023
@metal3-io-bot
Copy link
Contributor

Hi @shibaPuppy. Thanks for your PR.

I'm waiting for a metal3-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@metal3-io-bot metal3-io-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 31, 2023
@mboukhalfa
Copy link
Member

/ok-to-test

@metal3-io-bot metal3-io-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 1, 2023
@Sunnatillo
Copy link
Member

/test-centos-e2e-integration-main
/test-ubuntu-integration-main

@furkatgofurov7
Copy link
Member

provisioned was completed, we should be able to power off the node on a case-by-case.

Thanks for the PR @shibaPuppy. Maybe I missing the use case here, but regardless of the host state, we should not power down the node. Can you elaborate more on your use-case?

@shibaPuppy
Copy link
Contributor Author

provisioned was completed, we should be able to power off the node on a case-by-case.

Thanks for the PR @shibaPuppy. Maybe I missing the use case here, but regardless of the host state, we should not power down the node. Can you elaborate more on your use-case?

@furkatgofurov7
We are using libvirt node with CAPM3.

I think it should be possible to force power off the vm to proceed with the evacuation due to memory issues.

Similarly, I thought that the PM HW issue should be able to be forcibly turned off.

@furkatgofurov7
Copy link
Member

furkatgofurov7 commented Jun 12, 2023

I think it should be possible to force power off the vm to proceed with the evacuation due to memory issues.

that could be taken care by features, like remediation IMO rather than forcing the controller to power down the node

@shibaPuppy
Copy link
Contributor Author

that could be taken care by features, like remediation IMO rather than forcing the controller to power down the node

In some cases, after turning off the power, the node where the OS is installed is moved to an isolated rack for inspection.

I haven't used machine-health-check yet.
the kubelet may have issues, but the libvirt vm may not.
may not be able to proceed with reinstalling the OS.

@metal3-io-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2023
@metal3-io-bot
Copy link
Contributor

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

@metal3-io-bot
Copy link
Contributor

@metal3-io-bot: Closed this PR.

In response to this:

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shibaPuppy
Copy link
Contributor Author

@Rozzii
plz confirm.

@Rozzii
Copy link
Member

Rozzii commented Oct 18, 2023

/reopen

@metal3-io-bot metal3-io-bot reopened this Oct 18, 2023
@metal3-io-bot
Copy link
Contributor

@Rozzii: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Rozzii
Copy link
Member

Rozzii commented Oct 18, 2023

As far as I can see there were no decision on this PR and @shibaPuppy is still active so I have reopened this.
/cc @dtantsur @kashifest @zaneb @elfosardo @hroyrh @hardys let's expand the scope of reviewers a a bit.

@Rozzii Rozzii removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2023
@zaneb
Copy link
Member

zaneb commented Oct 24, 2023

I agree that we should not be continuously forcing the power control to be on, which it appears we are currently doing because setHostSpec() is called from Update(). We just need to make sure it is on at the time the host is provisioned.

I think I'd prefer to set it inside the same block that we are setting the image, rather than relying on the state, since that could be racy. But in practice I expect this is probably ok.

/cc @honza

@shibaPuppy
Copy link
Contributor Author

@honza @zaneb @Rozzii
really want this feature 😭

@zaneb
Copy link
Member

zaneb commented Jan 14, 2024

Looking into this a bit more deeply...

I looked at our fork of the original CAPBM (from the pre-CAPM3 times), and the provisionHost() function (equivalent of setHostSpec() here) also sets Online to true unconditionally. That's OK though, because it is only called from Create(), so it only happens when the Machine is in the Provisioning phase.

What really changed in CAPM3 is that 31d394f started calling setHostSpec() not only from Associate() but also from Update(). Presumably this means it now gets called all the time no matter what phase the Machine is in. Since the purpose of this patch was to set the owner reference on the BMH, which we've already agreed (in #1358) is wrong, maybe we should revert that whole patch.

@metal3-io-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2024
@Rozzii
Copy link
Member

Rozzii commented May 6, 2024

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 6, 2024
@Rozzii
Copy link
Member

Rozzii commented May 6, 2024

/hold
I am holding this based on @zaneb 's reply #1008 (comment) , could we discuss this @shibaPuppy ?

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@shibaPuppy
Copy link
Contributor Author

@Rozzii @zaneb
I am not sure about my opinion as I lack a deep understanding of the CAPM3.

I think the power on function of baremetal nodes through a call to provisionHost() is essential, but I don't think there is a need to force the state in metal3machine when it is in provisioned state.

@shibaPuppy
Copy link
Contributor Author

I think I'd prefer to set it inside the same block that we are setting the image, rather than relying on the state, since that could be racy. But in practice I expect this is probably ok.

Does the image block match that line?

https://github.com/metal3-io/cluster-api-provider-metal3/pull/1008/files#diff-cd2cdab0bdcc6b6e359db724b479ca86f879705fbfbcdddba7dbc876941e89bbR1121-L1151

@kashifest @zaneb @Rozzii

@zaneb
Copy link
Member

zaneb commented May 23, 2024

@Rozzii @zaneb I am not sure about my opinion as I lack a deep understanding of the CAPM3.

I think the power on function of baremetal nodes through a call to provisionHost() is essential, but I don't think there is a need to force the state in metal3machine when it is in provisioned state.

Yes, correct. It should do it once and leave it.

I think we should stop calling setHostSpec() from Update(), which would have the same effect by reducing complexity instead of adding to it.

We should also stop setting the owner reference, because we agreed in #1358 that that's a bug.

A straight revert of 31d394f would achieve both.

@metal3-io-bot metal3-io-bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2024
@metal3-io-bot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Rozzii
Once this PR has been reviewed and has the lgtm label, please ask for approval from kashifest. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment
@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Rozzii
Once this PR has been reviewed and has the lgtm label, please ask for approval from kashifest. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot metal3-io-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 8, 2024
@shibaPuppy shibaPuppy force-pushed the main branch 2 times, most recently from f2e9bc9 to 83104bb Compare June 8, 2024 18:44
@shibaPuppy
Copy link
Contributor Author

/retitle 🐛 Remove m3m ownerRef from bmh

@metal3-io-bot
Copy link
Contributor

@shibaPuppy: Re-titling can only be requested by trusted users, like repository collaborators.

In response to this:

/retitle 🐛 Remove m3m ownerRef from bmh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shibaPuppy shibaPuppy changed the title Change the power control of bmh 🐛 Remove m3m ownerRef from bmh Jun 8, 2024
@shibaPuppy
Copy link
Contributor Author

@Rozzii @zaneb
owerRef commit was not reverted, so I wrote a commit to remove it.

metal3-io@31d394f

Signed-off-by: shibaPuppy <sub951@naver.com>
@dtantsur
Copy link
Member

Do we need to keep DeleteOwnerRef for the time being? I'm thinking of an upgrade scenario where someone already has hosts with OwnerReferences that now are going to stick forever.

@Rozzii Rozzii added this to the 1.8.0 milestone Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: CAPM3 WIP
Development

Successfully merging this pull request may close these issues.

None yet

10 participants