Inject OSImageURL from CVO into templated MachineConfigs #363
pkg/controller/template/render.go
Outdated
-	return MachineConfigFromIgnConfig(role, name, ignCfg), nil
+	mcfg := MachineConfigFromIgnConfig(role, name, ignCfg)
+	if osUpdatesEnabledForRole(config, role) {
+		mcfg.Spec.OSImageURL = config.OSImageURL
Let's maybe also log something in the else branch here so it's clear why it's not picking it up for some pools?
Problem is we're going to see that message every time something in the cluster changes...I've learned my lesson there about adding debug prints.
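For illustration only, a minimal sketch of what the suggested else-branch logging could look like, assuming glog as the logger; this is not code from the PR, and as noted above it was dropped for being too chatty:

mcfg := MachineConfigFromIgnConfig(role, name, ignCfg)
if osUpdatesEnabledForRole(config, role) {
	mcfg.Spec.OSImageURL = config.OSImageURL
} else {
	// Hypothetical: explain why this pool is skipped. As noted above,
	// this would fire on every cluster sync, which is why it was left out.
	glog.V(2).Infof("OS updates disabled for role %q; leaving OSImageURL unset", role)
}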
See #348
Looks sane
Moving discussion of bootstrap over here. Now that I think about it more... before or after this lands, nothing will actually be reacting to the osimageurl. So we should be able to land the installer PR now.
OK, I started on using labels, but am currently hitting a weird error with image corruption trying to deploy my updated controller image, which is almost certainly unrelated. And I'm still learning the label API/semantics.
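As a rough sketch of the label-based direction, assuming hypothetical names (the PoolsByRole field and the label key are inventions here, not this PR's actual API):

// Hypothetical sketch: gate OS updates per role via a label on the
// corresponding machine pool.
func osUpdatesEnabledForRole(config *RenderConfig, role string) (bool, error) {
	pool, ok := config.PoolsByRole[role] // assumed field
	if !ok {
		return false, fmt.Errorf("no machine pool found for role %q", role)
	}
	val, ok := pool.Labels["machineconfiguration.openshift.io/os-updates"] // assumed key
	if !ok {
		// Default to enabled when the label is absent.
		return true, nil
	}
	return val == "true", nil
}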
Force-pushed from 99836ca to 1277d52.
pkg/controller/template/render.go
Outdated
	osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
	if err != nil {
		return nil, err
	} else if osUpdatesEnabled {
No need for this else; you're returning on the branch above anyway.
I'm confused; the other one is just the usual error handling? Are you saying it'd, e.g., be clearer like this?
diff --git a/pkg/controller/template/render.go b/pkg/controller/template/render.go
index 40aa9a8..97cf0e4 100644
--- a/pkg/controller/template/render.go
+++ b/pkg/controller/template/render.go
@@ -258,7 +258,8 @@ func generateMachineConfigForName(config *RenderConfig, role, name, path string)
osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
if err != nil {
return nil, err
- } else if osUpdatesEnabled {
+ }
+ if osUpdatesEnabled {
mcfg.Spec.OSImageURL = config.OSImageURL
}
Yup, that's how it's usually done when you have an if branch which returns.
Force-pushed from 1277d52 to bccb216.
To clear up confusion: I believe this comment no longer stands; it's the other way around, right?
Yeah, I updated the PR description to match the current commit message.
/hold
For me to add some tests here that verify that the MCD has a target osimageurl.
Force-pushed from bccb216 to d0211fe.
Force-pushed from d0211fe to 4094236.
/test e2e-aws-op
...
🎊
Ah, but here's the next problem, the ... We went from ...
Other PRs are going in, though. I'm still worried that something we're doing in the updates is affecting the cluster or later tests, but I'm not seeing a consistent pattern yet.
HAProxy and Prometheus flakes.
/retest
Hmm. Other PRs are merging... still worried there's something "residual" we're doing to the cluster. But I just logged into this current e2e-aws cluster, and it looks fine...
Well, that last run was a huge set of failures. Reading the logs, the pools and the operator are reporting ready/done before everything is updated, because they key off currentConfig, which we set quickly. I think we should change the MCD to only set currentConfig when it's done pivoting; otherwise the pools are lying. But even then, the systems seem to be updated. E.g. one of the master MCDs says: ... And the tests start much later: ...
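A sketch of the ordering being proposed here; the annotation key is the one the MCD uses for the current config, while the helper names are assumptions:

// Hypothetical flow (pivotToOSImage is an assumed helper; corev1 is
// k8s.io/api/core/v1): pivot first, then report currentConfig, so the
// pools can no longer report done before the node is really updated.
func applyUpdate(node *corev1.Node, desiredConfig string) error {
	if err := pivotToOSImage(desiredConfig); err != nil {
		return err
	}
	// Only now is it honest to advertise the new config.
	node.Annotations["machineconfiguration.openshift.io/currentConfig"] = desiredConfig
	return nil
}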
/retest
@cgwalters there are some bigger CI fixes in progress (in other repos for e2e-aws) that should be going in in the next few days. We can always hold this until you are back on Tuesday and see if the CI resolves itself.
/retest
Looks like that last run hit this.
wha 🤔
/retest
Network failures in the last run (HAProxy, as always), plus one about a security context, which is the first time I've seen it.
/test images
I wonder how #442 impacts this PR as well; tests seem to be quite stable there now (especially e2e-aws, despite the usual flakes).
On the mount thing: you somehow caught the new 4.0 base image (based on UBI, which has no util-linux). ... will fix your issue and also allow that to get pulled.
/retest
Please review the full test history for this PR and help us cut down flakes.
I think this is working because ...
Yeah, that should be fixed, assuming we're testing with the right cri-o package. /cc @mrunalp
Yeah, machine-os-content was 313, which didn't have the fix. I just bumped to 318.
/retest
OK, my view of status here. First... I think I fixed #426 and I'd like to go with that (it's an additional commit on top), because this version isn't correctly having the controller manage updates: basically, a node comes up and joins the cluster, the MCD lands, and then we immediately reboot (not gated to one at a time by the controller). I see that happening rapidly for the workers, e.g. ...
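To make the "one at a time" gating concrete, a purely hypothetical sketch of controller-side budgeting (all names here are assumed, not the actual controller code):

// Hypothetical: let at most maxUnavailable nodes per pool update at once,
// so freshly joined nodes don't all pivot and reboot simultaneously.
func nodesToUpdate(candidates []*corev1.Node, inProgress, maxUnavailable int) []*corev1.Node {
	budget := maxUnavailable - inProgress
	if budget <= 0 {
		return nil // wait for in-flight updates to finish first
	}
	if budget > len(candidates) {
		budget = len(candidates)
	}
	return candidates[:budget]
}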
It's definitely downgrading the cluster. If some of these e2e failures come down to, e.g., tests assuming a slightly newer kubelet or cri-o, then indeed we should bump ...
/lgtm cancel
Since I'd like to keep trying for #426.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlebon

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files: ...
Approvers can indicate their approval by writing /approve in a comment.
/hold
@cgwalters: The following test failed, say /retest to rerun all failed tests: ...

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Closing this one in favor of #426, which includes it.
This injects the OSImageURL into the "base" config (e.g. 00-worker, 00-master). This differs from previous pull requests, which made it a separate MC, but that adds visual noise and will exacerbate renderer race conditions.
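For concreteness, a sketch of what the injection amounts to in the rendered base config, using the MachineConfig API types (the image URL below is a placeholder, and mcfgv1/metav1 are the usual API packages):

// The rendered 00-worker / 00-master MachineConfig simply carries the
// OS image URL in its spec, rather than living in a separate MC.
mcfg := &mcfgv1.MachineConfig{
	ObjectMeta: metav1.ObjectMeta{Name: "00-worker"},
	Spec: mcfgv1.MachineConfigSpec{
		OSImageURL: "registry.example.com/os-content@sha256:...", // placeholder
		Config:     ignCfg,
	},
}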