server: Implement "early pivot" before the MCD comes up #426


cgwalters
Member

We landed a lot of code to have the MCD call out to pivot.service;
this builds on that to implement an "early pivot" model where
we do the OS update before the node even joins the cluster.

This should result in less disruption, though debuggability is weaker.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 13, 2019
@cgwalters
Member Author

/hold

This builds on #363

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 13, 2019
@cgwalters
Member Author

/retest

@jlebon
Member

jlebon commented Feb 14, 2019

Gonna try to test this using the instructions in #425!

(Now pivoting to testing #363 first).

@cgwalters cgwalters force-pushed the osimage-render-with-early-pivot branch from 50c3534 to 3a99457 Compare February 16, 2019 01:42
@ashcrow
Member

ashcrow commented Feb 16, 2019

Unit failure:

--- FAIL: TestKubeletConfigCreate (0.13s)
    --- PASS: TestKubeletConfigCreate/aws (0.08s)
    --- PASS: TestKubeletConfigCreate/none (0.03s)
    --- FAIL: TestKubeletConfigCreate/unrecognized (0.01s)
    	kubelet_config_controller_test.go:207: Expected

@runcom
Member

runcom commented Feb 16, 2019

Unit failure:

should be #417 which is going to be fixed by #437

@cgwalters
Member Author

/retest

@cgwalters
Member Author

The pod logs are all 0 sized in those two...weird.

/retest

This reverts commit 3808104. It didn't work, and it actively
breaks things because we no longer substitute the value at
build time.

This injects the `OSImageURL` into the "base"
config (e.g. `00-worker`, `00-master`).  This differs from
previous pull requests, which made it a separate MC; that approach
adds visual noise and would exacerbate renderer race conditions.
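
To illustrate the idea of setting `OSImageURL` on the per-role base config rather than creating a separate MC, here is a minimal Go sketch. It is not the code from this PR: the `MachineConfig` struct, `baseConfigName`, `injectOSImageURL`, and the example image URL are all hypothetical names invented for illustration, and only the `00-<role>` naming comes from the commit message above.

```go
package main

import "fmt"

// MachineConfig is a hypothetical, heavily simplified stand-in for a
// machine config object; only the fields needed for this sketch are kept.
type MachineConfig struct {
	Name       string
	OSImageURL string
}

// baseConfigName returns the base config name for a role, following the
// "00-<role>" naming mentioned in the commit message (00-worker, 00-master).
func baseConfigName(role string) string {
	return "00-" + role
}

// injectOSImageURL (hypothetical helper) sets OSImageURL on the base config
// for the given role and leaves every other config untouched. It reports
// whether a matching base config was found.
func injectOSImageURL(configs []MachineConfig, role, osImageURL string) bool {
	for i := range configs {
		if configs[i].Name == baseConfigName(role) {
			configs[i].OSImageURL = osImageURL
			return true
		}
	}
	return false
}

func main() {
	configs := []MachineConfig{
		{Name: "00-worker"},
		{Name: "01-worker-kubelet"},
	}
	// The image URL below is a placeholder, not a real image reference.
	found := injectOSImageURL(configs, "worker", "registry.example.com/os@sha256:placeholder")
	fmt.Println(found, configs[0].OSImageURL)
}
```

Because the URL lives on the existing base config in this sketch, no extra object appears in the list of configs for the role, which is the "visual noise" point made above.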
@cgwalters cgwalters force-pushed the osimage-render-with-early-pivot branch from 3a99457 to 2fcdd0e Compare February 16, 2019 20:09
@cgwalters
Member Author

This one was missing ffd9888 😢

/retest

@cgwalters
Member Author

/retest

@cgwalters
Member Author

/test e2e-aws
/test e2e-aws-op

@cgwalters
Member Author

/retest

@smarterclayton
Contributor

/retest

@cgwalters
Member Author

Yeah, I think this one is working pretty well. This e2e-aws-op run looks good: all masters/workers ended up at deaee0eeea9cbc2a38381964f0295e0b1e0dda92e4ee3cd46c2b1822da938b0a (i.e. 47.313) before the MCD came up.

@smarterclayton
Contributor

Cluster launch wasn't substantially longer - you have my approval to merge (we've already cut a beta candidate and this is critical path).

@runcom
Member

runcom commented Feb 17, 2019

@cgwalters if you're ok removing the hold, I'll lgtm

@cgwalters
Member Author

/hold cancel

Yep let's do this one!

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2019
@runcom
Member

runcom commented Feb 17, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 17, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom
Member

runcom commented Feb 18, 2019

time="2019-02-18T00:15:29Z" level=info msg="Waiting up to 30m0s for the cluster to initialize..."
time="2019-02-18T00:15:29Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:15:47Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:16:02Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:17:32Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:17:47Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:18:32Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:18:47Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:21:01Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:21:32Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator openshift-samples is reporting a failure: Samples installation in error at 4.0.0-alpha1-69362431c: "
time="2019-02-18T00:22:20Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator openshift-samples is reporting a failure: Samples installation in error at 4.0.0-alpha1-69362431c: "
time="2019-02-18T00:24:47Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T00:25:41Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:29:02Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T00:30:47Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:34:17Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T00:37:47Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:41:02Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator network has not yet reported success"
time="2019-02-18T00:45:02Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-18T00:45:29Z" level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

/retest

@cgwalters
Member Author

Just to reiterate, now that this has landed, both in CI and for people using the installer git master:

Today the installer pulls a floating RHCOS on boot, but machine-os-content is currently fixed in the release payload, so the OS will be downgraded until we've automated pushes of the latter.

See also openshift/origin#21998
