
Pin OS image for release builds #757

Merged
merged 2 commits on Dec 15, 2018

Conversation

rajatchopra
Contributor

@rajatchopra rajatchopra commented Nov 28, 2018

Three environment variables have been introduced at build time that pin the exact OS image to be pulled.
OPENSHIFT_INSTALL_RHCOS_BASE_URL: the base URL from which to pick the image. Defaults to "https://releases-rhcos.svc.ci.openshift.org/storage/releases" (as it was before this PR).
OPENSHIFT_INSTALL_RHCOS_DEFAULT_CHANNEL: the channel where the image is available. Defaults to "maipo" (as it was before this PR).
OPENSHIFT_INSTALL_RHCOS_BUILD_NAME: a string representing the build name. If empty (the default), the latest build is picked up. If the provided build name is wrong or unavailable, the installer errors out.
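For illustration, here is a shell sketch (not code from the PR itself) of how these variables and their stated defaults would compose the builds.json URL the installer fetches:

```shell
# Sketch only: variable names and defaults are taken from the PR
# description; the actual lookup lives in pkg/rhcos/builds.go.
base_url="${OPENSHIFT_INSTALL_RHCOS_BASE_URL:-https://releases-rhcos.svc.ci.openshift.org/storage/releases}"
channel="${OPENSHIFT_INSTALL_RHCOS_DEFAULT_CHANNEL:-maipo}"
build_name="${OPENSHIFT_INSTALL_RHCOS_BUILD_NAME:-}"   # empty means "latest"

# The installer would fetch this index to resolve (or validate) the build.
url="${base_url}/${channel}/builds.json"
echo "$url"
```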

This PR is a follow up on #732

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 28, 2018
@rajatchopra rajatchopra changed the title Pin OS image for builds Pin OS image for release builds Nov 28, 2018
@wking
Member

wking commented Nov 28, 2018

Looks like the first commit here overlaps with #732 (but has a different hash)?

@rajatchopra
Contributor Author

Looks like the first commit here overlaps with #732?

Yes. They are changing the same place in the build script and I wasn't keen on chasing the conflicts. We could close #732 if this is looking alright. Or let it merge and the commit will resolve by itself.

@wking
Member

wking commented Nov 28, 2018

Or let it merge and the commit will resolve by itself.

That works if your new commit here was on top of #732's e890b7b instead of on top of c5657fc. You can get that with:

$ git rebase --onto e890b7b HEAD^

@wking
Member

wking commented Nov 29, 2018

I'd still like to understand how far we are from pivoting to the OS image which was built into the release image. Related discussion in #281.

pkg/rhcos/builds.go (review thread, outdated, resolved)
pkg/rhcos/builds.go (review thread, outdated, resolved)
@abhinavdahiya
Contributor

  1. The envs are similar to what we use as options for the installer. These are build envs and they shouldn't resemble user envs.
  2. I think we are going to be branching for each new release and want the git history to show which image we pinned to. We could add overrides in the build script to pin to a specific version when we branch, but that's no different from us doing it in code.

@wking
Member

wking commented Nov 29, 2018

The envs are similar to what we use as options for the installer. These are build envs and they shouldn't resemble user envs.

+1. Probably just drop the OPENSHIFT_INSTALL_ prefixes.

I think we are going to be branching for each new release and want the git history to show which image we pinned to. We could add overrides in the build script to pin to a specific version when we branch, but that's no different from us doing it in code.

Well, I think there is some benefit to collecting these together. It would be great to be able to commit some environment variables to hack/release.sh and build with those pins without having to poke around in the bowels of pkg/. But yeah, at the end of the day, these are going to be strings in the repo, so we should drop the env-var approach if it seems too invasive. Currently the env-var approach seems minimal enough for me to want it, but if it comes down to personal preference I'm fine being overridden ;).

pkg/rhcos/builds.go (review thread, outdated, resolved)
@rajatchopra
Contributor Author

Looks like the first commit here overlaps with #732 (but has a different hash)?

I will just close #732 and deal with all the commits here.

+1. Probably just drop the OPENSHIFT_INSTALL_ prefixes

Done.

It would be great to be able to commit some environment variables to hack/release.sh

That is a wonderful idea, and indeed where all of this should live. An env var to the build script still has the advantage of easily testing a custom release build.
So when we make a release, we just write/update the image URL/buildName in hack/release.sh.

If we have a hard-coded buildName, I'd rather not bother hitting builds.json at all.

Could do it, though I think it helps to give a list of possible ones when the buildName does not match any.

I'm also fine bubbling that build argument up into AMI and QEMU

I say we should do it only if we really need to. On the other hand, I would like it if we could even bubble down the default channel.
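The lookup behavior being discussed — honor a pinned build name when it exists, fall back to the latest otherwise, and list the candidates on a mismatch — could be sketched roughly like this (hypothetical shell; the build names are made-up examples, and the real implementation is Go in pkg/rhcos/builds.go):

```shell
# Sketch of pinned-build resolution against a builds list, with the
# "list possible builds on mismatch" behavior rajatchopra describes.
builds="47.49 47.48"   # newest first, as if parsed from builds.json (assumption)
pin="47.48"            # e.g. from RHCOS_BUILD_NAME; empty means "latest"

chosen=""
if [ -z "$pin" ]; then
  chosen="${builds%% *}"            # default to the newest build
else
  for b in $builds; do
    if [ "$b" = "$pin" ]; then
      chosen="$b"
    fi
  done
fi

if [ -z "$chosen" ]; then
  echo "build '$pin' not found; available: $builds" >&2
else
  echo "using build $chosen"
fi
```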

@wking wking dismissed crawford’s stale review November 30, 2018 00:16

Both suggestions landed in 6521f80.

@wking
Member

wking commented Nov 30, 2018

If we have a hard-coded buildName, I'd rather not bother hitting builds.json at all.

Could do it, though I think it helps to give a list of possible ones when the buildName does not match any.

We could do that later, on a best-effort basis, if we need it. Something like this.

I'm also fine bubbling that build argument up into AMI and QEMU

I say we should do it only if we really need to. On the other hand, I would like it if we could even bubble down the default channel.

Well, 124df53 has the bubbled-up version if folks want to take a look ;). About bubbling down, that would simplify these APIs slightly, and folks could always supply their own libvirt image or AMI ID if they want to bypass our detection logic. Do we expect other folks to use this package and want more flexibility than "the latest maipo"? If we have any external consumers, they haven't told godocs about their package ;).

@cgwalters
Member

One thing related to this - I think we should consider always pinning the RHCOS build even in git master - then the RHCOS team owns sending PRs to bump it.

Though we need to be aware in this model that e.g. if we want to get a change to kubelet landed it'd have to merge into origin and then a secondary PR here.

@ashcrow
Member

ashcrow commented Nov 30, 2018

@cgwalters That seems reasonable to me.

@crawford
Contributor

crawford commented Dec 1, 2018

@cgwalters @ashcrow We had decided that master would always float so that we don't allow sub-components to drift too far from a working state. If we always pin the OS, it becomes very easy for the OS team to drift too far from a working state.

@cgwalters
Member

If we have any external consumers, they haven't told godocs about their package ;).

https://github.com/openshift/machine-config-operator/blob/77b0e7bc3c815fd23e4e220b780be2eda45eb250/pkg/operator/bootstrap.go#L10

at least (was reading the code this morning).

@cgwalters
Member

@cgwalters @ashcrow We had decided that master would always float so that we don't allow sub-components to drift too far from a working state. If we always pin the OS, it becomes very easy for the OS team to drift too far from a working state.

You say "drift from a working state" there twice but it isn't clear to me - what would be some concrete examples?

Big picture with this approach then, the state inputs to an install drop down to (installer, update payload) for master, which helps us get closer to the installer release state of simply (installer, ). That said we don't need to debate it in this PR - we can consider it later after this lands.

@wking
Member

wking commented Dec 3, 2018

If we have any external consumers, they haven't told godocs about their package ;).

https://github.com/openshift/machine-config-operator/blob/77b0e7bc3c815fd23e4e220b780be2eda45eb250/pkg/operator/bootstrap.go#L10 at least (was reading the code this morning).

That looks like pkg/types, not pkg/rhcos?

@crawford
Contributor

crawford commented Dec 3, 2018

You say "drift from a working state" there twice but it isn't clear to me - what would be some concrete examples?

The RHCOS team bumps the kernel, podman, and libc in the same week. At the end of the week, you submit a PR to the installer and then find out that that image no longer works. By the time you've figured out that it was the kernel bump, you've again updated podman. You submit another PR, and it breaks again, this time because there is some conflict between libc and podman. You then spend the rest of the month trying to get a working image without blocking the entire team. I've seen it happen back in the CoreOS days.

While I don't agree with the overall testing approach (testing every PR to every component with a full integration test is pretty inefficient), this is currently how we are testing everything in OpenShift. We can revisit this once we have something stable and ship a product, but in the meantime, I'd rather we just stick with the status quo.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2018
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 7, 2018
@wking
Member

wking commented Dec 8, 2018

e2e-aws:

ERROR: logging before flag.Parse: E1207 23:53:16.846995      49 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2212"
level=warning msg="Failed to connect events watcher: Get https://ci-op-jftfn5q4-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2212&watch=true: dial tcp 35.169.222.233:6443: connect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://ci-op-jftfn5q4-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2212&watch=true: dial tcp 35.169.222.233:6443: connect: connection refused"
level=debug msg="added kube-scheduler.156e3220df892228: ip-10-0-2-93_3c8ab69b-fa7b-11e8-be6a-12e836c48fc4 became leader"
level=debug msg="added kube-controller-manager.156e322150b30a61: ip-10-0-2-93_3fcc31ad-fa7b-11e8-a16c-12e836c48fc4 became leader"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"

Feels like an OpenShift API server flake (openshift/origin#21612).

/retest

@cgwalters
Member

I've seen it happen back in the CoreOS days.

With respect - I think what we're doing here is fairly different from that in numerous ways (OS is dedicated to exactly one clustering impl, OS is lifecycle bound with it, OS has more dedicated engineers, and probably the biggest: OS is tracking a more "stable" stream rather than "latest upstream", etc.)

At the end of the week, you submit a PR to the installer and then find out that that image no longer works.

I think we'd be constantly sending PRs - likely one a day. And the way I am thinking about things here, we would likely be testing individual component changes as well before we even submit the PR here.

And if for some reason pinning turns out to be a problem, it'd clearly be easy to just go back to not pinning right?

Finally - if you still disagree, can you help think of an alternative path that gets us RHCOS builds integrated with the CI gating?

@ashcrow
Member

ashcrow commented Dec 11, 2018

/test e2e-aws

@eparis
Member

eparis commented Dec 12, 2018

While I agree that RHCOS better not have the wild instability the CL had, I am on the side of master always floating.

I think that pinning RHCOS is a particular problem for the pod and runtime team as well. If they make a change to the kubelet or CRI-O, I think we won't actually get any CI coverage for that change until it comes back through an RHCOS update. Am I wrong? Do we build and use per-PR RHCOS builds with per-PR updated runtime and kubelet?

If we don't get testing of runtime and kubelet until the RHCOS bump I think the fear of 'drifting from working' is very real, and quite unacceptable.

@ashcrow
Member

ashcrow commented Dec 12, 2018

Do we build and use per-PR RHCOS builds with per-PR updated runtime and kubelet?

We do not. The runtime packages are currently manually pushed to a repo by folks in that team which we automatically pull from during builds. Kubelet changes come through the origin package repository, though we are working towards getting the kubelet PRs on top of RHCOS builds for kubelet CI.

@smarterclayton
Contributor

So we have two independent problems that are somewhat coupled:

  1. Gating breaks
  2. Catching breaks early

The current CI structure for most components says “master of a component github is truth”. That means a merge builds an image which is pushed to an integration stream. All other components test from the integration stream. This means gating and break fixing are handled by merges to master.

The coreos team is not currently obeying the integration stream pattern, which is going to be a problem because it means you aren't a part of the release image. So you need to fix that very soon or we've basically failed at release image (no one has made an argument to me why you are special and don't have to be part of the release image, so I assume you're working on that and this is a temporary stopgap).

When you fix that, you get break reversion by pushing an older image. But you still need to implement gating before your built image gets pushed.

If the team is >2 weeks from being in the integration image stream, I’d like to know why. If this is a short term thing until you get there, I might be ok. If this is an attempt to not being part of the release payload, then this is the wrong direction.

@cgwalters
Member

cgwalters commented Dec 12, 2018

If the team is >2 weeks from being in the integration image stream, I’d like to know why.

This is next on my list but I've been trying to get up to speed on hacking on the MCO since the way OSImageURL works today isn't how it should IMO, and that's a prereq for having the oscontainer in the update payload - openshift/machine-config-operator#183

However this is "crossing the threads" a bit - this issue is about the "bootimage", the initial RHCOS the installer uses.

If this is a short term thing until you get there, I might be ok. If this is an attempt to not being part of the release payload, then this is the wrong direction.

OK so...how hard would it be for us to have the "bootimage" (i.e. AMI, libvirt qcow2 URL) embedded in the release payload too? I think I floated this once before and @wking was skeptical but...it's just a small amount of data we could stick in the release payload metadata, we don't even need to extract the release payload fully, just do a few HTTP requests like skopeo inspect does.

@ashcrow
Member

ashcrow commented Dec 12, 2018

We do have to be very careful and make sure we are prioritizing work. The coreos team is currently working off our backlog using the priority we are given. If working on a CI system is higher priority than other work then @imcleod and Hina would be good contacts. All accepted and prioritized work goes through one of them first.

@ashcrow
Member

ashcrow commented Dec 12, 2018

I should note some of this is in our backlog and a few cards are prioritized. One of which is in this coming sprint.

Rajat Chopra added 2 commits December 14, 2018 12:17
But still leave an override env var so that the pinned image can be changed at run time. Use the following env var to pin the image while building the binary:
```
$ # export the release-image variable
$ export RELEASE_IMAGE=registry.redhat.io/openshift/origin-release:v4.0
$ # build the openshift-install binary
$ ./hack/build.sh
$ # distribute the binary to customers and run with pinned image being picked up
$ ./bin/openshift-install create cluster...
```
The only way to override it would be to use an env var OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE while running the openshift-install binary.
One environment variable to be used when building the pinned release binary:
    RHCOS_BUILD_NAME: a string representing the build name (or rather number, e.g. 47.48). If empty, the latest one will be picked up.
This could just be hardcoded in the hack/build.sh script before making the release branch so that it stays baked in.
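The precedence these commit messages describe — the image pinned at build time is used unless the runtime override env var is set — amounts to the following in shell terms (a sketch, reusing the example image reference from the commit message above, not the installer's actual code):

```shell
# Sketch of the override precedence: runtime env var wins over the
# build-time pin. The image reference is the example from the commit msg.
pinned_image="registry.redhat.io/openshift/origin-release:v4.0"  # baked in by hack/build.sh
effective_image="${OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE:-$pinned_image}"
echo "$effective_image"
```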
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 14, 2018
@wking
Member

wking commented Dec 14, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rajatchopra, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 15, 2018

@rajatchopra: The following test failed, say /retest to rerun them all:

Test name: ci/prow/e2e-libvirt
Commit: 6521f80
Rerun command: /test e2e-libvirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters
Member

Moving related conversation to #987

@rajatchopra rajatchopra deleted the pin_os branch January 15, 2019 18:41