Uploading to cloud platforms: GCP #147

Closed · dustymabe opened this issue Feb 20, 2019 · 23 comments
Labels: cloud* (related to public/private clouds) · jira (for syncing to jira)

@dustymabe (Member) commented Feb 20, 2019

This is part of #146 and tracks the work/discussion around uploading to GCP.

@dustymabe dustymabe added the cloud* related to public/private clouds label Feb 20, 2019
@bgilbert (Contributor)

I think this involves the following:

  • Have GCP create a project to host the images. That project is traditionally called <something>-cloud. We should decide whether the project should be FCOS-specific or Fedora-wide. Our GCP contacts would like to meet with someone from releng to start setting this up.
  • Adapt plume for FCOS.
  • After testing, have GCP mark the project public.

@bgilbert bgilbert changed the title Uploading to cloud platforms: GCE Uploading to cloud platforms: GCP Mar 20, 2019
@dustymabe (Member, Author)

related: coreos/coreos-assembler#493

@miabbott miabbott added the jira for syncing to jira label Sep 13, 2019
@bgilbert (Contributor) commented Oct 9, 2019

Infra ticket to create the requisite GCP projects.

@cgwalters (Member)

I'm working on some GCP bits for RHCOS (see e.g. openshift/installer#2921), and not having FCOS there inhibits doing things upstream first.

@dghubble (Member)

With manually uploaded GCP images, Zincati checks for updates and reboots promptly (speedy!) after boot, which has an interesting cascading effect of interrupting initial cluster bootstrapping that I noticed today. Being able to use a latest GCP channel image would have the nice side benefit of (mostly) clearing up these immediate-reboot issues.

@cgwalters (Member)

@dghubble slightly related: coreos/zincati#251. But you probably need to be using https://github.com/coreos/airlock or an equivalent if you aren't.

@dustymabe dustymabe self-assigned this Mar 29, 2020
@LorbusChris (Contributor)

I think #392 is also related here.

In OKD we're working around this by adding an /etc/zincati/config.d/90-disable-feature.toml via Ignition that explicitly disables Zincati updates: https://github.com/openshift/installer/blob/fcos/data/data/bootstrap/files/etc/zincati/config.d/90-disable-feature.toml
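For context, that dropin almost certainly boils down to Zincati's documented updates.enabled knob; a minimal sketch, not copied from the linked file:

# /etc/zincati/config.d/90-disable-feature.toml
# Disable Zincati's auto-update feature entirely.
[updates]
enabled = false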

@bgilbert (Contributor)

@LorbusChris That approach makes sense when the cluster has its own mechanism for updating nodes. In general, though, we shouldn't encourage users to disable updates.

@LorbusChris (Contributor)

Indeed, we shouldn't encourage that in general. My thinking was that in this specific case of a (probably short-lived) bootstrap host it may make some sense :)

@dghubble (Member)

Thanks for the suggestions. For now I'm content to follow this issue for GCP uploads. Manually uploading new images, or sometimes having to retry the bootstrap in this one case, is OK with me, as I'd like to keep auto-updates enabled (it's a master/controller, rather than a throwaway like OKD uses, I think).

Mainly I'm just highlighting that, until the channel is in place, stale manual images can cause reboots (e.g. on new instance creation, or after GCE preemption) that wouldn't otherwise be needed.

@lucab (Contributor) commented Mar 31, 2020

@dghubble good feedback points on poseidon/typhoon#687. I have a few more things for you:

  1. Zincati only runs after boot-complete.target. If your bootstrapping flow needs to happen before any update, you can delay reaching that target until it's done (see the unit sketch after this list).
  2. If you are bootstrapping from scratch and can't afford to run a lock-managing service, it sounds like you may want "updates: new strategy based on local filesystem" (zincati#245) implemented.
  3. Similar to the above, path-activation may be interesting too, but I'm unsure how to plug the default-on behavior and Ignition-based disabling into that.
  4. In general, a deterministic trigger is preferable to an arbitrary timer/delay. Both timer-activation and an SSH delay would just make races harder to detect.
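For item 1, a minimal sketch of the delaying mechanism, assuming a oneshot unit ordered before boot-complete.target (the unit name and script path are hypothetical, not from this thread):

# /etc/systemd/system/hold-boot-complete.service (hypothetical)
[Unit]
Description=Hold boot-complete.target until cluster bootstrap finishes
Before=boot-complete.target

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical helper that blocks until bootstrapping reports success
ExecStart=/usr/local/bin/wait-for-bootstrap

[Install]
RequiredBy=boot-complete.target

Until this unit exits successfully, boot-complete.target is not reached, so Zincati holds off on update checks.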

@dghubble (Member) commented Apr 1, 2020

Thanks @lucab! I think I'd want to aim for the lock-managing approach personally. A local-file strategy could help in a pinch here (for bootstrap), and later an airlock atop the cluster could get (roughly) one-at-a-time node updates (I don't have strong consistency needs). I looked at airlock's etcd strategy, but I'm reluctant to give it that access policy-wise, or to deal with giving it a TLS client cert, etc. I'd picture airlock (or similar) maybe running as a pod with RBAC to write a ConfigMap or annotation, as a bit of scratch space for a locked yes/no. airlock is looking promising, and simpler than CLUO was, since it avoids the coordinator/DaemonSet.

For 1, I'd worry about blocking the target on future reboots (bootstrap happens once per cluster lifetime). For 3, yeah, kinda; and agreed that 4 may not be the best fit for this situation.

Maybe I'll move this to a (new?) issue so I don't hijack.

@dustymabe (Member, Author)

Some updates...

> I think this involves the following:
>
> • Have GCP create a project to host the images. That project is traditionally called <something>-cloud. We should decide whether the project should be FCOS-specific or Fedora-wide. Our GCP contacts would like to meet with someone from releng to start setting this up.

We have a fedora-coreos-cloud project now.

> • Adapt plume for FCOS.

We adapted ore within COSA to support uploading to GCP and doing what we need, and we modified the pipeline to start doing the upload.

> • After testing, have GCP mark the project public.

GCP is working on making the fedora-coreos-cloud project public. In the meantime, I've individually marked the images from our latest testing and stable releases as public, so you should be able to use them today:

# gcloud compute images list --project fedora-coreos-cloud --no-standard-images --show-deprecated 
NAME                                       PROJECT              FAMILY                       DEPRECATED  STATUS
fedora-coreos-31-20200323-3-2-gcp-x86-64   fedora-coreos-cloud  fedora-coreos-stable                     READY
fedora-coreos-31-20200407-2-2-gcp-x86-64   fedora-coreos-cloud  fedora-coreos-testing                    READY
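To try one of these out, something along these lines should work (the instance name and zone are placeholders; --image-family picks the newest non-deprecated image in the family):

# gcloud compute instances create fcos-test \
    --zone=us-central1-a \
    --image-project=fedora-coreos-cloud \
    --image-family=fedora-coreos-stable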

@vrutkovs (Member)

Is there a bucket with the .tar.gz artifact uploaded, so that OKD could reuse it?

@dustymabe (Member, Author)

The URL for the tar.gz within GCP is in the meta.json. My understanding is that it's accessible to any authenticated user. Previously I had considered the file in the image bucket to be purely ephemeral (i.e. only needing to exist for the image import to happen), but I've recently learned that OpenShift creates images on the fly during install. I'd like to understand why we need to do that if an image is already available publicly (and globally). Different guestOsFeatures? Encryption?
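For reference, importing such a tar.gz into a project is roughly the following; the bucket path and feature list here are illustrative placeholders, not the values the pipeline actually uses:

# gcloud compute images create my-fcos-image \
    --source-uri=gs://my-bucket/fedora-coreos-gcp.tar.gz \
    --guest-os-features=VIRTIO_SCSI_MULTIQUEUE,UEFI_COMPATIBLE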

@cgwalters (Member)

> but I've recently learned that OpenShift creates images on the fly during install. I'd like to understand why we need to do that if an image is already available publicly (and globally). Different guestOsFeatures? Encryption?

The idea, broadly speaking, was that RHEL CoreOS is an implementation detail of OpenShift; hence, there wasn't a desire to publish pre-built images globally. This might change at some point.

It would seem reasonable to me, though, to change openshift-install to support directly using a public GCP image if one is available; it'd just require some conditionals in the Terraform logic.

@dustymabe (Member, Author)

>> but I've recently learned that OpenShift creates images on the fly during install. I'd like to understand why we need to do that if an image is already available publicly (and globally). Different guestOsFeatures? Encryption?
>
> The idea, broadly speaking, was that RHEL CoreOS is an implementation detail of OpenShift; hence, there wasn't a desire to publish pre-built images globally. This might change at some point.

Yeah, I think that makes sense. Since we're already publishing images for Fedora CoreOS, I think it makes sense to do what you suggest below 👇🏼 for Fedora CoreOS/OKD.

> It would seem reasonable to me, though, to change openshift-install to support directly using a public GCP image if one is available; it'd just require some conditionals in the Terraform logic.

@dustymabe (Member, Author) commented May 7, 2020

remaining items here:

@dustymabe (Member, Author)

updated #147 (comment) with current status.

@lucab (Contributor) commented May 15, 2020

@dghubble yes, feel free to move this to another ticket or a forum discussion. From your last reply, it sounds like a volatile config fragment in /run, making coreos/zincati#245 apply on first boot only, could be a good solution.
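To illustrate the volatile-fragment idea: /run is a tmpfs, so a dropin that Ignition writes there exists only for the first boot and vanishes afterwards. A minimal sketch using Zincati's documented updates.enabled knob (the filename is hypothetical, and this disables updates outright rather than using the #245 strategy, which hadn't landed yet):

# /run/zincati/config.d/90-first-boot-hold.toml (hypothetical path)
# /run is volatile: this fragment is gone after the first reboot,
# so updates resume automatically on subsequent boots.
[updates]
enabled = false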

@dustymabe (Member, Author)

updated #147 (comment) with current status.

@dghubble (Member)

I'm using the newly uploaded GCP images instead of manually uploading. Many thanks for these!

# Terraform example
# Fedora CoreOS most recent image from stream
data "google_compute_image" "fedora-coreos" {
  project = "fedora-coreos-cloud"
  family  = "fedora-coreos-${var.os_stream}"
}
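For completeness, the data source's self_link can then feed an instance's boot disk; a sketch where the name, machine type, and zone are placeholders:

resource "google_compute_instance" "controller" {
  name         = "fcos-controller" # placeholder
  machine_type = "e2-standard-2"   # placeholder
  zone         = "us-central1-a"   # placeholder

  boot_disk {
    initialize_params {
      # Latest image resolved from the family above
      image = data.google_compute_image.fedora-coreos.self_link
    }
  }

  network_interface {
    network = "default"
  }
}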

@dustymabe (Member, Author)

OK, all the items in #147 (comment) are now done or broken out into other tickets. I also created a new ticket for getting this plumbed through to our download page: #494

I'm going to mark this as done. It was a long journey.
