Nightly (regular) build of container images #666

Closed
jlewi opened this issue Apr 16, 2018 · 15 comments

@jlewi
Contributor

jlewi commented Apr 16, 2018

We are adding more and more Docker images (e.g. PyTorch, Central UI, katib etc...).

We need a way to regularly rebuild and push the latest images to a public repository e.g. gcr.io/kubeflow-images-staging or gcr.io/kubeflow-ci

Currently this is all done manually.

Our release process consists of a bunch of Argo workflows.

So if we wanted to automate this we could just create a cron job to invoke run_e2e_workflow.py using a YAML file that specifies all the workflows for a release.
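
A minimal sketch of what such a cron entry point might look like, assuming the YAML file has already been parsed into a list of workflow entries. The `name` field and the `submit` callable are hypothetical placeholders for illustration; they are not the actual interface of run_e2e_workflow.py.

```python
from typing import Callable, Dict, Iterable, List


def run_release_workflows(
    workflows: Iterable[Dict[str, str]],
    submit: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Submit every workflow entry and return the names of those that failed.

    `submit` is a hypothetical hook that launches one workflow and returns
    True on success. A failure in one workflow does not stop the others.
    """
    failed = []
    for wf in workflows:
        try:
            ok = submit(wf)
        except Exception:
            # Treat any exception from the launcher as a failed workflow.
            ok = False
        if not ok:
            failed.append(wf.get("name", "<unnamed>"))
    return failed
```

Collecting failures instead of aborting on the first one means a single broken image build would not block the rest of the nightly run.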

@willb Will, any interest in taking this on?

/cc @jose5918
/cc @willb

@jose5918
Contributor

@jlewi What do you think about doing this as part of the postsubmit process? We could check that JOB_TYPE=postsubmit and release the latest image at that point, which would avoid needing a cron job.
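
A rough sketch of that check, assuming the job runs under Prow, which exposes the job type in the `JOB_TYPE` environment variable. The function name is hypothetical, and this only gates the release step; it does not change which credentials the job runs with.

```python
import os


def should_release_images(env=None):
    """Return True only when the current job is a Prow postsubmit."""
    # Allow an explicit mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    return env.get("JOB_TYPE", "") == "postsubmit"
```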

@jlewi
Contributor Author

jlewi commented Apr 17, 2018

So triggering it based on postsubmits is ideal. Here's the challenge. We don't want uncommitted code to be able to write our release artifacts (including nightly builds).

Our test infrastructure runs "minimally vetted code" in the sense that any PR labeled /ok-to-test will have its code run. I don't want unmerged code to be able to write or modify our release artifacts.

Also, the way we currently use Prow to trigger workflows is that the K8s CI bot creates arbitrary Argo workflows. That opens another large hole through which code could be injected to modify our release artifacts.

All of this is solvable, but we would need to think it through and lock it down.

Right now we build our release artifacts in a separate GCP project in a separate GKE cluster, so none of our test infra has access to our release repositories. We just need to figure out how to trigger jobs in that cluster without compromising security.

@jlewi
Contributor Author

jlewi commented Apr 17, 2018

This project looks promising as a way of building container images in-cluster:
https://github.com/GoogleContainerTools/kaniko

@jose5918
Contributor

Postsubmits don't run until the code is already committed, correct? What I meant is that we can release to the staging area during postsubmit and manually release official images.

@jlewi
Contributor Author

jlewi commented Apr 18, 2018

That's correct. But our postsubmit tests run with the same credentials as our presubmits. So in the current setup we can't give our postsubmits access to the staging repository without also granting that permission to code running in presubmits. At that point we could no longer trust the images in the staging area.

@jlewi
Contributor Author

jlewi commented Apr 30, 2018

/assign @kunmingg

@jlewi
Contributor Author

jlewi commented May 14, 2018

@kunmingg Any progress on this in the last sprint?

@kunmingg
Contributor

@jlewi May need another 2 days?

@jlewi
Contributor Author

jlewi commented May 22, 2018

@kunmingg What's the status of this? Which Docker images are now being built regularly?

@kunmingg
Contributor

@jlewi The code is now merged, and I'll turn on the cron job today.
These are the two workflows the cron job currently covers:
https://github.com/kubeflow/kubeflow/blob/master/releasing/prow_config_release.yaml

@jlewi
Contributor Author

jlewi commented May 22, 2018

This is great.

@kunmingg Can you file an issue to add the TFJob operator and bootstrapper images to this release workflow?

@kunmingg After we turn the cron job on, we should verify that it runs successfully. Once we've verified that and have images, we can close this issue.

@jlewi
Contributor Author

jlewi commented May 23, 2018

@kunmingg It looks like the cron job was enabled and new images were built, e.g.
https://console.cloud.google.com/gcr/images/kubeflow-images-public/GLOBAL/tensorflow-1.8.0-notebook-cpu?gcrImageListsize=50

Can we close this?

@kunmingg
Contributor

It seems we have some failed release tasks, such as TF Serving GPU. We need to verify that we have enough hardware resources for automated releases.

@kunmingg
Contributor

Will update the Dockerfile to point to resources within GCP. Creating a new issue for it.

@kunmingg
Contributor

/close
