push-build.sh container image pushes should precede staging GCS artifacts and writing version markers #1693

Closed
justaugustus opened this issue Nov 6, 2020 · 9 comments
Assignees
Labels
area/release-eng Issues or PRs related to the Release Engineering subproject kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/release Categorizes an issue or PR as relevant to SIG Release.

Comments

@justaugustus
Member

What happened:

Tracking issue for https://kubernetes.slack.com/archives/CJH2GBF7Y/p1604669572198400.
Noticed in kubernetes/test-infra#19483.

Our attempts to move the ci-kubernetes-build job to Community Infra are failing because container images are not being pushed successfully.

Comment from @ameukam (kubernetes/test-infra#19483 (comment)):

do this via adding the service account e-mail address to the k8s-infra-staging-ci-images@kubernetes.io group?

ci-kubernetes-build-canary still fails even after the service account is added (see kubernetes/k8s.io#1393) to k8s-infra-staging-ci-images@kubernetes.io : https://testgrid.k8s.io/sig-testing-canaries#build-master-canary

The prow-build service account inherits the permissions of the role roles/cloudbuild.builds.editor as a member of k8s-infra-staging-ci-images@kubernetes.io:

https://github.com/kubernetes/k8s.io/blob/74bfdc5741bdde3b8f489bdd8327474101b3b5e4/infra/gcp/lib.sh#L209-L231

which is not enough to make the job successful.

That's a credential issue that needs to be fixed in parallel.

This issue is specifically for some of my expectations around push-build.sh behavior.

What you expected to happen:

  1. Any build jobs should verify access to the container image registry before proceeding

This is a fail-fast scenario.
If we know that a build is supposed to push GCR images, we should check that we're able to do that first, instead of building artifacts and waiting for the container push to fail at the end of the scenario.
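The fail-fast idea above could be sketched roughly as follows. This is a minimal illustration, not code from push-build.sh: the function name `preflight_registry_access` is hypothetical, and the probe command is supplied by the caller so that the sketch makes no assumption about how access is actually verified.

```shell
#!/bin/sh
# Hypothetical helper; not part of push-build.sh or kubernetes_build.py.
# Usage: preflight_registry_access REGISTRY PROBE [PROBE_ARGS...]
# Runs PROBE with REGISTRY appended as its last argument and returns
# non-zero if the probe fails, so the caller can abort before building.
preflight_registry_access() {
  if [ "$#" -lt 2 ]; then
    echo "usage: preflight_registry_access REGISTRY PROBE [ARGS...]" >&2
    return 2
  fi
  registry="$1"
  shift
  if [ -z "$registry" ]; then
    # No registry configured: no image push is expected, nothing to check.
    return 0
  fi
  if ! "$@" "$registry" >/dev/null 2>&1; then
    echo "FAILED: no access to registry ${registry}; aborting before build" >&2
    return 1
  fi
  echo "OK: registry ${registry} is reachable"
}
```

A job would call this before any build step and exit on failure. Which probe to use is left open here: a read-only listing of the repository would only demonstrate read access, so confirming push permission may need a different check.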

  2. The check for the existence of a build only checks for GCS bucket artifacts, not container images

In scenarios/kubernetes_build.py

https://github.com/kubernetes/test-infra/blob/329444781ba13be597917343cca4aa1b92366b6d/scenarios/kubernetes_build.py#L45-L84

If we consider a "complete" build to also include container images, this check should verify that those exist as well before claiming a build is not required.
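As a rough sketch of what a stricter completeness check might look like: the real check lives in scenarios/kubernetes_build.py, so the shell function below is illustrative only. The name `build_is_complete` and both probe arguments are hypothetical; each probe is a caller-supplied command that exits 0 when that kind of output exists for the given version.

```shell
#!/bin/sh
# Hypothetical sketch; not the logic in scenarios/kubernetes_build.py.
# Usage: build_is_complete VERSION GCS_PROBE IMAGE_PROBE
# A build counts as complete only if BOTH probes succeed.
build_is_complete() {
  version="$1"
  gcs_probe="$2"
  image_probe="$3"
  # Docker tags cannot contain '+', so the image probe gets a sanitized tag.
  image_tag=$(printf '%s' "$version" | tr '+' '_')
  if ! "$gcs_probe" "$version"; then
    echo "build ${version}: GCS artifacts missing"
    return 1
  fi
  if ! "$image_probe" "$image_tag"; then
    echo "build ${version}: container images missing"
    return 1
  fi
  echo "build ${version}: complete"
}
```

With a check of this shape, a build whose images failed to push would be reported as incomplete and rebuilt, rather than being skipped because its GCS artifacts already exist.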

  3. A build should not push artifacts if it cannot guarantee that all of them will be available

The current push-build.sh logic:

release/push-build.sh

Lines 867 to 918 in 4c6b5aa

##############################################################################
common::stepheader COPY RELEASE ARTIFACTS
##############################################################################
attempt=0
while ((attempt<max_attempts)); do
  if $USE_BAZEL; then
    release::gcs::bazel_push_build $GCS_DEST $LATEST $KUBE_ROOT/_output \
      $RELEASE_BUCKET && break
  else
    release::gcs::locally_stage_release_artifacts $LATEST \
      $KUBE_ROOT/_output \
      $FLAGS_release_kind
    if ((FLAGS_fast)); then
      BUILD_DEST="$GCS_DEST/fast"
    else
      BUILD_DEST="$GCS_DEST"
    fi
    release::gcs::push_release_artifacts \
      $KUBE_ROOT/_output/gcs-stage/$LATEST \
      gs://$RELEASE_BUCKET/$BUILD_DEST/$LATEST && break
  fi
  ((attempt++))
done
((attempt>=max_attempts)) && common::exit 1 "Exiting..."

if [[ -n "${FLAGS_docker_registry:-}" ]]; then
  ##############################################################################
  common::stepheader PUSH DOCKER IMAGES
  ##############################################################################
  # TODO: support Bazel too
  # Docker tags cannot contain '+'
  release::docker::release $FLAGS_docker_registry ${LATEST/+/_} \
    $KUBE_ROOT/_output
fi

# If not --ci, then we're done here.
((FLAGS_ci)) || common::exit 0 "Exiting..."

if ! ((FLAGS_noupdatelatest)); then
  ##############################################################################
  common::stepheader UPLOAD to $RELEASE_BUCKET
  ##############################################################################
  attempt=0
  while ((attempt<max_attempts)); do
    release::gcs::publish_version $GCS_DEST $LATEST $KUBE_ROOT/_output \
      $RELEASE_BUCKET $GCS_EXTRA_VERSION_MARKERS && break
    ((attempt++))
  done
  ((attempt>=max_attempts)) && common::exit 1 "Exiting..."
fi

Here, we should probably attempt to publish artifacts in the following order:

  1. container images
  2. GCS artifacts
  3. version marker

That way, if images fail to push, then the build job fails before copying to GCS.
If there's nothing in the bucket, then the build-existence check described in the second item above will always cause a new build to be attempted.
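The proposed ordering could be sketched along these lines. This is an illustration under stated assumptions, not a patch to push-build.sh: `retry` and `publish_in_order` are hypothetical names, the three step arguments stand in for the real image-push, GCS-staging, and version-marker functions, and the attempt count of 3 is an arbitrary placeholder for `max_attempts`. Retrying the image push is also an assumption; the quoted excerpt does not retry it.

```shell
#!/bin/sh
# Hypothetical sketch of the proposed publish order.

# retry MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS attempts have been made.
retry() {
  max_attempts="$1"
  shift
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    "$@" && return 0
    attempt=$((attempt + 1))
  done
  return 1
}

# publish_in_order PUSH_IMAGES PUSH_GCS PUBLISH_MARKER
# Runs the three steps in order and stops at the first one that fails,
# so later steps never run after an earlier failure.
publish_in_order() {
  push_images="$1"
  push_gcs="$2"
  publish_marker="$3"
  retry 3 "$push_images" || {
    echo "image push failed; nothing staged to GCS" >&2
    return 1
  }
  retry 3 "$push_gcs" || {
    echo "GCS staging failed; version marker not written" >&2
    return 1
  }
  retry 3 "$publish_marker" || {
    echo "version marker update failed" >&2
    return 1
  }
}
```

Because the version marker is written last, a failure at any earlier step leaves the marker pointing at the previous complete build.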

@hasheddan -- I'll leave you to divide up the work as appropriate.

/assign @hasheddan @ameukam @cpanato
cc: @kubernetes/release-engineering @spiffxp
/priority critical-urgent

How to reproduce it (as minimally and precisely as possible):

See kubernetes/test-infra#19483.

Anything else we need to know?:

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Others:
@justaugustus justaugustus added kind/bug Categorizes issue or PR as related to a bug. sig/release Categorizes an issue or PR as relevant to SIG Release. area/release-eng Issues or PRs related to the Release Engineering subproject labels Nov 6, 2020
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 6, 2020
@justaugustus
Member Author

FYI @kubernetes/ci-signal, as this broadly explains some build job failures you may be seeing.

@spiffxp
Member

spiffxp commented Nov 6, 2020

There is strong overlap with kubernetes/test-infra#18808

Saving the version marker for last hopefully addresses most of the concerns that prevent us from overwriting incomplete builds.

@spiffxp
Member

spiffxp commented Jan 25, 2021

Where do we stand on trying to make this happen in the v1.21 timeframe?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2021
@mkorbi

mkorbi commented Apr 26, 2021

looks like WIP
will we get this done in 1.22? 👍
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@justaugustus justaugustus assigned puerco and unassigned hasheddan Jun 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 20, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

