BOSKOS sometimes fails to be recreated #3600

Closed
ameukam opened this issue Apr 6, 2022 · 6 comments
Labels
  • area/infra: Infrastructure management, infrastructure design, code in infra/
  • area/prow: Setting up or working with prow in general, prow.k8s.io, prow build clusters
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.

Comments

ameukam (Member) commented Apr 6, 2022

Description

The Boskos instance running inside the community-owned GKE build clusters sometimes fails to get recreated after rescheduling.

This seems to be related to a containerd bug present in the containerd version we are currently running.

ameukam@cloudshell:~ (k8s-infra-prow-build)$ kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build get nodes gke-prow-build-pool5-2021092812495606-3a8095df-5p77 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
containerd://1.4.3

Initial finding: kubernetes/kubernetes#107561
containerd issue: containerd/containerd#4604

Fix merged in containerd 1.6 (containerd/containerd#6478) and backported to 1.5.
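
To check which nodes are still running an affected runtime, the containerd version can be listed for every node at once; a minimal sketch using the same jsonpath field as the check above:

kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'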

Possible solutions

Long term

  • Upgrade the GKE clusters to a version whose containerd contains the fix.
    The fix seems to be in 1.22-x.gke-.y. (still looking for the exact version)
  • Add a new node pool with either bigger disks or SSD boot disks (see the sketch below)
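
A sketch of the node-pool option. The pool name and machine type are hypothetical and would need to match the real cluster setup; --disk-type and --disk-size are standard gcloud flags:

# pool6 and the machine type are hypothetical; copy them from the existing pool
gcloud container node-pools create pool6 \
  --cluster=prow-build --region=us-central1 --project=k8s-infra-prow-build \
  --machine-type=n1-highmem-8 \
  --disk-type=pd-ssd --disk-size=500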

Short term

  • Add a restart policy to the boskos podSpec (see the sketch below)
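
A minimal sketch of what that could look like in the boskos Deployment manifest. Note that a Deployment podSpec only accepts restartPolicy: Always, so in practice a liveness probe is what forces a wedged container to restart; the probe path and port below are assumptions, not the real boskos config:

spec:
  template:
    spec:
      restartPolicy: Always      # the only value a Deployment podSpec accepts
      containers:
      - name: boskos
        livenessProbe:           # assumed probe; path and port are hypothetical
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30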

Rapid Mitigation

In case the incident occurs again, delete the boskos pod:

  • Requirements: access to the build clusters, which is restricted to:

- amerai@google.com
- chaodai@google.com
- colew@google.com
- fejta@google.com
- linusa@google.com # GitHub: listx
- mpherman@google.com
- slchase@google.com
# sig-testing leads
- bentheelder@google.com
- skuznets@redhat.com
- spiffxp@google.com
- spiffxp@gmail.com
# sig-k8s-infra members
- ameukam@gmail.com
- cblecker@gmail.com
- davanum@gmail.com
- thockin@google.com

Get credentials for the build cluster:

gcloud container clusters get-credentials prow-build --region us-central1 --project k8s-infra-prow-build

Locate the boskos pod:

kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build get pods -n test-pods -l app=boskos
NAME                      READY   STATUS    RESTARTS   AGE
boskos-54b6b94f76-z26ml   1/1     Running   0          45m

Delete the pod:

kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build delete pod boskos-54b6b94f76-7v4xw -n test-pods
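
Alternatively, deleting by label avoids copy-pasting the pod name; either way the Deployment recreates the pod:

kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build delete pods -n test-pods -l app=boskos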

/area prow
/area infra
/priority important-soon
/milestone v1.24

@ameukam ameukam added the sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. label Apr 6, 2022
@k8s-ci-robot k8s-ci-robot added area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters area/infra Infrastructure management, infrastructure design, code in infra/ priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 6, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Apr 6, 2022
@ameukam ameukam added this to Needs Triage in sig-k8s-infra via automation Apr 6, 2022
dims (Member) commented Apr 7, 2022

# list the boskos pods
kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build -n test-pods -l app=boskos get pods
# dump full specs/status for every pod in the namespace
kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build -n test-pods get pods -o yaml
# watch the boskos pod status, refreshing every second
watch -n 1 kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build get pods -n test-pods -l app=boskos

# fill in the actual pod name from the listing above
PODID=boskos-FILL_ME_XYZ

# inspect, tail logs, then delete the stuck pod (the Deployment recreates it)
kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build -n test-pods describe pods/$PODID
kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build -n test-pods logs -f $PODID
kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build -n test-pods delete pods/$PODID

ameukam (Member Author) commented Apr 8, 2022

boskos-reaper was also affected by the bug; its pod was stuck in ContainerCreating.

[screenshot: boskos-reaper pod stuck in ContainerCreating]
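
Pods wedged in ContainerCreating remain in the Pending phase, so they can be spotted with a field selector; a sketch:

kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build get pods -n test-pods --field-selector=status.phase=Pending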

The direct consequence is that boskos no longer cleans up GCP resources.

[screenshot: GCP resources accumulating instead of being cleaned up]

With leaked resources accumulating over time, tests could no longer acquire resources.

W0408 03:21:44.813] 2022/04/08 03:21:44 main.go:331: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: resources not found
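
To see how many projects are free versus leased, boskos exposes a per-type /metric endpoint; the service name, port, and resource type below are assumptions about this deployment, so treat this as a sketch:

kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build -n test-pods port-forward svc/boskos 8080:80
# in another shell; "gce-project" is an assumed resource type name
curl 'http://localhost:8080/metric?type=gce-project'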

I recreated the boskos-reaper pod.

ameukam@cloudshell:~ (k8s-infra-oci-proxy-prod)$ kubectl --context gke_k8s-infra-prow-build_us-central1_prow-build get pods -l app=boskos-reaper -n test-pods
NAME                             READY   STATUS    RESTARTS   AGE
boskos-reaper-7b54fdbb8d-8jzb2   1/1     Running   0          86m

Cleanup has been done:

[screenshot: resource cleanup resumed]

I'll try to monitor this over the coming weekend.

ameukam (Member Author) commented May 12, 2022

No incidents so far. Let's keep an eye on it. The long-term solution is impactful for the build clusters.

/milestone v1.25
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label May 12, 2022
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.24, v1.25 May 12, 2022
k8s-triage-robot commented

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
ameukam (Member Author) commented Aug 10, 2022

No incidents reported since. The GKE clusters have been upgraded to a containerd 1.5 version.

I'll consider this solved.
/close

k8s-ci-robot (Contributor) commented

@ameukam: Closing this issue.

In response to this:

No incidents reported since. The GKE clusters have been upgraded to a containerd 1.5 version.

I'll consider this solved.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sig-k8s-infra automation moved this from Backlog (existing infra) to Done Aug 10, 2022
Projects
sig-k8s-infra: Done

Development
No branches or pull requests

4 participants