Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: fix metrics-server deployment on CI jobs #103713

Merged
merged 2 commits into from
Jul 23, 2021

Conversation

aojea
Copy link
Member

@aojea aojea commented Jul 15, 2021

/kind failing-test
/kind flake
Fixes #103708

NONE

The metrics-server deployed wasn't working correctly due to the recent changes in kubernetes.
Surprisingly, the metrics-server was constantly crash looping but the tests were passing.

This PR fixes the metrics-server deployment:

  • adds liveness and readiness probes so if the metrics-server fails, the job fails
  • metrics-server listen on an unprivileged port (4443)
  • allow to use the kubelet secure port (the insecure port was deprecated IIRC)

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 15, 2021
@aojea
Copy link
Member Author

aojea commented Jul 15, 2021

/assign @liggitt
I just checked the other deployments manifests in the cluster/ folder and compared against the coredns one, that uses this capability

Copy link
Contributor

@yangjunmyfm192085 yangjunmyfm192085 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 15, 2021
@aojea
Copy link
Member Author

aojea commented Jul 15, 2021

/test

@k8s-ci-robot
Copy link
Contributor

@aojea: The /test command needs one or more targets.
The following commands are available to trigger jobs:

  • /test pull-kubernetes-conformance-image-test
  • /test pull-kubernetes-conformance-kind-ipv6-parallel
  • /test pull-kubernetes-dependencies
  • /test pull-kubernetes-dependencies-go-canary
  • /test pull-kubernetes-e2e-ipvs-azure-dualstack
  • /test pull-kubernetes-e2e-iptables-azure-dualstack
  • /test pull-kubernetes-files-remake
  • /test pull-kubernetes-e2e-gce
  • /test pull-kubernetes-e2e-gce-no-stage
  • /test pull-kubernetes-e2e-gce-kubetest2
  • /test pull-kubernetes-e2e-gce-canary
  • /test pull-kubernetes-e2e-gce-ubuntu
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd
  • /test pull-kubernetes-e2e-gce-ubuntu-containerd-canary
  • /test pull-kubernetes-e2e-gce-alpha-features
  • /test pull-kubernetes-e2e-gce-device-plugin-gpu
  • /test pull-kubernetes-integration
  • /test pull-kubernetes-integration-go-canary
  • /test pull-kubernetes-cross
  • /test check-dependency-stats
  • /test pull-kubernetes-e2e-kind
  • /test pull-kubernetes-e2e-kind-canary
  • /test pull-kubernetes-e2e-kind-ipv6
  • /test pull-kubernetes-e2e-kind-ipv6-canary
  • /test pull-kubernetes-conformance-kind-ga-only
  • /test pull-kubernetes-conformance-kind-ga-only-parallel
  • /test pull-kubernetes-e2e-kops-aws
  • /test pull-kubernetes-bazel-build-canary
  • /test pull-kubernetes-bazel-test-canary
  • /test pull-kubernetes-bazel-test-integration-canary
  • /test pull-kubernetes-local-e2e
  • /test pull-kubernetes-unit
  • /test pull-kubernetes-unit-experimental
  • /test pull-publishing-bot-validate
  • /test pull-kubernetes-e2e-aks-engine-windows-dockershim
  • /test pull-kubernetes-e2e-aks-engine-windows-containerd
  • /test pull-kubernetes-e2e-aks-engine-azure-disk-windows-dockershim
  • /test pull-kubernetes-e2e-aks-engine-azure-file-windows-dockershim
  • /test pull-kubernetes-e2e-capz-windows-dockershim
  • /test pull-kubernetes-e2e-aks-engine-gpu-windows-dockershim
  • /test pull-kubernetes-e2e-aks-engine-azure-disk-windows-containerd
  • /test pull-kubernetes-e2e-aks-engine-azure-file-windows-containerd
  • /test pull-kubernetes-e2e-capz-azure-disk
  • /test pull-kubernetes-e2e-capz-azure-disk-vmss
  • /test pull-kubernetes-e2e-capz-azure-file
  • /test pull-kubernetes-e2e-capz-azure-file-vmss
  • /test pull-kubernetes-e2e-capz-conformance
  • /test pull-kubernetes-e2e-capz-ha-control-plane
  • /test pull-kubernetes-e2e-gce-network-proxy-http-connect
  • /test pull-kubernetes-e2e-gce-network-proxy-grpc
  • /test pull-kubernetes-e2e-gci-gce-autoscaling
  • /test pull-kubernetes-e2e-kind-dual-canary
  • /test pull-kubernetes-e2e-kind-ipvs-dual-canary
  • /test pull-kubernetes-e2e-gci-gce-ingress
  • /test pull-kubernetes-e2e-ubuntu-gce-network-policies
  • /test pull-kubernetes-e2e-gci-gce-ipvs
  • /test pull-kubernetes-node-e2e
  • /test pull-kubernetes-node-e2e-podutil
  • /test pull-kubernetes-e2e-containerd-gce
  • /test pull-kubernetes-node-e2e-containerd
  • /test pull-kubernetes-node-e2e-containerd-features
  • /test pull-kubernetes-node-e2e-alpha
  • /test pull-kubernetes-node-kubelet-serial
  • /test pull-kubernetes-node-kubelet-eviction
  • /test pull-kubernetes-node-kubelet-serial-cpu-manager
  • /test pull-kubernetes-node-kubelet-serial-topology-manager
  • /test pull-kubernetes-node-kubelet-serial-hugepages
  • /test pull-kubernetes-node-crio-cgrpv2-e2e
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
  • /test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
  • /test pull-kubernetes-node-crio-e2e
  • /test pull-kubernetes-node-kubelet-serial-memory-manager
  • /test pull-kubernetes-node-memoryqos-cgrpv2
  • /test pull-kubernetes-node-swap-ubuntu
  • /test pull-kubernetes-node-swap-fedora
  • /test pull-kubernetes-e2e-gce-100-performance
  • /test pull-kubernetes-e2e-gce-big-performance
  • /test pull-kubernetes-e2e-gce-correctness
  • /test pull-kubernetes-e2e-gce-large-performance
  • /test pull-kubernetes-kubemark-e2e-gce-big
  • /test pull-kubernetes-kubemark-e2e-gce-scale
  • /test pull-kubernetes-e2e-gce-storage-slow
  • /test pull-kubernetes-e2e-gce-storage-snapshot
  • /test pull-kubernetes-e2e-gce-csi-serial
  • /test pull-kubernetes-e2e-gce-iscsi
  • /test pull-kubernetes-e2e-gce-iscsi-serial
  • /test pull-kubernetes-e2e-gce-storage-disruptive
  • /test pull-kubernetes-typecheck
  • /test pull-kubernetes-verify-govet-levee
  • /test pull-kubernetes-verify
  • /test pull-kubernetes-verify-go-canary
  • /test pull-kubernetes-e2e-windows-gce

Use /test all to run the following jobs:

  • pull-kubernetes-dependencies
  • pull-kubernetes-e2e-gce-ubuntu-containerd
  • pull-kubernetes-integration
  • pull-kubernetes-e2e-kind
  • pull-kubernetes-e2e-kind-ipv6
  • pull-kubernetes-conformance-kind-ga-only-parallel
  • pull-kubernetes-unit
  • pull-kubernetes-node-e2e-containerd
  • pull-kubernetes-node-crio-e2e
  • pull-kubernetes-e2e-gce-100-performance
  • pull-kubernetes-typecheck
  • pull-kubernetes-verify-govet-levee
  • pull-kubernetes-verify

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Copy link
Member Author

aojea commented Jul 15, 2021

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2021
@aojea
Copy link
Member Author

aojea commented Jul 15, 2021

/test pull-kubernetes-e2e-gce-iscsi
this is the only job in presubmit I've found using docker as runtime

@dashpole
Copy link
Contributor

/triage accepted
/assign @dgrisonnet

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 15, 2021
@aojea
Copy link
Member Author

aojea commented Jul 21, 2021

/retest

@aojea aojea force-pushed the metrics-server-443 branch from 38092b9 to fb2f0d2 Compare July 21, 2021 21:49
@aojea
Copy link
Member Author

aojea commented Jul 21, 2021

/test pull-kubernetes-e2e-gce-ubuntu

Antonio Ojea added 2 commits July 22, 2021 01:06
@aojea aojea force-pushed the metrics-server-443 branch from 3c3b0a1 to 0610968 Compare July 21, 2021 23:14
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 21, 2021
Copy link
Contributor

@yangjunmyfm192085 yangjunmyfm192085 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/test pull-kubernetes-integration

@aojea
Copy link
Member Author

aojea commented Jul 22, 2021

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gce-ubuntu

@aojea
Copy link
Member Author

aojea commented Jul 22, 2021

it works using the latest image , however, ti seems that the setcap fix was backported to the 0.4 branch and promoted correctly

https://github.com/kubernetes/k8s.io/blob/main/k8s.gcr.io/images/k8s-staging-metrics-server/images.yaml

@serathius does it makes any sense to you?
is ok to move forward using 0.5.0?

I0722 06:55:59.465] Jul 22 06:55:53.209: INFO: 
I0722 06:55:59.465] Jul 22 06:55:55.216: INFO: The status of Pod metrics-server-v0.5.0-6554f5dbd8-9j7xx is Running (Ready = false), waiting for it to be either Running (with Ready = true) or Failed
I0722 06:55:59.465] Jul 22 06:55:55.216: INFO: 27 / 28 pods in namespace 'kube-system' are running and ready (68 seconds elapsed)
I0722 06:55:59.465] Jul 22 06:55:55.216: INFO: expected 6 pod replicas in namespace 'kube-system', 5 are Running and Ready.
I0722 06:55:59.465] Jul 22 06:55:55.216: INFO: POD                                     NODE                                    PHASE    GRACE  CONDITIONS
I0722 06:55:59.466] Jul 22 06:55:55.218: INFO: metrics-server-v0.5.0-6554f5dbd8-9j7xx  e2e-b693b71e40-3cd98-minion-group-zg11  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-07-22 06:54:07 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-07-22 06:54:07 +0000 UTC ContainersNotReady containers with unready status: [metrics-server]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-07-22 06:54:07 +0000 UTC ContainersNotReady containers with unready status: [metrics-server]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-07-22 06:54:07 +0000 UTC  }]
I0722 06:55:59.466] Jul 22 06:55:55.218: INFO: 
I0722 06:55:59.466] Jul 22 06:55:57.235: INFO: 28 / 28 pods in namespace 'kube-system' are running and ready (70 seconds elapsed)
I0722 06:55:59.466] Jul 22 06:55:57.235: INFO: expected 6 pod replicas in namespace 'kube-system', 5 are Running and Ready.
I0722 06:55:59.466] Jul 22 06:55:57.235: INFO: POD  NODE  PHASE  GRACE  CONDITIONS

@dims
Copy link
Member

dims commented Jul 22, 2021

/approve
/lgtm

@aojea if you need this for 1.22, please add a milestone and give a description/justification for @kubernetes/release-team as a comment.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 22, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2021
@aojea
Copy link
Member Author

aojea commented Jul 22, 2021

/approve
/lgtm

@aojea if you need this for 1.22, please add a milestone and give a description/justification for @kubernetes/release-team as a comment.

up to the @kubernetes/release-team folks, this fixes a release informing job #103708

@dgrisonnet
Copy link
Member

dgrisonnet commented Jul 22, 2021

@aojea To my knowledge, there are no differences between v0.4.4 and v0.5.0 that would change the behavior that we were seeing before. So there should be no risks with moving forward with v0.5.0 with the exception of maybe missing a real bug with dockershim.

In case I double-checked the capabilities of the official images and they are the same:

$ docker cp $(docker create --rm k8s.gcr.io/metrics-server/metrics-server:v0.4.4):metrics-server /tmp/metrics-server && getcap /tmp/metrics-server
/tmp/metrics-server cap_net_bind_service=ep
$ docker cp $(docker create --rm k8s.gcr.io/metrics-server/metrics-server:v0.5.0):metrics-server /tmp/metrics-server && getcap /tmp/metrics-server 
/tmp/metrics-server cap_net_bind_service=ep

@saschagrunert
Copy link
Member

up to the @kubernetes/release-team folks, this fixes a release informing job #103708

Yes, please backport to release-1.22.

@liggitt
Copy link
Member

liggitt commented Jul 22, 2021

/lgtm

@dims
Copy link
Member

dims commented Jul 23, 2021

looks like there is consensus to apply this in 1.22 and fail a cherry-pick.

/milestone v1.22

@aojea please file a cherry-pick to 1.22 branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/provider/gcp Issues or PRs related to gcp provider area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaking Test] metrics-server not starting in BeforeSuite (ci-kubernetes-e2e-ubuntu-gce)