wg-k8s-infra: canary prowjobs for sig-scalability #22430

Merged · 2 commits · Jun 18, 2021
366 changes: 366 additions & 0 deletions config/jobs/kubernetes/wg-k8s-infra/wg-k8s-infra-canaries.yaml
@@ -0,0 +1,366 @@
periodics:
- cron: '1 12 * * *' # Run daily at 4:01PST (12:01 UTC)
name: ci-kubernetes-e2e-gce-scale-correctness-canary
cluster: k8s-infra-prow-build
labels:
preset-service-account: "true"
preset-k8s-ssh: "true"
preset-e2e-scalability-common: "true"
preset-e2e-scalability-periodics: "true"
preset-e2e-scalability-periodics-master: "true"
decorate: true
decoration_config:
timeout: 270m
annotations:
testgrid-dashboards: wg-k8s-infra-canaries
testgrid-tab-name: gce-master-scale-correctness-canary
description: "Uses kubetest to run correctness tests against a 5000-node cluster created with cluster/kube-up.sh"
spec:
containers:
- image: gcr.io/k8s-testimages/kubekins-e2e:v20210601-ea6aa4e-master
command:
- runner.sh
- /workspace/scenarios/kubernetes_e2e.py
args:
- --cluster=gce-scale-cluster
- --env=CONCURRENT_SERVICE_SYNCS=5
- --env=HEAPSTER_MACHINE_TYPE=e2-standard-32
- --extract=ci/latest-fast
- --extract-ci-bucket=k8s-release-dev
# Overrides CONTROLLER_MANAGER_TEST_ARGS from preset-e2e-scalability-periodics.
- --env=CONTROLLER_MANAGER_TEST_ARGS=--profiling --kube-api-qps=100 --kube-api-burst=100 --endpointslice-updates-batch-period=500ms --endpoint-updates-batch-period=500ms
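# (The --endpoint*-updates-batch-period flags make kube-controller-manager batch
# endpoint/EndpointSlice writes, presumably to cut apiserver churn at this scale.)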
- --gcp-master-image=gci
- --gcp-node-image=gci
- --gcp-node-size=e2-small
- --gcp-nodes=5000
Member:
We'll need a quota request for this. I'd rather avoid giving all scalability projects this kind of quota; we should pin to a specific project instead.

Member Author:
@spiffxp I'll make the quota requests in k8s-infra-e2e-scale-5k-project. See kubernetes/k8s.io#2225
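(For rough scale, assuming e2-small's 2 shared-core vCPUs: the node pool alone comes to 5000 × 2 = 10,000 vCPUs, which is why a dedicated high-quota project is needed rather than the shared scalability projects.)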

- --gcp-project=k8s-infra-e2e-scale-5k-project
- --gcp-ssh-proxy-instance-name=gce-scale-cluster-master
- --gcp-zone=us-east1-b
- --ginkgo-parallel=40
- --provider=gce
- --test_args=--ginkgo.skip=\[Serial\]|\[Disruptive\]|\[Flaky\]|\[Feature:.+\]|\[DisabledForLargeClusters\] --minStartupPods=8 --node-schedulable-timeout=90m
- --timeout=240m
- --use-logexporter
- --logexporter-gcs-path=gs://k8s-infra-scalability-tests-logs/$(JOB_NAME)/$(BUILD_ID)
resources:
requests:
cpu: 6
memory: "48Gi"
limits:
cpu: 6
memory: "48Gi"

- cron: '1 17 * * *' # Run daily at 9:01PST (17:01 UTC)
name: ci-kubernetes-e2e-gce-scale-performance-canary
tags:
- "perfDashPrefix: gce-5000Nodes"
- "perfDashBuildsCount: 270"
- "perfDashJobType: performance"
cluster: k8s-infra-prow-build
labels:
preset-service-account: "true"
preset-k8s-ssh: "true"
preset-e2e-scalability-common: "true"
preset-e2e-scalability-periodics: "true"
preset-e2e-scalability-periodics-master: "true"
decorate: true
decoration_config:
timeout: 450m
extra_refs:
- org: kubernetes
repo: kubernetes
base_ref: master
path_alias: k8s.io/kubernetes
- org: kubernetes
repo: perf-tests
base_ref: master
path_alias: k8s.io/perf-tests
annotations:
testgrid-dashboards: wg-k8s-infra-canaries
testgrid-tab-name: gce-master-scale-performance-canary
spec:
containers:
- image: gcr.io/k8s-testimages/kubekins-e2e:v20210601-ea6aa4e-master
command:
- runner.sh
- /workspace/scenarios/kubernetes_e2e.py
args:
- --cluster=gce-scale-cluster
- --env=HEAPSTER_MACHINE_TYPE=e2-standard-32
# TODO(mborsz): Adjust or remove this change once we understand coredns
# memory usage regression.
- --env=KUBE_DNS_MEMORY_LIMIT=300Mi
- --extract=ci/latest-fast
- --extract-ci-bucket=k8s-release-dev
- --gcp-nodes=5000
Member:
Same. This should use the same project as correctness. What's the node type used here?

Member Author:
Took a look at a successful job of ci-kubernetes-e2e-gce-scale-performance: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1401585162642264064/build-log.txt.
The node type appears to be e2-standard-32.

- --gcp-project=k8s-infra-e2e-scale-5k-project
- --gcp-zone=us-east1-b
- --provider=gce
- --metadata-sources=cl2-metadata.json
- --env=CL2_LOAD_TEST_THROUGHPUT=50
- --env=CL2_DELETE_TEST_THROUGHPUT=30
- --env=CL2_ENABLE_HUGE_SERVICES=true
# Overrides CONTROLLER_MANAGER_TEST_ARGS from preset-e2e-scalability-periodics.
- --env=CONTROLLER_MANAGER_TEST_ARGS=--profiling --kube-api-qps=100 --kube-api-burst=100 --endpointslice-updates-batch-period=500ms --endpoint-updates-batch-period=500ms
- --env=CL2_ENABLE_API_AVAILABILITY_MEASUREMENT=true
- --env=CL2_API_AVAILABILITY_PERCENTAGE_THRESHOLD=99.5
- --test=false
- --test-cmd=$GOPATH/src/k8s.io/perf-tests/run-e2e.sh
- --test-cmd-args=cluster-loader2
- --test-cmd-args=--experimental-gcp-snapshot-prometheus-disk=true
- --test-cmd-args=--experimental-prometheus-disk-snapshot-name=$(JOB_NAME)-$(BUILD_ID)
- --test-cmd-args=--nodes=5000
- --test-cmd-args=--prometheus-scrape-node-exporter
- --test-cmd-args=--provider=gce
- --test-cmd-args=--report-dir=$(ARTIFACTS)
- --test-cmd-args=--testconfig=testing/load/config.yaml
- --test-cmd-args=--testconfig=testing/access-tokens/config.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/enable_restart_count_check.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/ignore_known_gce_container_restarts.yaml
- --test-cmd-args=--testoverrides=./testing/overrides/5000_nodes.yaml
- --test-cmd-name=ClusterLoaderV2
- --timeout=420m
- --use-logexporter
- --logexporter-gcs-path=gs://k8s-infra-scalability-tests-logs/$(JOB_NAME)/$(BUILD_ID)
resources:
requests:
cpu: 6
memory: "16Gi"
limits:
cpu: 6
memory: "16Gi"

- name: ci-kubernetes-kubemark-gce-scale-canary
tags:
- "perfDashPrefix: kubemark-5000Nodes"
- "perfDashJobType: performance"
# Run twice a day (at 00:01 and 16:01 UTC) on the odd days of each month. The
# job is expected to take ~12-14h, hence the 16 hour gap.
cron: '1 0,16 1-31/2 * *'
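# Cron fields: minute, hour, day-of-month, month, day-of-week. '1-31/2' steps
# through every second day starting from the 1st, i.e. the odd-numbered days
# mentioned above.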
labels:
preset-service-account: "true"
preset-k8s-ssh: "true"
preset-dind-enabled: "true"
preset-e2e-kubemark-common: "true"
preset-e2e-kubemark-gce-scale: "true"
preset-e2e-scalability-periodics: "true"
preset-e2e-scalability-periodics-master: "true"
decorate: true
decoration_config:
timeout: 1100m
extra_refs:
- org: kubernetes
repo: kubernetes
base_ref: master
path_alias: k8s.io/kubernetes
- org: kubernetes
repo: perf-tests
base_ref: master
path_alias: k8s.io/perf-tests
annotations:
testgrid-dashboards: wg-k8s-infra-canaries
testgrid-tab-name: kubemark-5000-canary
testgrid-num-columns-recent: '3'
spec:
containers:
- image: gcr.io/k8s-testimages/kubekins-e2e:v20210601-ea6aa4e-master
command:
- runner.sh
- /workspace/scenarios/kubernetes_e2e.py
args:
- --cluster=kubemark-5000
- --extract=ci/latest
- --gcp-node-image=gci
- --gcp-node-size=e2-standard-8
- --gcp-nodes=84
Member:
Honestly not sure if this will need a quota increase or not.

Member Author:
No need to increase the quota for this job; the smallest quota in us-east1 is 1250 (for all the projects with type scalability-project).
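(Assuming the 1250 figure refers to the regional CPU quota: 84 e2-standard-8 nodes use 84 × 8 = 672 vCPUs, comfortably within that limit.)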

- --gcp-project=k8s-infra-e2e-scale-5k-project
- --gcp-zone=us-east1-b
- --kubemark
- --kubemark-nodes=5000
- --provider=gce
- --metadata-sources=cl2-metadata.json
- --test=false
- --test_args=--ginkgo.focus=xxxx
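# With --test=false the upstream e2e suite is skipped; the ginkgo focus above
# is presumably a deliberately non-matching placeholder.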
- --test-cmd=$GOPATH/src/k8s.io/perf-tests/run-e2e.sh
- --test-cmd-args=cluster-loader2
- --test-cmd-args=--experimental-gcp-snapshot-prometheus-disk=true
- --test-cmd-args=--experimental-prometheus-disk-snapshot-name=$(JOB_NAME)-$(BUILD_ID)
- --test-cmd-args=--nodes=5000
- --test-cmd-args=--provider=kubemark
- --env=CL2_ENABLE_PVS=false # TODO(https://github.com/kubernetes/perf-tests/issues/803): Fix me
- --test-cmd-args=--report-dir=$(ARTIFACTS)
- --test-cmd-args=--testconfig=testing/load/config.yaml
- --test-cmd-args=--testconfig=testing/access-tokens/config.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/enable_restart_count_check.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/ignore_known_kubemark_container_restarts.yaml
- --test-cmd-args=--testoverrides=./testing/overrides/kubemark_5000_nodes.yaml
- --test-cmd-name=ClusterLoaderV2
- --timeout=1080m
- --use-logexporter
- --logexporter-gcs-path=gs://k8s-infra-scalability-tests-logs/$(JOB_NAME)/$(BUILD_ID)
# docker-in-docker needs privileged mode
securityContext:
privileged: true
resources:
requests:
cpu: 6
memory: "16Gi"
limits:
cpu: 6
memory: "16Gi"

- name: ci-kubernetes-kubemark-gce-scale-scheduler-canary
tags:
- "perfDashPrefix: kubemark-5000Nodes-scheduler"
- "perfDashJobType: performance"
# Run at 10:01 UTC on the even days of each month. There will be ample time
# between the kubemark-5000Nodes job (expected to start at 16:01 UTC the
# previous day and finish in around ~12-14 hours) and this job. This job is
# expected to take ~6-8 hours, which should allow it to finish well before
# the next kubemark-5000Nodes job (at 00:01 UTC).
cron: '1 10 2-31/2 * *'
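# '2-31/2' steps through every second day starting from the 2nd, i.e. the
# even-numbered days mentioned above.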
labels:
preset-service-account: "true"
preset-k8s-ssh: "true"
preset-dind-enabled: "true"
preset-e2e-kubemark-common: "true"
preset-e2e-kubemark-gce-scale: "true"
preset-e2e-scalability-periodics: "true"
preset-e2e-scalability-periodics-master: "true"
decorate: true
decoration_config:
timeout: 1100m
extra_refs:
- org: kubernetes
repo: kubernetes
base_ref: master
path_alias: k8s.io/kubernetes
- org: kubernetes
repo: perf-tests
base_ref: master
path_alias: k8s.io/perf-tests
annotations:
testgrid-dashboards: wg-k8s-infra-canaries
testgrid-tab-name: kubemark-5000-scheduler-canaries
testgrid-num-columns-recent: '3'
spec:
containers:
- image: gcr.io/k8s-testimages/kubekins-e2e:v20210601-ea6aa4e-master
command:
- runner.sh
- /workspace/scenarios/kubernetes_e2e.py
args:
- --cluster=kubemark-5000
- --extract=ci/latest
- --gcp-node-image=gci
- --gcp-node-size=e2-standard-8
- --gcp-nodes=84
- --gcp-project=k8s-infra-e2e-scale-5k-project
- --gcp-zone=us-east1-b
- --kubemark
- --kubemark-nodes=5000
- --provider=gce
- --metadata-sources=cl2-metadata.json
- --test=false
- --test_args=--ginkgo.focus=xxxx
- --test-cmd=$GOPATH/src/k8s.io/perf-tests/run-e2e.sh
- --test-cmd-args=cluster-loader2
- --test-cmd-args=--experimental-gcp-snapshot-prometheus-disk=true
- --test-cmd-args=--experimental-prometheus-disk-snapshot-name=$(JOB_NAME)-$(BUILD_ID)
- --test-cmd-args=--nodes=5000
- --test-cmd-args=--provider=kubemark
- --env=CL2_ENABLE_PVS=false # TODO(https://github.com/kubernetes/perf-tests/issues/803): Fix me
- --test-cmd-args=--report-dir=$(ARTIFACTS)
- --test-cmd-args=--testsuite=testing/density/scheduler-suite.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/enable_restart_count_check.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/ignore_known_kubemark_container_restarts.yaml
- --test-cmd-args=--testoverrides=./testing/overrides/kubemark_5000_nodes.yaml
- --test-cmd-name=ClusterLoaderV2
- --timeout=1080m
- --use-logexporter
- --logexporter-gcs-path=gs://k8s-infra-scalability-tests-logs/$(JOB_NAME)/$(BUILD_ID)
# docker-in-docker needs privileged mode
securityContext:
privileged: true
resources:
requests:
cpu: 6
memory: "16Gi"

- interval: 4h
name: ci-golang-tip-k8s-1-18-canary
cluster: k8s-infra-prow-build
tags:
- "perfDashPrefix: golang-tip-k8s-1-18"
- "perfDashBuildsCount: 240"
- "perfDashJobType: performance"
labels:
preset-service-account: "true"
preset-k8s-ssh: "true"
preset-dind-enabled: "true"
preset-e2e-kubemark-common: "true"
preset-e2e-kubemark-gce-scale: "true"
preset-e2e-scalability-periodics: "true"
preset-e2e-scalability-periodics-master: "true"
decorate: true
decoration_config:
timeout: 210m
extra_refs:
- org: kubernetes
repo: perf-tests
base_ref: master
base_sha: 39a6c09ddca620a430d38e5de1400844ea954c2f # head of perf-tests' master as of 2020-11-06
path_alias: k8s.io/perf-tests
annotations:
testgrid-dashboards: wg-k8s-infra-canaries
testgrid-tab-name: ci-golang-tip-k8s-1-18-canary
spec:
containers:
- image: gcr.io/k8s-testimages/kubekins-e2e:v20210601-ea6aa4e-master
command:
- runner.sh
- /workspace/scenarios/kubernetes_e2e.py
args:
- --cluster=gce-golang
- --env=CL2_ENABLE_PVS=false
- --env=CL2_LOAD_TEST_THROUGHPUT=50
- --env=KUBEMARK_CONTROLLER_MANAGER_TEST_ARGS=--profiling --kube-api-qps=200 --kube-api-burst=200
- --env=KUBEMARK_SCHEDULER_TEST_ARGS=--profiling --kube-api-qps=200 --kube-api-burst=200
- --extract=gs://k8s-scale-golang-build/ci/latest-1.18.txt
- --gcp-node-size=e2-standard-8
- --gcp-nodes=50
- --gcp-project=k8s-infra-e2e-scale-5k-project
- --gcp-zone=us-east1-b
- --provider=gce
- --kubemark
- --kubemark-nodes=2500
- --test=false
- --test-cmd=$GOPATH/src/k8s.io/perf-tests/golang/run-e2e.sh
- --test-cmd-args=cluster-loader2
- --test-cmd-args=--experimental-gcp-snapshot-prometheus-disk=true
- --test-cmd-args=--experimental-prometheus-disk-snapshot-name=$(JOB_NAME)-$(BUILD_ID)
- --test-cmd-args=--nodes=2500
- --test-cmd-args=--provider=kubemark
- --test-cmd-args=--report-dir=$(ARTIFACTS)
- --test-cmd-args=--testconfig=testing/load/config.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/enable_restart_count_check.yaml
- --test-cmd-args=--testoverrides=./testing/experiments/ignore_known_kubemark_container_restarts.yaml
- --test-cmd-args=--testoverrides=./testing/overrides/5000_nodes.yaml
- --test-cmd-args=--testoverrides=./testing/load/golang/custom_api_call_thresholds.yaml
- --test-cmd-name=ClusterLoaderV2
- --timeout=180m
- --use-logexporter
- --logexporter-gcs-path=gs://k8s-infra-scalability-tests-logs/$(JOB_NAME)/$(BUILD_ID)
env:
- name: CL2_ENABLE_VIOLATIONS_FOR_API_CALL_PROMETHEUS
value: "true"
# docker-in-docker needs privileged mode
securityContext:
privileged: true
resources:
requests:
cpu: 6
memory: "16Gi"
limits:
cpu: 6
memory: "16Gi"
2 changes: 2 additions & 0 deletions config/testgrids/kubernetes/wg-k8s-infa/config.yaml
@@ -4,6 +4,8 @@ dashboard_groups:
- name: wg-k8s-infra
dashboard_names:
- wg-k8s-infra-k8sio
- wg-k8s-infra-canaries

dashboards:
- name: wg-k8s-infra-k8sio
- name: wg-k8s-infra-canaries