kube-proxy iptables partial restore failures #2182

Merged · 1 commit · Nov 18, 2022

Conversation

aojea (Member) commented Nov 7, 2022

kube-proxy implemented a new feature to do only a partial restore of iptables rules, to improve its performance.

The feature adds a new metric that signals to the operator that something went wrong during the partial restore. If this metric is different from zero, it is most probably a bug in the code.

/kind feature

Ref: https://github.com/kubernetes/enhancements/pull/3454/files#r979617744
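
For context, a minimal sketch of how an operator could watch this counter with a Prometheus alerting rule. The group name, alert name, threshold and annotation below are purely illustrative; only the metric name comes from this change.

```yaml
# Illustrative alerting rule; only the metric name is taken from this PR.
groups:
- name: kube-proxy.rules            # illustrative group name
  rules:
  - alert: KubeProxyIPTablesPartialRestoreFailures   # illustrative alert name
    # Fires if any partial restore has ever failed and fallen back to a full restore.
    expr: sum(kubeproxy_sync_proxy_rules_iptables_partial_restore_failures_total) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: kube-proxy partial iptables restores are failing; most likely a bug.
```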

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 7, 2022
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 7, 2022
aojea (Member, Author) commented Nov 7, 2022

/hold

I need help with the unit test; I don't know if the query is wrong or what is happening, but the executor in the test doesn't get any samples.
However, I can run sum(kubeproxy_sync_proxy_rules_iptables_partial_restore_failures_total) in a local cluster to obtain the total for that metric. It is also documented in the Prometheus documentation.

/assign @wojtek-t @marseel
/cc @danwinship

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2022
@aojea aojea force-pushed the kproxy_iptables_fail branch from 7ccdf66 to 9fe2360 Compare November 7, 2022 00:59
marseel (Member) commented Nov 7, 2022

Could you take a look, please?
/assign @Argh4k

k8s-ci-robot (Contributor)

@marseel: GitHub didn't allow me to assign the following users: Argh4k.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

> Could you take a look, please?
> /assign @Argh4k

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea aojea force-pushed the kproxy_iptables_fail branch 2 times, most recently from 39138a5 to e17d533 Compare November 7, 2022 15:34
danwinship commented:
I don't know Prometheus well, but I'll just point out that you also need to set the MinimizeIPTablesRestore feature gate when starting kube-proxy, or the metric is guaranteed to be 0 anyway. (The metric itself is not feature-gated, so it will exist; it's just that it would never be non-0.)
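
For reference, a minimal sketch of enabling that gate through the kube-proxy component config; this snippet is illustrative and not part of this PR (how a given job wires the config through is an assumption).

```yaml
# Sketch: enable the gate via kube-proxy component configuration.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  MinimizeIPTablesRestore: true
# Equivalently, kube-proxy can be started with --feature-gates=MinimizeIPTablesRestore=true.
```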

aojea (Member, Author) commented Nov 7, 2022

> I don't know Prometheus well, but I'll just point out that you also need to set the MinimizeIPTablesRestore feature gate when starting kube-proxy, or the metric is guaranteed to be 0 anyway. (The metric itself is not feature-gated, so it will exist; it's just that it would never be non-0.)

/hold

Ah, good point. How can I make this test run only when this feature gate is set?

@aojea aojea force-pushed the kproxy_iptables_fail branch from e17d533 to 7ab5384 Compare November 8, 2022 12:08
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 8, 2022
aojea (Member, Author) commented Nov 8, 2022

/hold cancel

PTAL

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2022
danwinship commented:
> Ah, good point. How can I make this test run only when this feature gate is set?

Well, I was more concerned in the other direction; you need to set the feature gate in the test run before this test is even useful.

But anyway, if the feature gate is disabled, then the metric is guaranteed to be 0, so the test will always pass; it doesn't matter that much if you just always run it...

aojea (Member, Author) commented Nov 8, 2022

> Well, I was more concerned in the other direction; you need to set the feature gate in the test run before this test is even useful.

I wasn't thinking that far ahead; I just wanted to have a test that only fails when it should ... now that we have that, we have to figure out how to set the feature gate on the jobs 😄

mborsz (Member) commented Nov 8, 2022

> Well, I was more concerned in the other direction; you need to set the feature gate in the test run before this test is even useful.

> I wasn't thinking that far ahead; I just wanted to have a test that only fails when it should ... now that we have that, we have to figure out how to set the feature gate on the jobs 😄

If you want a one-off run, then an approach like the one in kubernetes/kubernetes#113695 is probably preferred -- you open a PR with an artificial feature-gate change and run the presubmit there.

If you want to run periodics, you can set the KUBE_FEATURE_GATES variable for CI jobs: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presets.yaml#L133
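
A minimal sketch of what such a preset entry could look like in the test-infra job config; the preset label below is hypothetical, and only the KUBE_FEATURE_GATES variable and the gate name come from this thread.

```yaml
# Hypothetical test-infra preset; jobs opting in via the label get the env var.
presets:
- labels:
    preset-kube-proxy-minimize-iptables-restore: "true"   # hypothetical label
  env:
  - name: KUBE_FEATURE_GATES
    value: "MinimizeIPTablesRestore=true"
```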

aojea (Member, Author) commented Nov 8, 2022

> If you want to run periodics, you can set the KUBE_FEATURE_GATES variable for CI jobs: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presets.yaml#L133

I think that in the KEP we kind of agreed to run periodics, so we can have enough signal for graduation.

aojea (Member, Author) commented Nov 8, 2022

I think that this will make it: kubernetes/test-infra#27927

aojea (Member, Author) commented Nov 8, 2022

Is the error related? I can't see how.

F1108 15:01:44.087597   29265 clusterloader.go:303] Cluster verification error: no schedulable nodes in the cluster
2022/11/08 15:01:44 process.go:155: Step '/home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --nodes=100 --prometheus-scrape-node-exporter --provider=kubemark --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/huge-service/config.yaml --testconfig=testing/access-tokens/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/use_simple_latency_query.yaml --testoverrides=./testing/overrides/kubemark_load_throughput.yaml' finished in 3m33.769282639s
2022/11/08 15:01:44 e2e.go:776: Dumping logs for kubemark master to GCS directly at path: gs://sig-scalability-logs/pull-perf-tests-clusterloader2-kubemark/1589991412294750208
2022/11/08 15:01:44 process.go:153: Running: /workspace/log-dump.sh /logs/artifacts gs://sig-scalability-logs/pull-perf-tests-clusterloader2-kubemark/1589991412294750208
Checking for custom logdump instances, if any
Using gce provider, skipping check for LOG_DUMP_SSH_KEY and LOG_DUMP_SSH_USER
Project: k8s-presubmit-scale-6
Network Project: k8s-presubmit-scale-6
Zone: us-east1-b
Dumping logs temporarily to '/tmp/tmp.4xBe9oJ531/logs'. Will upload to 'gs://sig-scalability-logs/pull-perf-tests-clusterloader2-kubemark/1589991412294750208' later.
Dumping logs from master locally to '/tmp/tmp.4xBe9oJ531/logs'
Changing logfiles to be world-readable for download
Copying 'kube-apiserver.log kube-apiserver-audit.log kube-scheduler.log kube-controller-manager.log etcd.log etcd-events.log glbc.log cluster-autoscaler.log kube-addon-manager.log konnectivity-server.log fluentd.log kubelet.cov startupscript.log' from e2e-2182-0bf8a-kubemark-master
Specify --start=74404 in the next get-serial-port-output invocation to get only the new output starting from here.
scp: /var/log/glbc.log*: No such file or directory
scp: /var/log/cluster-autoscaler.log*: No such file or directory
scp: /var/log/konnectivity-server.log*: No such file or directory
scp: /var/log/fluentd.log*: No such file or directory
scp: /var/log/kubelet.cov*: No such file or directory
scp: /var/log/startupscript.log*: No such file or directory
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

aojea (Member, Author) commented Nov 18, 2022

/retest

aojea (Member, Author) commented Nov 18, 2022

@mborsz can you review again?

mborsz (Member) commented Nov 18, 2022

/lgtm
/approve

Thanks, looks good!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2022
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, mborsz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot merged commit d677b5c into kubernetes:master Nov 18, 2022