kube-proxy iptables partial restore failures #2182

Merged · 1 commit · Nov 18, 2022

Conversation

aojea (Member) commented Nov 7, 2022

kube-proxy implemented a new feature to do only a partial restore of iptables rules, to improve its performance.

The feature adds a new metric that signals to the operator that something went wrong during the partial restore. If this metric is different from zero, it is most probably a bug in the code.

/kind feature

Ref: https://github.com/kubernetes/enhancements/pull/3454/files#r979617744
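
For context, a minimal sketch of how an operator could watch this counter with a Prometheus alerting rule. The group name, alert name, threshold and annotation below are purely illustrative; only the metric name comes from this change.

```yaml
# Illustrative alerting rule; only the metric name is taken from this PR.
groups:
- name: kube-proxy.rules            # illustrative group name
  rules:
  - alert: KubeProxyIPTablesPartialRestoreFailures   # illustrative alert name
    # Fires if any partial restore has ever failed and fallen back to a full restore.
    expr: sum(kubeproxy_sync_proxy_rules_iptables_partial_restore_failures_total) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: kube-proxy partial iptables restores are failing; most likely a bug.
```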

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 7, 2022
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 7, 2022
aojea (Member, Author) commented Nov 7, 2022

/hold

I need help with the unit test; I don't know if the query is wrong or what is happening, but the executor in the test doesn't get any samples.
However, I can run sum(kubeproxy_sync_proxy_rules_iptables_partial_restore_failures_total) in a local cluster to obtain the total for that metric. It is also documented in the Prometheus documentation.

/assign @wojtek-t @marseel
/cc @danwinship

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2022
@aojea aojea force-pushed the kproxy_iptables_fail branch from 7ccdf66 to 9fe2360 Compare November 7, 2022 00:59
marseel (Member) commented Nov 7, 2022

Could you take a look, please?
/assign @Argh4k

k8s-ci-robot (Contributor)

@marseel: GitHub didn't allow me to assign the following users: Argh4k.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

> Could you take a look, please?
> /assign @Argh4k

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea aojea force-pushed the kproxy_iptables_fail branch 2 times, most recently from 39138a5 to e17d533 Compare November 7, 2022 15:34
danwinship commented:
I don't know Prometheus well, but I'll just point out that you also need to set the MinimizeIPTablesRestore feature gate when starting kube-proxy, or the metric is guaranteed to be 0 anyway. (The metric itself is not feature-gated, so it will exist; it's just that it would never be non-0.)
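
For reference, a minimal sketch of enabling that gate through the kube-proxy component config; this snippet is illustrative and not part of this PR (how a given job wires the config through is an assumption).

```yaml
# Sketch: enable the gate via kube-proxy component configuration.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  MinimizeIPTablesRestore: true
# Equivalently, kube-proxy can be started with --feature-gates=MinimizeIPTablesRestore=true.
```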

aojea (Member, Author) commented Nov 7, 2022

> I don't know Prometheus well, but I'll just point out that you also need to set the MinimizeIPTablesRestore feature gate when starting kube-proxy, or the metric is guaranteed to be 0 anyway. (The metric itself is not feature-gated, so it will exist; it's just that it would never be non-0.)

/hold

Ah, good point. How can I make this test run only when this feature gate is set?

@aojea aojea force-pushed the kproxy_iptables_fail branch from e17d533 to 7ab5384 Compare November 8, 2022 12:08
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 8, 2022
aojea (Member, Author) commented Nov 8, 2022

/hold cancel

PTAL

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2022
danwinship commented:
> Ah, good point. How can I make this test run only when this feature gate is set?

Well, I was more concerned in the other direction; you need to set the feature gate in the test run before this test is even useful.

But anyway, if the feature gate is disabled, then the metric is guaranteed to be 0, so the test will always pass; it doesn't matter that much if you just always run it...

aojea (Member, Author) commented Nov 8, 2022

> Well, I was more concerned in the other direction; you need to set the feature gate in the test run before this test is even useful.

I wasn't thinking that far ahead; I just wanted to have a test that only fails when it should ... now that we have that, we have to figure out how to set the feature gate on the jobs 😄

mborsz (Member) commented Nov 8, 2022

> Well, I was more concerned in the other direction; you need to set the feature gate in the test run before this test is even useful.

> I wasn't thinking that far ahead; I just wanted to have a test that only fails when it should ... now that we have that, we have to figure out how to set the feature gate on the jobs 😄

If you want a one-off run, then an approach like the one in kubernetes/kubernetes#113695 is probably preferred -- you open a PR with an artificial feature-gate change and run the presubmit there.

If you want to run periodics, you can set the KUBE_FEATURE_GATES variable for CI jobs: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presets.yaml#L133
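
A minimal sketch of what such a preset entry could look like in the test-infra job config; the preset label below is hypothetical, and only the KUBE_FEATURE_GATES variable and the gate name come from this thread.

```yaml
# Hypothetical test-infra preset; jobs opting in via the label get the env var.
presets:
- labels:
    preset-kube-proxy-minimize-iptables-restore: "true"   # hypothetical label
  env:
  - name: KUBE_FEATURE_GATES
    value: "MinimizeIPTablesRestore=true"
```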

aojea (Member, Author) commented Nov 8, 2022

> If you want to run periodics, you can set the KUBE_FEATURE_GATES variable for CI jobs: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presets.yaml#L133

I think that in the KEP we kind of agreed to run periodics, so we can have enough signal for graduation.

aojea (Member, Author) commented Nov 8, 2022

I think that this will make it: kubernetes/test-infra#27927

aojea (Member, Author) commented Nov 8, 2022

Is the error related? I can't see how.

F1108 15:01:44.087597   29265 clusterloader.go:303] Cluster verification error: no schedulable nodes in the cluster
2022/11/08 15:01:44 process.go:155: Step '/home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --nodes=100 --prometheus-scrape-node-exporter --provider=kubemark --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/huge-service/config.yaml --testconfig=testing/access-tokens/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/use_simple_latency_query.yaml --testoverrides=./testing/overrides/kubemark_load_throughput.yaml' finished in 3m33.769282639s
2022/11/08 15:01:44 e2e.go:776: Dumping logs for kubemark master to GCS directly at path: gs://sig-scalability-logs/pull-perf-tests-clusterloader2-kubemark/1589991412294750208
2022/11/08 15:01:44 process.go:153: Running: /workspace/log-dump.sh /logs/artifacts gs://sig-scalability-logs/pull-perf-tests-clusterloader2-kubemark/1589991412294750208
Checking for custom logdump instances, if any
Using gce provider, skipping check for LOG_DUMP_SSH_KEY and LOG_DUMP_SSH_USER
Project: k8s-presubmit-scale-6
Network Project: k8s-presubmit-scale-6
Zone: us-east1-b
Dumping logs temporarily to '/tmp/tmp.4xBe9oJ531/logs'. Will upload to 'gs://sig-scalability-logs/pull-perf-tests-clusterloader2-kubemark/1589991412294750208' later.
Dumping logs from master locally to '/tmp/tmp.4xBe9oJ531/logs'
Changing logfiles to be world-readable for download
Copying 'kube-apiserver.log kube-apiserver-audit.log kube-scheduler.log kube-controller-manager.log etcd.log etcd-events.log glbc.log cluster-autoscaler.log kube-addon-manager.log konnectivity-server.log fluentd.log kubelet.cov startupscript.log' from e2e-2182-0bf8a-kubemark-master
Specify --start=74404 in the next get-serial-port-output invocation to get only the new output starting from here.
scp: /var/log/glbc.log*: No such file or directory
scp: /var/log/cluster-autoscaler.log*: No such file or directory
scp: /var/log/konnectivity-server.log*: No such file or directory
scp: /var/log/fluentd.log*: No such file or directory
scp: /var/log/kubelet.cov*: No such file or directory
scp: /var/log/startupscript.log*: No such file or directory
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].

aojea (Member, Author) commented Nov 18, 2022

/retest

aojea (Member, Author) commented Nov 18, 2022

@mborsz can you review again?

mborsz (Member) commented Nov 18, 2022

/lgtm
/approve

Thanks, looks good!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2022
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, mborsz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot merged commit d677b5c into kubernetes:master Nov 18, 2022