
Swap temporary IPSets during ipset restore #1068

Merged: 4 commits into cloudnativelabs:master on Apr 20, 2021

Conversation

@bnu0 (Contributor) commented Apr 15, 2021

The problem.

There is a bit of a bug in kube-router which only seems to exist since 1.2.x, when the implementation was changed to use iptables-restore and friends. Essentially, the generated IPSets are observably empty or incomplete for a brief moment between the flush and the completion of all the adds provided to ipset restore. This causes inbound or outbound connection failures for ingress or egress NetworkPolicies, respectively.
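
For illustration, the current restore payload for a set looks roughly like this (set name and addresses made up), so the set is observably empty between the flush and the last add:

flush KUBE-XYZ-EXAMPLE
add KUBE-XYZ-EXAMPLE 110.120.130.1 timeout 0
add KUBE-XYZ-EXAMPLE 110.120.130.2 timeout 0
add KUBE-XYZ-EXAMPLE 110.120.130.3 timeout 0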

After some discussion in Slack, here's a PR.

A Solution?

This PR uses an approach discussed on the netfilter mailing list to avoid this problem by creating new sets and swapping them into place.

  1. A temporary set is created with the same type as the target set. Its name is generated deterministically, base32(sha256(...)) of the target's name (a Go sketch of this naming follows the example below).
  2. The temporary set is populated with the desired contents of the target set.
  3. The two sets are atomically swapped.
  4. The temporary set (whose name now refers to the old target set's contents) is destroyed.

An example:

create TMP-5FIQLNFPVGLJPQU5 hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.1 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.2 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.3 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.4 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.5 timeout 0
create KUBE-XYZ-EXAMPLE hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
swap TMP-5FIQLNFPVGLJPQU5 KUBE-XYZ-EXAMPLE
destroy TMP-5FIQLNFPVGLJPQU5
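
For reference, a minimal Go sketch of how such a deterministic temporary name could be derived; the helper name buildTmpSetName is illustrative, not kube-router's actual API:

package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
)

// buildTmpSetName derives a temporary set name from the target set's name.
// base32 keeps the result within ipset's allowed characters, and truncating
// keeps the name comfortably under ipset's name-length limit.
func buildTmpSetName(target string) string {
	sum := sha256.Sum256([]byte(target))
	enc := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(sum[:])
	return "TMP-" + enc[:16]
}

func main() {
	fmt.Println(buildTmpSetName("KUBE-XYZ-EXAMPLE"))
}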

Example of the failure

If you have pods covered by a type=ingress NetworkPolicy, for example, then while kube-router is synchronizing there is a small window in which clients get an erroneous connection refused error. At that moment their traffic:

  • enters the POD-FW chain,
  • jumps into the NWPLCY chain,
  • is left unmarked because the client IP is missing from the IPset due to this race, and as a result is
  • logged to nflog:100 and
  • -j REJECT-ed.

Reproduction

You can reproduce this fairly easily by launching a bunch of short-lived pods that have a NetworkPolicy acting on them, while simultaneously making network requests in a loop to some long-lived pod that also has a NetworkPolicy (the same or a different one) acting on it. The short-lived pods exist only to repeatedly trigger the synchronization loop in kube-router, which momentarily removes all members from the long-lived pod's ipsets. Your request loop will sometimes catch it in this state and get a connection refused error. If you are monitoring nflog:100, you will see it there too.
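
A minimal Go sketch of the client-side request loop; the target URL is a placeholder for the long-lived pod's address:

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder address for the long-lived pod behind a NetworkPolicy.
	target := "http://10.96.0.42:8080/"
	for {
		resp, err := http.Get(target)
		if err != nil {
			// During the resync window this surfaces as "connection refused".
			log.Printf("request failed: %v", err)
		} else {
			resp.Body.Close()
		}
		time.Sleep(50 * time.Millisecond)
	}
}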

@aauren (Collaborator) left a comment:
One small nit, otherwise everything looks good here. I'll try it out in one of our clusters today.

(Resolved review thread on pkg/utils/ipset.go, now outdated)
@murali-reddy (Member) commented:

LGTM as well. Aaron, since you plan to test, I will skip the testing part.

Thanks @bnu0 for the PR.

@aauren (Collaborator) commented Apr 15, 2021

Tested this on one of our nodes in a dev environment. With ~450 ipsets, this takes the execution of ipset restore from ~1 second to ~47 seconds, which pushes NetworkPolicyController sync time back up into the range we had with kube-router 1.1.x.

I agree with the problem described in this PR and we definitely need to find a way to fix it, but I think this specific solution introduces too much of a performance regression. Essentially, it triples the number of ipset operations, which the kernel doesn't seem to handle well.

Metrics for the test (kube_router_controller_iptables_sync_total_time / kube_router_controller_iptables_sync_total_count):

[screenshot: sync time metrics]

@aauren (Collaborator) commented Apr 15, 2021

Further troubleshooting shows that the increase comes entirely from the much larger number of destroy operations in this new method of swapping.

If you remove, or at least greatly reduce, the destroy operations, the ipset restore time comes back down to previous levels.

I would recommend that we create only one TMP ipset per ipset type, re-use it throughout the restore, and destroy it only at the end.
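
A minimal sketch of that reuse idea, assuming the restore payload is built as one string. Names such as ipSetRestore, tmpSets, and setOptions mirror the fragment quoted in the review below, but the function itself is illustrative, not kube-router's actual code:

package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

// tmpNameFor derives a shared temporary set name from an ipset options string,
// following the base32(sha256(...)) naming described in this PR.
func tmpNameFor(setOptions string) string {
	sum := sha256.Sum256([]byte(setOptions))
	return "TMP-" + base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(sum[:])[:16]
}

// buildRestore emits an ipset-restore payload that creates one temporary set
// per distinct set type, reuses it for every target set of that type, and
// destroys the temporaries only once at the end.
func buildRestore(options map[string]string, entries map[string][]string) string {
	var ipSetRestore strings.Builder
	tmpSets := map[string]string{} // set options -> shared temporary set name
	for name, setOptions := range options {
		tmpSetName, ok := tmpSets[setOptions]
		if !ok {
			tmpSetName = tmpNameFor(setOptions)
			ipSetRestore.WriteString(fmt.Sprintf("create %s %s\n", tmpSetName, setOptions))
			tmpSets[setOptions] = tmpSetName
		}
		// Flush the shared temporary set before refilling it for this target.
		ipSetRestore.WriteString(fmt.Sprintf("flush %s\n", tmpSetName))
		for _, entry := range entries[name] {
			ipSetRestore.WriteString(fmt.Sprintf("add %s %s\n", tmpSetName, entry))
		}
		ipSetRestore.WriteString(fmt.Sprintf("create %s %s\n", name, setOptions))
		ipSetRestore.WriteString(fmt.Sprintf("swap %s %s\n", tmpSetName, name))
	}
	for _, tmpSetName := range tmpSets { // one destroy per set type, at the end
		ipSetRestore.WriteString(fmt.Sprintf("destroy %s\n", tmpSetName))
	}
	return ipSetRestore.String()
}

func main() {
	opts := "hash:ip family inet hashsize 1024 maxelem 65536 timeout 0"
	fmt.Print(buildRestore(
		map[string]string{"KUBE-XYZ-EXAMPLE": opts},
		map[string][]string{"KUBE-XYZ-EXAMPLE": {"110.120.130.1 timeout 0"}},
	))
}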

@bnu0 (Contributor, Author) commented Apr 16, 2021

I deployed this updated commit to one of our clusters:

[screenshot: iptables sync time metrics after the change]

With reuse of tmp ipsets, iptables sync time dropped from ~450ms to ~235ms at p90.

For comparison, the same metric on v1.1.x:

[screenshot: iptables sync time metrics on v1.1.x]

Inline review comment on the following snippet:

ipSetRestore.WriteString(fmt.Sprintf("create %s %s\n", tmpSetName, setOptions))
tmpSets[setOptions] = tmpSetName
}

@murali-reddy (Member):

Can we have a flush here as well, or do it here as an else branch if the set already exists? Just to ensure we start clean, for example to avoid cases like kube-router starting after a crash, which may have left stale entries.

@bnu0 (Contributor, Author):

Sure, makes sense.

@bnu0 (Contributor, Author):

Fixed in e731d80.

@murali-reddy (Member) commented:

Thanks @bnu0 for the PR and prompt follow-up. LGTM.

@murali-reddy merged commit 18d2a3b into cloudnativelabs:master on Apr 20, 2021
aauren pushed a commit that referenced this pull request Apr 20, 2021
* ipset restore: use temporary sets and swap them into the real ones

* move const

* switch to shared tmp ipsets

* preemptively flush tmp set in case it already existed
@aauren (Collaborator) commented Apr 20, 2021

@bnu0 Thanks for the help!

I double-checked this version again after your patch and the performance improvement holds.

I back-ported your fix to the v1.2 branch and this is now part of the bug fix release v1.2.2 that was just made.

bnu0 pushed a commit to bnu0/kube-router that referenced this pull request May 4, 2021
aauren pushed a commit that referenced this pull request May 13, 2021
aauren pushed a commit that referenced this pull request Jun 1, 2021