
Swap temporary IPSets during ipset restore #1068

Merged: 4 commits into cloudnativelabs:master on Apr 20, 2021

Conversation

@bnu0 (Contributor) commented Apr 15, 2021

The problem.

There is a bit of a bug in kube-router which only seems to exist since 1.2.x, when the implementation was changed to use iptables-restore and friends. Essentially, the generated IPSets are observably empty or incomplete for a brief moment between the flush and the completion of all the adds provided to ipset restore. This causes inbound or outbound connection failures for ingress or egress NetworkPolicies, respectively.
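
For illustration, the current restore payload for a set looks roughly like this (set name and addresses made up), so the set is observably empty between the flush and the last add:

flush KUBE-XYZ-EXAMPLE
add KUBE-XYZ-EXAMPLE 110.120.130.1 timeout 0
add KUBE-XYZ-EXAMPLE 110.120.130.2 timeout 0
add KUBE-XYZ-EXAMPLE 110.120.130.3 timeout 0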

After some discussion in Slack, here's a PR.

A Solution?

This PR uses an approach discussed on the netfilter mailing list to avoid this problem by creating new sets and swapping them into place.

  1. A temporary set is created with the same type as the target set. Its name is generated deterministically, base32(sha256(...)) of the target's name (a Go sketch of this naming follows the example below).
  2. The temporary set is populated with the desired contents of the target set.
  3. The two sets are atomically swapped.
  4. The temporary set (whose name now refers to the old target set's contents) is destroyed.

An example:

create TMP-5FIQLNFPVGLJPQU5 hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.1 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.2 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.3 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.4 timeout 0
add TMP-5FIQLNFPVGLJPQU5 110.120.130.5 timeout 0
create KUBE-XYZ-EXAMPLE hash:ip family inet hashsize 1024 maxelem 65536 timeout 0
swap TMP-5FIQLNFPVGLJPQU5 KUBE-XYZ-EXAMPLE
destroy TMP-5FIQLNFPVGLJPQU5
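
For reference, a minimal Go sketch of how such a deterministic temporary name could be derived; the helper name buildTmpSetName is illustrative, not kube-router's actual API:

package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
)

// buildTmpSetName derives a temporary set name from the target set's name.
// base32 keeps the result within ipset's allowed characters, and truncating
// keeps the name comfortably under ipset's name-length limit.
func buildTmpSetName(target string) string {
	sum := sha256.Sum256([]byte(target))
	enc := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(sum[:])
	return "TMP-" + enc[:16]
}

func main() {
	fmt.Println(buildTmpSetName("KUBE-XYZ-EXAMPLE"))
}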

Example of the failure

If you have pods covered by a type=ingress NetworkPolicy, for example, then while kube-router is synchronizing there is a small window in which clients get an erroneous connection refused error. At that moment their traffic:

  • enters the POD-FW chain,
  • jumps into the NWPLCY chain,
  • is left unmarked because the client IP is missing from the IPset due to this race, and as a result is
  • logged to nflog:100 and
  • -j REJECT-ed.

Reproduction

You can reproduce this fairly easily by launching a bunch of short-lived pods that have a NetworkPolicy acting on them, while simultaneously making network requests in a loop to some long-lived pod that also has a NetworkPolicy (the same or a different one) acting on it. The short-lived pods exist only to repeatedly trigger the synchronization loop in kube-router, which momentarily removes all members from the long-lived pod's ipsets. Your request loop will sometimes catch it in this state and get a connection refused error. If you are monitoring nflog:100, you will see it there too.
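
A minimal Go sketch of the client-side request loop; the target URL is a placeholder for the long-lived pod's address:

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder address for the long-lived pod behind a NetworkPolicy.
	target := "http://10.96.0.42:8080/"
	for {
		resp, err := http.Get(target)
		if err != nil {
			// During the resync window this surfaces as "connection refused".
			log.Printf("request failed: %v", err)
		} else {
			resp.Body.Close()
		}
		time.Sleep(50 * time.Millisecond)
	}
}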

@aauren (Collaborator) left a comment:
One small nit, otherwise everything looks good here. I'll try it out in one of our clusters today.

(Resolved review thread on pkg/utils/ipset.go, now outdated)
@murali-reddy (Member) commented:

LGTM as well. Aaron, since you plan to test, I will skip the testing part.

Thanks @bnu0 for the PR.

@aauren (Collaborator) commented Apr 15, 2021

Tested this on one of our nodes in a dev environment. With ~450 ipsets, this takes the execution of ipset restore from ~1 second to ~47 seconds, which pushes NetworkPolicyController sync time back up into the range we had with kube-router 1.1.x.

I agree with the problem described in this PR and we definitely need to find a way to fix it, but I think this specific solution introduces too much of a performance regression. Essentially, it triples the number of ipset operations, which the kernel doesn't seem to handle well.

Metrics for the test (kube_router_controller_iptables_sync_total_time / kube_router_controller_iptables_sync_total_count):

[screenshot: sync time metrics]

@aauren (Collaborator) commented Apr 15, 2021

Further troubleshooting shows that the increase comes entirely from the much larger number of destroy operations in this new method of swapping.

If you remove, or at least greatly reduce, the destroy operations, the ipset restore time comes back down to previous levels.

I would recommend that we create only one TMP ipset per ipset type, re-use it throughout the restore, and destroy it only at the end.
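
A minimal sketch of that reuse idea, assuming the restore payload is built as one string. Names such as ipSetRestore, tmpSets, and setOptions mirror the fragment quoted in the review below, but the function itself is illustrative, not kube-router's actual code:

package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

// tmpNameFor derives a shared temporary set name from an ipset options string,
// following the base32(sha256(...)) naming described in this PR.
func tmpNameFor(setOptions string) string {
	sum := sha256.Sum256([]byte(setOptions))
	return "TMP-" + base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(sum[:])[:16]
}

// buildRestore emits an ipset-restore payload that creates one temporary set
// per distinct set type, reuses it for every target set of that type, and
// destroys the temporaries only once at the end.
func buildRestore(options map[string]string, entries map[string][]string) string {
	var ipSetRestore strings.Builder
	tmpSets := map[string]string{} // set options -> shared temporary set name
	for name, setOptions := range options {
		tmpSetName, ok := tmpSets[setOptions]
		if !ok {
			tmpSetName = tmpNameFor(setOptions)
			ipSetRestore.WriteString(fmt.Sprintf("create %s %s\n", tmpSetName, setOptions))
			tmpSets[setOptions] = tmpSetName
		}
		// Flush the shared temporary set before refilling it for this target.
		ipSetRestore.WriteString(fmt.Sprintf("flush %s\n", tmpSetName))
		for _, entry := range entries[name] {
			ipSetRestore.WriteString(fmt.Sprintf("add %s %s\n", tmpSetName, entry))
		}
		ipSetRestore.WriteString(fmt.Sprintf("create %s %s\n", name, setOptions))
		ipSetRestore.WriteString(fmt.Sprintf("swap %s %s\n", tmpSetName, name))
	}
	for _, tmpSetName := range tmpSets { // one destroy per set type, at the end
		ipSetRestore.WriteString(fmt.Sprintf("destroy %s\n", tmpSetName))
	}
	return ipSetRestore.String()
}

func main() {
	opts := "hash:ip family inet hashsize 1024 maxelem 65536 timeout 0"
	fmt.Print(buildRestore(
		map[string]string{"KUBE-XYZ-EXAMPLE": opts},
		map[string][]string{"KUBE-XYZ-EXAMPLE": {"110.120.130.1 timeout 0"}},
	))
}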

@bnu0 (Contributor, Author) commented Apr 16, 2021

I deployed this updated commit to one of our clusters:

[screenshot: iptables sync time metrics after the change]

With reuse of tmp ipsets, iptables sync time dropped from ~450ms to ~235ms at p90.

For comparison, the same metric on v1.1.x:

[screenshot: iptables sync time metrics on v1.1.x]

Inline review comment on the following snippet:

ipSetRestore.WriteString(fmt.Sprintf("create %s %s\n", tmpSetName, setOptions))
tmpSets[setOptions] = tmpSetName
}

@murali-reddy (Member):

Can we have a flush here as well, or do it here as an else branch if the set already exists? Just to ensure we start clean, for example to avoid cases like kube-router starting after a crash, which may have left stale entries.

@bnu0 (Contributor, Author):

Sure, makes sense.

@bnu0 (Contributor, Author):

Fixed in e731d80.

@murali-reddy (Member) commented:

Thanks @bnu0 for the PR and prompt follow-up. LGTM.

@murali-reddy merged commit 18d2a3b into cloudnativelabs:master on Apr 20, 2021
aauren pushed a commit that referenced this pull request Apr 20, 2021
* ipset restore: use temporary sets and swap them into the real ones

* move const

* switch to shared tmp ipsets

* preemptively flush tmp set in case it already existed
@aauren (Collaborator) commented Apr 20, 2021

@bnu0 Thanks for the help!

I double-checked this version again after your patch and the performance improvement holds.

I back-ported your fix to the v1.2 branch and this is now part of the bug fix release v1.2.2 that was just made.

bnu0 pushed a commit to bnu0/kube-router that referenced this pull request May 4, 2021
aauren pushed a commit that referenced this pull request May 13, 2021
aauren pushed a commit that referenced this pull request Jun 1, 2021