Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic high availability for auto egress IPs #19578

Merged

Conversation

danwinship
Copy link
Contributor

OK, this changes the semantics of automatic egress IPs so that:

  • You are now allowed to specify multiple IPs in the EgressIPs field of a NetNamespace, and if so, the first one in the list that is currently hosted on a node will be used.
    • However, if any of the IPs in the list are duplicates of IPs used by another namespace, then the namespace will be considered broken (and egress traffic will be dropped). Previously this rule only applied if the first IP was a duplicate, since further IPs were simply ignored.
  • If a namespace specifies multiple egress IPs, and the node hosting its current egress IP stops responding, then nodes will automatically switch to using the namespace's next available egress IP. (Each node does this independently, when it notices that packets are being sent but not responded to.)

I tried to do this without changing the existing egressip code, but the existing code was based on the idea of processing each egress IP independently, and that doesn't work very well when changes to one egress IP can affect whether another egress IP is used. So I had to do some reorganization.

Some things to fix later (either before this gets merged, or in followup PRs, or in future patch releases):

  • Make the poll times configurable
  • Be more proactive about noticing bad egress nodes, rather than only noticing them after someone has tried to use them. (IOW, ping all the nodes all the time.)
  • Don't do liveness checking of nodes if there are no HA-able egress IPs on them. (IOW, don't ping all the nodes all the time.) Eg, in existing clusters where every namespace has only a single EgressIP, there's no need to do liveness checking, since there are no backup nodes to redirect namespaces to if their primary egress nodes fail.

@openshift/sig-networking PTAL

@danwinship danwinship added component/networking sig/networking do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/feature Categorizes issue or PR as related to a new feature. labels May 1, 2018
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 1, 2018
@knobunc knobunc self-assigned this May 2, 2018
@imcsk8
Copy link
Contributor

imcsk8 commented May 3, 2018

LGTM

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift/networking PTAL

}

const (
defaultPollInterval = 5 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we doc these constants please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the egressVXLANMonitor docs to clarify how the constants get used

// The fact that we (normally) use pod-to-egress traffic to do the monitoring rather than
// actively pinging the nodes means that if an egress node falls over while no one is
// using an egress IP on it, we won't notice the problem until someone does try to use it.
// So, eg, the first pod created in a namespace might spend 5 seconds trying to talk to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's (max) 7 seconds... right? Or am I missing something.

Every pollInterval (5s) we check the traffic to each node. If there's a mismatch we go into retry mode which re-checks every 1s and call it failed after 2 retries. So we will call it bad in best case 2s, worst case 7s, and average 4.5s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Changed it from "5" to "several"

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 22, 2018
nodeIP: nodeIP,
sdnIP: sdnIP,
}
if len(evm.monitorNodes) == 1 && evm.pollInterval != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth turning off polling in RemoveNode()? Maybe not, just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... I was doing it from poll() (it returned true if evm.monitorNodes was empty, causing the PollInfinite() to exit). But that creates a race condition if you delete the last node, then add a new node before poll() runs. So I redid it with a stop channel.

Copy link
Contributor

@dcbw dcbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 23, 2018
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2018
If a namespace has multiple egress IPs, monitor egress traffic and
switch to an alternate egress IP if the currently-selected one appears
dead.
Most dump-flows calls are part of health checks and don't normally
need to be logged about unless they fail.
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2018
@danwinship
Copy link
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 27, 2018
@danwinship
Copy link
Contributor Author

/retest

@danwinship
Copy link
Contributor Author

@dcbw @knobunc needs re-lgtm after having been rebased around #19603 and #19926

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2018
Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dcbw, knobunc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@danwinship
Copy link
Contributor Author

/retest

1 similar comment
@danwinship
Copy link
Contributor Author

/retest

@danwinship
Copy link
Contributor Author

/retest

2 similar comments
@danwinship
Copy link
Contributor Author

/retest

@danwinship
Copy link
Contributor Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit d0eff76 into openshift:master Jul 1, 2018
@danwinship danwinship deleted the auto-egress-ip-ha branch July 20, 2018 13:07
for _, ip := range eg.namespaces[0].requestedIPs {
eg2 := eip.egressIPs[ip]
if eg2 != eg && len(eg2.nodes) == 1 && eg2.nodes[0] == eg.nodes[0] {
return false, fmt.Errorf("Multiple EgressIPs (%s, %s) for VNID %d on node %s", eg.ip, eg2.ip, eg.namespaces[0].vnid, eg.nodes[0].nodeIP)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship What is the reason to make the egressIP unavailable when all the egressIPs which are claimed by namespace are held by the same node? IMO, this condition will break the HA, but should be harmless if only let the first element work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmeng If you set multiple egress IPs on a NetNamespace, then you are expecting to get high availability. If the IPs are both on the same node though, then you aren't actually getting any HA. So, as in other cases, we break things to make the admin notice the misconfiguration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/networking kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. sig/networking size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants