RIP routes marked inactive and not being replaced #5174

Open
seanfulton opened this issue Oct 16, 2019 · 9 comments
Labels
rip triage Needs further investigation

Comments

@seanfulton

seanfulton commented Oct 16, 2019

We are using the FRR RPM frr-7.0-01.el6.x86_64 on CentOS 6. We ran Quagga with no problems until about a month ago, when we upgraded to FRR. Since then we've noticed that machines randomly lose their default route. When I examine the routing table, the default route is shown as a RIP route but marked inactive.

This seems similar to: #4535

About our network: We have two border routers running zebra. Each gets a default route via BGP and advertises it to the network using RIP. We have a static IP (#.#.#.254) that floats from router to router that non-RIP devices can use as a default GW.

When the hang occurs, I see this:

R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 06:53:00
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.3.0/24 [120/2] via 10.10.2.34, bond1, 00:10:24

If I restart FRR, it immediately picks up a new default via RIP from 10.10.1.1 or 10.10.2.1, depending.

So my theory is that something causes the .254 address to flip over from, say, router A to router B.

My feeling is that if the route via this .254 address becomes inactive, it should be flushed from the routing table and a new default learned via RIP from either 10.10.1.1 or 10.10.1.2. Instead, the old route hangs.

Any idea why?

ripd.conf:

log file /var/log/zebra.log
!debug rip events
!debug rip zebra
!debug rip packet

!
interface bond0
 ip rip split-horizon
 no ip rip authentication mode
!
interface bond1
 ip rip split-horizon
 no ip rip authentication mode
!
router rip
 version 2
 timers basic 15 30 30
 redistribute kernel
 no redistribute connected
 no redistribute static
 network 74.201.36.0/22
 network 74.201.40.0/22
 network 172.81.88.0/22
 network 10.0.0.0/8


line vty

zebra.conf:

!
interface bond0
 ip address 10.10.1.25/24
 description "Primary LAN"
 link-detect
! ipv6 nd suppress-ra
!
interface bond1
 ip address 10.10.2.25/24
 description "Backup LAN"
 link-detect
! ipv6 nd suppress-ra
!
interface lo
!

ip forwarding

line vty
@seanfulton seanfulton added the triage Needs further investigation label Oct 16, 2019
@seanfulton
Author

More info: I found that this 0.0.0.0 -> 10.10.1.254 route is not coming from the router but from three of our Ubuntu nodes (running FRR 7.1):
nj34.onecount.net> sh ip ro
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route

K>* 0.0.0.0/0 [0/0] via 10.10.1.254, primary-lan, 02w2d00h
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, backup-lan, 20:50:45
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, backup-lan, 20:50:45
C>* 10.10.1.0/24 is directly connected, primary-lan, 02w2d00h
R>* 10.10.1.254/32 [120/2] via 10.10.2.1, backup-lan, 01w3d14h
C>* 10.10.2.0/24 is directly connected, backup-lan, 02w2d00h
R>* 10.10.2.254/32 [120/2] via 10.10.1.1, primary-lan, 02:14:06
R>* 10.10.4.1/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R>* 10.10.4.2/32 [120/2] via 10.10.1.27, primary-lan, 02:29:39
R>* 10.10.4.3/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R>* 10.10.4.4/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R>* 10.10.4.5/32 [120/2] via 10.10.1.27, primary-lan, 02:29:39
R>* 10.10.4.7/32 [120/2] via 10.10.1.25, primary-lan, 10:23:44
R>* 10.10.4.8/32 [120/2] via 10.10.1.25, primary-lan, 10:23:44
R>* 10.10.4.9/32 [120/2] via 10.10.1.19, primary-lan, 00:43:00
R>* 10.10.4.11/32 [120/2] via 10.10.1.31, primary-lan, 1d06h33m

This comes from netplan (default routes added for each LAN segment).

So to sum up: machine 25 is learning a default route via 10.10.1.254 from machine 34 over RIP. It is also offered a default from 10.10.1.1 and 10.10.1.2, which get theirs from BGP. Something is happening (I guess to machine 34 now) that makes the route inactive. So why isn't RIP timing that route out and picking up the default from one of the two routers?
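
For reference, the expectation in this question can be written down as a small model of the three RIP timers set by `timers basic 15 30 30` (update, timeout, garbage collection). This is only a sketch of the timer semantics as described in RFC 2453, not ripd's actual code; the names `RipTimers` and `expected_state` are made up for illustration. Note two assumptions: the timers only run while the advertising neighbor stays silent, so a route that is still being re-advertised every 15 seconds never times out, and the age column in `sh ip ro` may not be the time since the last RIP refresh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RipTimers:
    update: int   # seconds between periodic updates
    timeout: int  # seconds without a refresh before the route is marked unreachable
    garbage: int  # further seconds before the route is deleted


def expected_state(seconds_since_refresh: int, t: RipTimers) -> str:
    """Return the state RFC 2453 timer semantics would give a route of this age."""
    if seconds_since_refresh < t.timeout:
        return "valid"
    if seconds_since_refresh < t.timeout + t.garbage:
        return "timed out (metric 16, awaiting garbage collection)"
    return "removed"


# Values taken from `timers basic 15 30 30` in the ripd.conf above.
timers = RipTimers(update=15, timeout=30, garbage=30)

for age in (10, 45, 90):
    print(age, expected_state(age, timers))
```

Under this model, a route whose advertiser has gone quiet should be gone after 60 seconds, which is why a default route that stays inactive for hours looks wrong.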

@seanfulton
Author

I took the default routes out of netplan.yaml on nj34 and ran netplan apply. The kernel routes above stayed in the routing table, so I deleted both with ip route del 0.0.0.0/0.

I then ran netstat -nr | grep 0.0.0.0 several times and watched the default route get acquired from different machines in my network, until it stopped and there was no default route at all. Curious, I logged into zebra, ran sh ip ro, and got the following:
nj34.onecount.net> sh ip ro
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route

R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 00:00:15
R>* 10.0.0.9/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
R>* 10.0.0.10/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
C>* 10.10.1.0/24 is directly connected, primary-lan, 00:05:10
R>* 10.10.1.254/32 [120/2] via 10.10.2.1, backup-lan, 00:05:09
C>* 10.10.2.0/24 is directly connected, backup-lan, 00:05:10
R>* 10.10.2.254/32 [120/2] via 10.10.1.1, primary-lan, 00:04:58
R>* 10.10.4.1/32 [120/2] via 10.10.2.26, backup-lan, 00:05:09
R>* 10.10.4.2/32 [120/2] via 10.10.1.27, primary-lan, 00:05:09
R>* 10.10.4.3/32 [120/2] via 10.10.2.26, backup-lan, 00:05:09

So even after I deleted the route manually, it is being held (long past all timers). I finally restarted frr and it picked up the default from one of the routers.

Very odd behavior.
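
Since the symptom only shows up in the routing table, one way to catch it early is to scan the `sh ip ro` output for routes flagged inactive. The following is a minimal sketch that assumes the output format shown above; the function name `find_inactive` and the regular expression are illustrative and not part of FRR.

```python
import re
from typing import List, Tuple

# Matches route lines such as "R>* 10.0.0.9/32 [120/2] via 10.10.1.2, ..."
# The legend lines ("Codes: ...", "> - selected route") do not match.
ROUTE_RE = re.compile(
    r"^(?P<code>[A-Za-z])(?P<flags>[>*]*)\s+"
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+"
    r"(?P<rest>.*)$"
)


def find_inactive(show_ip_route: str) -> List[Tuple[str, str, str]]:
    """Return (code, prefix, next hop) for every route line marked inactive."""
    hits = []
    for line in show_ip_route.splitlines():
        m = ROUTE_RE.match(line.strip())
        if not m or "inactive" not in m.group("rest"):
            continue
        via = re.search(r"via (\d+\.\d+\.\d+\.\d+)", m.group("rest"))
        hits.append((m.group("code"), m.group("prefix"), via.group(1) if via else ""))
    return hits


# Three lines copied from the table above.
SAMPLE = """\
R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 00:00:15
R>* 10.0.0.9/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
C>* 10.10.1.0/24 is directly connected, primary-lan, 00:05:10
"""

print(find_inactive(SAMPLE))
```

The text could come from `vtysh -c "show ip route"` run on a timer, so that an inactive default is noticed before traffic is lost.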

@lucize
Contributor

lucize commented Oct 17, 2019

Can you reproduce it?
I think it is a Zebra issue. If you can, try the 4.0.1 branch with the revert described in #5159 (comment). I haven't had time to do the revert and make it work for 5, 6, and 7.

@qlyoung qlyoung added the rip label Oct 22, 2019
@qlyoung
Member

qlyoung commented Oct 22, 2019

@seanfulton can you try to recreate this on a later version of CentOS? We don't really do regression testing for 6 anymore, given that it's more or less EOL at this point.

@seanfulton
Author

I can confirm this is happening on CentOS 7 with FRR 7.2. Exactly the same behavior:

@seanfulton
Author

R 0.0.0.0/0 [120/2] via 10.10.1.254 inactive, 01:30:53
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:25:34
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, bond1, 00:25:34
R>* 10.0.3.0/24 [120/2] via 10.10.2.34, bond1, 04:08:13
C>* 10.10.1.0/24 is directly connected, bond0, 04:48:08
R>* 10.10.1.254/32 [120/2] via 10.10.2.1, bond1, 04:47:55
C>* 10.10.2.0/24 is directly connected, bond1, 04:48:08
R>* 10.10.2.254/32 [120/2] via 10.10.1.1, bond0, 00:05:22
R>* 10.10.4.1/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R>* 10.10.4.2/32 [120/2] via 10.10.1.27, bond0, 04:48:06
R>* 10.10.4.3/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R>* 10.10.4.4/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R>* 10.10.4.5/32 [120/2] via 10.10.1.27, bond0, 04:48:06
K>* 10.10.4.7/32 [0/0] is directly connected, venet0, 04:48:08
K>* 10.10.4.8/32 [0/0] is directly connected, venet0, 04:48:08
R>* 10.10.4.9/32 [120/2] via 10.10.1.19, bond0, 01:45:23
R>* 10.10.4.11/32 [120/2] via 10.10.1.31, bond0, 04:48:06
R>* 10.10.4.12/32 [120/2] via 10.10.1.4, bond0, 04:48:06
R>* 10.10.4.13/32 [120/2] via 10.10.1.30, bond0, 04:47:55
K>* 10.10.4.14/32 [0/0] is directly connected, venet0, 04:48:08
R>* 10.10.4.15/32 [120/2] via 10.10.1.5, bond0, 04:48:06
R>* 10.10.4.16/32 [120/2] via 10.10.1.6, bond0, 04:48:06
R>* 10.10.4.17/32 [120/2] via 10.10.2.35, bond1, 03:36:30
R>* 10.10.4.20/32 [120/2] via 10.10.1.26, bond0, 04:47:57
K>* 10.10.4.21/32 [0/0] is directly connected, venet0, 04:48:08
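
One detail visible in this table: the inactive default points at 10.10.1.254, and the most specific route covering that address is the RIP host route 10.10.1.254/32 via 10.10.2.1 on bond1, not the connected 10.10.1.0/24 on bond0. The sketch below only reproduces that longest-prefix lookup with Python's `ipaddress` module; the helper `longest_match` is illustrative. Whether zebra resolves the next hop this way, and whether that is what marks the default inactive, is an assumption that nothing in this thread confirms.

```python
import ipaddress

# (prefix, protocol, next hop or None, interface), copied from the table above.
ROUTES = [
    ("10.10.1.0/24", "connected", None, "bond0"),
    ("10.10.1.254/32", "rip", "10.10.2.1", "bond1"),
    ("10.10.2.0/24", "connected", None, "bond1"),
    ("10.10.2.254/32", "rip", "10.10.1.1", "bond0"),
]


def longest_match(addr, routes):
    """Return the most specific route covering addr, or None if nothing matches."""
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix, proto, via, ifname in routes:
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, proto, via, ifname)
    return best


net, proto, via, ifname = longest_match("10.10.1.254", ROUTES)
print(net, proto, via, ifname)
```

Dropping the /32 entries from `ROUTES` makes the same lookup fall back to the connected /24, which is a quick way to compare the two cases.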

@seanfulton
Author

What do you want me to do here? This is becoming very problematic for us. It's happening on CentOS 6, CentOS 7, and Ubuntu 18.04 with the 7.2 versions.

@seanfulton
Author

Hey guys, this is a serious issue. I'm reverting all of our nodes back to Quagga until someone figures this out. Too risky to continue in production with this.
Happy to test anything any time, but this is not getting me where I need to be.
This is still a problem.

sean

@rzalamena
Member

Seems related to #13561
