RIP routes marked inactive and not being replaced #5174

seanfulton · 2019-10-16T20:00:22Z

We are using FRR RPM frr-7.0-01.el6.x86_64 on CentOS 6. We used Quagga until about a month ago with no problems, then upgraded to FRR. Since then we've noticed that machines will randomly lose their default route. When I examine the routing table, I see the default route marked as a RIP route but inactive.

This seems similar to: #4535

About our network: We have two border routers running zebra. Each gets a default route via BGP and advertises it to the network using RIP. We have a static IP (#.#.#.254) that floats from router to router that non-RIP devices can use as a default GW.

When the hang occurs, I see this:

R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 06:53:00
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, bond1, 00:42:50
R>* 10.0.3.0/24 [120/2] via 10.10.2.34, bond1, 00:10:24

If I restart FRR, it immediately picks up a new default via RIP from 10.10.1.1 or 10.10.2.1, depending.

So my theory is that something causes the .254 address to flip over from say router A to router B.

My feeling is that if this .254 address becomes inactive, the route should be flushed from the routing table and a new route learned from RIP via either 10.10.1.1 or 10.10.1.2. Instead, the old route hangs.

Any idea why?

ripd.conf:

log file /var/log/zebra.log
!debug rip events
!debug rip zebra
!debug rip packet

!
interface bond0
ip rip split-horizon
no ip rip authentication mode
!
interface bond1
ip rip split-horizon
no ip rip authentication mode
!


router rip
version 2
timers basic 15 30 30
redistribute kernel 
no redistribute connected
no redistribute static


network 74.201.36.0/22
network 74.201.40.0/22
network 172.81.88.0/22
network 10.0.0.0/8


line vty

zebra.conf:

!
interface bond0
 ip address 10.10.1.25/24
 description "Primary LAN" 
 link-detect
! ipv6 nd suppress-ra
!
interface bond1
 ip address 10.10.2.25/24
 description "Backup LAN" 
 link-detect
! ipv6 nd suppress-ra
!
interface lo
!

ip forwarding

line vty


seanfulton · 2019-10-16T20:27:37Z

More info. I found that this 0.0.0.0 -> 10.10.1.254 route is not coming from the router but from three of our Ubuntu nodes (running FRR 7.1):
nj34.onecount.net> sh ip ro
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route

K>* 0.0.0.0/0 [0/0] via 10.10.1.254, primary-lan, 02w2d00h
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, backup-lan, 20:50:45
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, backup-lan, 20:50:45
C>* 10.10.1.0/24 is directly connected, primary-lan, 02w2d00h
R>* 10.10.1.254/32 [120/2] via 10.10.2.1, backup-lan, 01w3d14h
C>* 10.10.2.0/24 is directly connected, backup-lan, 02w2d00h
R>* 10.10.2.254/32 [120/2] via 10.10.1.1, primary-lan, 02:14:06
R>* 10.10.4.1/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R>* 10.10.4.2/32 [120/2] via 10.10.1.27, primary-lan, 02:29:39
R>* 10.10.4.3/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R>* 10.10.4.4/32 [120/2] via 10.10.1.26, primary-lan, 04:46:44
R>* 10.10.4.5/32 [120/2] via 10.10.1.27, primary-lan, 02:29:39
R>* 10.10.4.7/32 [120/2] via 10.10.1.25, primary-lan, 10:23:44
R>* 10.10.4.8/32 [120/2] via 10.10.1.25, primary-lan, 10:23:44
R>* 10.10.4.9/32 [120/2] via 10.10.1.19, primary-lan, 00:43:00
R>* 10.10.4.11/32 [120/2] via 10.10.1.31, primary-lan, 1d06h33m

This comes from netplan (default routes added for each LAN segment).

So to sum up, machine 25 is getting a default route via 10.10.1.254 from machine 34 via RIP. It is also getting a default from 10.10.1.1 and 10.10.1.2, which learn it from BGP. Something is happening (I guess on machine 34 now) that makes the route inactive, so why isn't RIP timing that route out and picking up the default from one of the two routers?
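For reference, the aging one would expect for a RIP route under "timers basic 15 30 30" (update, timeout, garbage) can be sketched as below. This is a minimal illustration of the generic RIP timeout and garbage-collection timers, not FRR's actual code; the enum and function names are hypothetical. It assumes the route is no longer refreshed by any neighbor. If some neighbor keeps re-advertising the route, the timeout restarts on every update and the entry never expires. The age shown by "sh ip ro" is also not necessarily the time since the last RIP update.

```python
from enum import Enum


class RipRouteState(Enum):
    """Hypothetical names for the three stages of a RIP route's life."""
    VALID = "valid"
    EXPIRED = "expired (metric 16, awaiting garbage collection)"
    DELETED = "deleted"


def rip_route_state(seconds_since_update: int,
                    timeout: int = 30,
                    garbage: int = 30) -> RipRouteState:
    """Classify a RIP route by the time since it was last refreshed.

    Defaults mirror 'timers basic 15 30 30' from the ripd.conf in this
    issue (timeout=30, garbage=30). Assumes no further updates arrive.
    """
    if seconds_since_update < timeout:
        return RipRouteState.VALID
    if seconds_since_update < timeout + garbage:
        return RipRouteState.EXPIRED
    return RipRouteState.DELETED


print(rip_route_state(15).name)                    # VALID
print(rip_route_state(45).name)                    # EXPIRED
print(rip_route_state(6 * 3600 + 53 * 60).name)    # DELETED
```

Under these assumptions, an unrefreshed route would be gone after 60 seconds, so an entry aged 06:53:00 would only be consistent with either continued updates from a neighbor or a route that is not being cleaned up.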

seanfulton · 2019-10-16T20:37:25Z

I took the default routes out of netplan.yaml on nj34 and ran netplan apply. The kernel routes above stayed in the routing table. I deleted both with ip route del 0.0.0.0/0.

I then ran netstat -nr | grep 0.0.0.0 several times and watched the default route get acquired from different machines in my network, until it stopped and there was no more default route. Curious, I logged into zebra, ran sh ip ro, and got the following:
nj34.onecount.net> sh ip ro
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route

R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 00:00:15
R>* 10.0.0.9/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
R>* 10.0.0.10/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
C>* 10.10.1.0/24 is directly connected, primary-lan, 00:05:10
R>* 10.10.1.254/32 [120/2] via 10.10.2.1, backup-lan, 00:05:09
C>* 10.10.2.0/24 is directly connected, backup-lan, 00:05:10
R>* 10.10.2.254/32 [120/2] via 10.10.1.1, primary-lan, 00:04:58
R>* 10.10.4.1/32 [120/2] via 10.10.2.26, backup-lan, 00:05:09
R>* 10.10.4.2/32 [120/2] via 10.10.1.27, primary-lan, 00:05:09
R>* 10.10.4.3/32 [120/2] via 10.10.2.26, backup-lan, 00:05:09

So even after I deleted the route manually, it is being held (long past all timers). I finally restarted FRR and it picked up the default from one of the routers.

Very odd behavior.
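
A stuck entry like the one above can be spotted automatically. The following is a rough sketch, assuming the "sh ip ro" text has already been captured into a string; the regular expression and the function name are my own and only cover the IPv4 "via" lines shown in this issue.

```python
import re
from typing import Dict, List

# Matches lines such as:
#   R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 00:00:15
#   R>* 10.0.0.9/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
ROUTE_RE = re.compile(
    r"^(?P<code>[A-Za-z])(?P<flags>[>*\s]*)\s*"
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+"
    r"\[(?P<distance>\d+)/(?P<metric>\d+)\]\s+"
    r"via\s+(?P<nexthop>\d+\.\d+\.\d+\.\d+)"
    r"(?P<rest>.*)$"
)


def inactive_rip_routes(show_ip_route_output: str) -> List[Dict[str, str]]:
    """Return prefix and next hop of every RIP route flagged 'inactive'."""
    found = []
    for line in show_ip_route_output.splitlines():
        m = ROUTE_RE.match(line.strip())
        if m is None:
            continue  # header lines, connected routes, blank lines
        if m["code"] == "R" and "inactive" in m["rest"]:
            found.append({"prefix": m["prefix"], "nexthop": m["nexthop"]})
    return found


sample = """\
R 0.0.0.0/0 [120/2] via 10.10.2.254 inactive, 00:00:15
R>* 10.0.0.9/32 [120/2] via 10.10.1.2, primary-lan, 00:05:09
C>* 10.10.1.0/24 is directly connected, primary-lan, 00:05:10
"""

print(inactive_rip_routes(sample))
# [{'prefix': '0.0.0.0/0', 'nexthop': '10.10.2.254'}]
```

Running something like this periodically would at least flag the condition before a node loses its default route; it does not address the underlying cause.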

lucize · 2019-10-17T07:06:28Z

Can you reproduce it?
I think it is a Zebra issue. If you can, try the 4.0.1 branch and revert like #5159 (comment). I didn't have time to revert and make it work for 5, 6, 7.

qlyoung · 2019-10-22T15:14:58Z

@seanfulton can you possibly try to recreate this on a later version of CentOS? We don't really do regression testing for 6 anymore given that it's more or less EOL at this point.

seanfulton · 2019-11-05T14:11:40Z

I can confirm this is happening on CentOS 7, FRR 7.2. Same exact behavior.

seanfulton · 2019-11-05T19:42:15Z

R 0.0.0.0/0 [120/2] via 10.10.1.254 inactive, 01:30:53
R>* 10.0.0.9/32 [120/2] via 10.10.2.2, bond1, 00:25:34
R>* 10.0.0.10/32 [120/2] via 10.10.2.2, bond1, 00:25:34
R>* 10.0.3.0/24 [120/2] via 10.10.2.34, bond1, 04:08:13
C>* 10.10.1.0/24 is directly connected, bond0, 04:48:08
R>* 10.10.1.254/32 [120/2] via 10.10.2.1, bond1, 04:47:55
C>* 10.10.2.0/24 is directly connected, bond1, 04:48:08
R>* 10.10.2.254/32 [120/2] via 10.10.1.1, bond0, 00:05:22
R>* 10.10.4.1/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R>* 10.10.4.2/32 [120/2] via 10.10.1.27, bond0, 04:48:06
R>* 10.10.4.3/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R>* 10.10.4.4/32 [120/2] via 10.10.1.26, bond0, 04:47:57
R>* 10.10.4.5/32 [120/2] via 10.10.1.27, bond0, 04:48:06
K>* 10.10.4.7/32 [0/0] is directly connected, venet0, 04:48:08
K>* 10.10.4.8/32 [0/0] is directly connected, venet0, 04:48:08
R>* 10.10.4.9/32 [120/2] via 10.10.1.19, bond0, 01:45:23
R>* 10.10.4.11/32 [120/2] via 10.10.1.31, bond0, 04:48:06
R>* 10.10.4.12/32 [120/2] via 10.10.1.4, bond0, 04:48:06
R>* 10.10.4.13/32 [120/2] via 10.10.1.30, bond0, 04:47:55
K>* 10.10.4.14/32 [0/0] is directly connected, venet0, 04:48:08
R>* 10.10.4.15/32 [120/2] via 10.10.1.5, bond0, 04:48:06
R>* 10.10.4.16/32 [120/2] via 10.10.1.6, bond0, 04:48:06
R>* 10.10.4.17/32 [120/2] via 10.10.2.35, bond1, 03:36:30
R>* 10.10.4.20/32 [120/2] via 10.10.1.26, bond0, 04:47:57
K>* 10.10.4.21/32 [0/0] is directly connected, venet0, 04:48:08

seanfulton · 2019-11-17T13:32:04Z

What do you want me to do here? This is becoming very problematic for us. It's happening on CentOS 6, CentOS 7, and Ubuntu 18.04 on the 7.2 versions.

seanfulton · 2019-11-26T17:41:10Z

Hey guys, this is a serious issue. I'm reverting all of our nodes back to Quagga until someone figures this out. Too risky to continue in production with this.
Happy to test anything any time, but this is not getting me where I need to be.
This is still a problem.

sean

rzalamena · 2023-09-22T13:59:31Z

Seems related to #13561

seanfulton added the triage Needs further investigation label Oct 16, 2019

qlyoung added the rip label Oct 22, 2019
