Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BGP VRF (netns) test failures #1874

Closed
rzalamena opened this issue Mar 12, 2018 · 13 comments
Closed

BGP VRF (netns) test failures #1874

rzalamena opened this issue Mar 12, 2018 · 13 comments
Labels

Comments

@rzalamena
Copy link
Member

It was noticed recently that the netns tests are failing too often. After some investigation it seems that the bug lies in i386 VRF with netns.

Here are some samples of failing tests (i386):

They all fail with the same zebra log errors:
zebra standard output:

2018/03/11 05:12:08 ZEBRA: VRF_GET: (NULL)(1)                                                                                                                                                 
2018/03/11 05:12:53 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/11 05:12:53 ZEBRA: client 18 disconnected. 0 system routes removed from the rib
2018/03/11 05:12:53 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/11 05:12:53 ZEBRA: client 17 disconnected. 0 system routes removed from the rib
2018/03/11 05:12:53 ZEBRA: Terminating on signal
2018/03/11 05:12:53 ZEBRA: vrf_terminate: Shutting down vrf subsystem
2018/03/11 05:12:53 ZEBRA: VRF 1 is to be deleted.
2018/03/11 05:12:53 ZEBRA: VRF 1 is to be disabled.
2018/03/11 05:12:53 ZEBRA: VRF 4294967295 is to be deleted.
ZEBRA: Received signal 11 at 1520770373 (si_addr 0x4, PC 0xb7f0af66); aborting...
Program counter: /usr/lib/libfrr.so.0(hash_clean+0x26)[0xb7f0af66]
Backtrace for 15 stack frames:
/usr/lib/libfrr.so.0(zlog_backtrace_sigsafe+0x48)[0xb7f13e58]
/usr/lib/libfrr.so.0(zlog_signal+0x340)[0xb7f14690]
/usr/lib/libfrr.so.0(+0x46ded)[0xb7f29ded]
[0xb7f8bcfc]
/usr/lib/libfrr.so.0(hash_clean+0x26)[0xb7f0af66]
/usr/lib/frr/zebra(zebra_ns_disable+0x2f)[0x4cb4cf]
/usr/lib/frr/zebra(zebra_ns_disabled+0x49)[0x4cb5f9]
/usr/lib/libfrr.so.0(ns_walk_func+0x37)[0xb7f18047]
/usr/lib/frr/zebra(+0x1aa97)[0x4b9a97]
/usr/lib/libfrr.so.0(quagga_sigevent_process+0x50)[0xb7f29ea0]
/usr/lib/libfrr.so.0(thread_fetch+0x608)[0xb7f36348]
/usr/lib/libfrr.so.0(frr_run+0xb6)[0xb7f12726]
/usr/lib/frr/zebra(main+0x318)[0x4ac5f8]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf7)[0xb7d13637]
/usr/lib/frr/zebra(+0xdc45)[0x4acc45]
no thread information available

Zebra standard error output:

2018/03/11 05:12:05 errors: ZEBRA: Can not enable NS 4294967295: Permission denied!                                                                                                           
2018/03/11 05:12:05 warnings: ZEBRA: Can not associate NS 4294967295 with NETNS /run/netns/r1-cust1
2018/03/11 05:12:05 warnings: ZEBRA: NS notify : failed to create NS /run/netns/r1-cust1
% VRF r1-cust1 not active
2018/03/11 05:12:05 errors: ZEBRA: ERROR: No such command on config line 3:  ip address 10.0.1.1/24
core_handler: showing active allocations in memory group libfrr
core_handler: memstats:  Host config                   :      3 * (variably sized)
core_handler: memstats:  Command Tokens                :   3330 *         48
core_handler: memstats:  Command Token Text            :   2471 * (variably sized)
core_handler: memstats:  Command Token Help            :   2471 * (variably sized)
core_handler: memstats:  Command Argument Name         :    618 * (variably sized)
core_handler: memstats:  Graph                         :     23 *          4
core_handler: memstats:  Graph Node                    :   3912 *         16
core_handler: memstats:  Hash                          :     76 * (variably sized)
core_handler: memstats:  Hash Bucket                   :    591 *         16
core_handler: memstats:  Hash Index                    :     38 * (variably sized)
core_handler: memstats:  Hook entry                    :      4 *         28
core_handler: memstats:  Link List                     :     11 *         20
core_handler: memstats:  Link Node                     :     48 *         12
core_handler: memstats:  Logging                       :      2 * (variably sized)
core_handler: memstats:  Temporary memory              :      1 * (variably sized)
core_handler: memstats:  NetNS Context                 :      2 *         40
core_handler: memstats:  NetNS Name                    :      2 * (variably sized)
core_handler: memstats:  Priority queue                :      1 *         20
core_handler: memstats:  Priority queue data           :      1 *        128
core_handler: memstats:  Privilege information         :      2 * (variably sized)
core_handler: memstats:  Route table                   :      2 *         24
core_handler: memstats:  Thread                        :     13 * (variably sized)
core_handler: memstats:  Thread master                 :      3 * (variably sized)
core_handler: memstats:  Thread stats                  :      9 *         36
core_handler: memstats:  Vector                        :   7875 *         12
core_handler: memstats:  Vector index                  :   7875 * (variably sized)
core_handler: memstats:  Work queue                    :      2 * (variably sized)
core_handler: memstats:  Work queue name string        :      1 * (variably sized)
core_handler: showing active allocations in memory group Label Manager
core_handler: showing active allocations in memory group zebra
core_handler: memstats:  Zebra Name Space              :      2 *        252

This is the topotest output:

RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
2018-03-11 05:12:49,790 ERROR: assert failed at "bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_convergence": BGP router network did not converge
assert json["ipv4Unicast"]["peers"]["10.0.1.101"]["state"] value is different (
  --- Expected value
  +++ Current value
  @@ -1 +1 @@
  -"Established"
  +"Idle")
  json["ipv4Unicast"]["peers"]["10.0.1.101"]["prefixReceivedCount"] value is different (
  --- Expected value
  +++ Current value
  @@ -1 +1 @@
  -10
  +0)
2018-03-11 05:12:53,007 ERROR: assert failed at "bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_vrf_netns": expected routes in "show ip bgp vrf r1-cust1 ipv4" output
assert expected key(s) ['10.201.3.0/24', '10.201.6.0/24', '10.201.5.0/24', '10.201.8.0/24', '10.201.9.0/24', '10.201.0.0/24', '10.201.2.0/24', '10.201.1.0/24', '10.201.4.0/24', '10.201.7.0/24'] in json["routes"] (have []):
  --- Expected value
  +++ Current value
  @@ -1,52 +1 @@
  -{
  -    "10.201.0.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.1.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.2.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.3.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.4.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.5.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.6.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.7.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.8.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.9.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ]
  -}
  +{}

The amd64 failure that happened seems to be related with timing issues.

This is the only occurance I could find:

Zebra output log:

2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)                                                                                                                                 
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:38:16 ZEBRA: zebra_redistribute_delete: Specified Route Type 0 does not exist
2018/03/09 15:38:16 ZEBRA: zebra_redistribute_delete: Specified Route Type 0 does not exist
2018/03/09 15:38:16 ZEBRA: zebra_redistribute_delete: Specified Route Type 0 does not exist
2018/03/09 15:38:16 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/09 15:38:16 ZEBRA: client 22 disconnected. 0 system routes removed from the rib
2018/03/09 15:38:16 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/09 15:38:16 ZEBRA: client 21 disconnected. 0 system routes removed from the rib
2018/03/09 15:38:16 ZEBRA: Terminating on signal
2018/03/09 15:38:16 ZEBRA: vrf_terminate: Shutting down vrf subsystem
2018/03/09 15:38:16 ZEBRA: VRF 1 is to be deleted.
2018/03/09 15:38:16 ZEBRA: VRF 1 is to be disabled.
2018/03/09 15:38:16 ZEBRA: VRF 2 is to be deleted.
2018/03/09 15:38:16 ZEBRA: VRF 2 is to be disabled.

There were no error messages or crashes.

Topotest output:

2018-03-09 15:38:12,968 ERROR: assert failed at "bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_convergence": BGP router network did not converge
assert json["ipv4Unicast"]["peers"]["10.0.1.101"]["state"] value is different (
  --- Expected value
  +++ Current value
  @@ -1 +1 @@
  -"Established"
  +"Active")
  json["ipv4Unicast"]["peers"]["10.0.1.101"]["prefixReceivedCount"] value is different (
  --- Expected value
  +++ Current value
  @@ -1 +1 @@
  -10
  +0)
2018-03-09 15:38:16,219 ERROR: assert failed at "bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_vrf_netns": expected routes in "show ip bgp vrf r1-cust1 ipv4" output
assert expected key(s) ['10.201.3.0/24', '10.201.6.0/24', '10.201.5.0/24', '10.201.8.0/24', '10.201.9.0/24', '10.201.0.0/24', '10.201.2.0/24', '10.201.1.0/24', '10.201.4.0/24', '10.201.7.0/24'] in json["routes"] (have []):
  --- Expected value
  +++ Current value
  @@ -1,52 +1 @@
  -{
  -    "10.201.0.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.1.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.2.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.3.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.4.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.5.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.6.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.7.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.8.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ], 
  -    "10.201.9.0/24": [
  -        {
  -            "valid": true
  -        }
  -    ]
  -}
  +{}

For the amd64 error I think they could be fixed by simply bumping the amount of time to wait for its convergence.

@louberger
Copy link
Member

louberger commented Mar 12, 2018 via email

@rwestphal
Copy link
Member

I think this problem is already affecting our normal workflow. Couldn't we temporarily disable the netns topotest until the problem is found and fixed?

@qlyoung qlyoung added the bug label Mar 13, 2018
@pguibert6WIND
Copy link
Member

  • I am looking right now the i386 issue.
    I am surprised that i386 is still in use ( 32 bit). On last week discussion , we discussed EOL for some linux distros, but we did not discuss 32 bit. I assume this is because 32 bit is still popular.
  • for the timer issue, I think Lou may have some idea behind that.

@louberger
Copy link
Member

#1876 fixes the issue, but agree that this is probably avoidance rather than a root cause fix

@louberger
Copy link
Member

@pguibert6WIND It looks to me that an error return (-1) may be mapped into VRF_UNKNOWN...

but this is just a WAG based on

2018/03/11 05:12:53 ZEBRA: VRF 4294967295 is to be deleted.

@pguibert6WIND
Copy link
Member

pguibert6WIND commented Mar 13, 2018

There are 3 issues, including 2 blocking, that I can see

  • the display ( out of scope for now). displaying -1 instead of 4294.
    This is not what triggered the opening of the issue
  • an impossiblity to use NETNS on the 32bit system I used ( I used 4.5 kernel on 32 bit system, and I could not figure out how to make netns work until now).
  • an inconsistency that indicates that zebra vrf considers ZVRF is disabled ( value is -1 , that is to say 4294967295), while lib vrf considers VRF is enabled. I think that inconsistency is the root cause of the crash.

I have 2 pull requests to submit:

@pguibert6WIND
Copy link
Member

pguibert6WIND commented Mar 13, 2018

for timer issue: increasing timer from 90 to 120 does not solve the issue.
( see test result amd64: #1880).
observation differs from #1874, since the BGP VRF is not enabled on amd64 => https://ci1.netdef.org/download/FRR-FRRPULLREQ-TOPOU1604/build_logs/FRR-FRRPULLREQ-TOPOU1604-2863.log )

@rzalamena
Copy link
Member Author

New artifacts output:
https://ci1.netdef.org/artifact/FRR-FRRPULLREQ/TOPOU1604/build-2863/TestExecutionLogs/bgp_vrf_netns.test_bgp_vrf_netns_topo

Seems that zebra is crashing even on amd64:
zebra output

2018/03/13 05:15:47 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/13 05:15:47 ZEBRA: client 18 disconnected. 0 system routes removed from the rib
2018/03/13 05:15:47 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/13 05:15:47 ZEBRA: client 17 disconnected. 0 system routes removed from the rib
2018/03/13 05:15:47 ZEBRA: Terminating on signal
2018/03/13 05:15:47 ZEBRA: vrf_terminate: Shutting down vrf subsystem
2018/03/13 05:15:47 ZEBRA: VRF 1 is to be deleted.
2018/03/13 05:15:47 ZEBRA: VRF 1 is to be disabled.
2018/03/13 05:15:47 ZEBRA: VRF 4294967295 is to be deleted.
ZEBRA: Received signal 11 at 1520943347 (si_addr 0x8, PC 0x7f1d6d89f6a7); aborting...
Program counter: /usr/lib/libfrr.so.0(hash_clean+0x17)[0x7f1d6d89f6a7]
Backtrace for 15 stack frames:
/usr/lib/libfrr.so.0(zlog_backtrace_sigsafe+0x48)[0x7f1d6d8a7fd8]
/usr/lib/libfrr.so.0(zlog_signal+0x33b)[0x7f1d6d8a87bb]
/usr/lib/libfrr.so.0(+0x4e62f)[0x7f1d6d8bc62f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f1d6d24f390]
/usr/lib/libfrr.so.0(hash_clean+0x17)[0x7f1d6d89f6a7]
/usr/lib/frr/zebra(zebra_ns_disable+0x21)[0x55e516a52031]
/usr/lib/frr/zebra(zebra_ns_disabled+0x42)[0x55e516a52132]
/usr/lib/libfrr.so.0(ns_walk_func+0x2e)[0x7f1d6d8ac22e]
/usr/lib/frr/zebra(+0x222be)[0x55e516a412be]
/usr/lib/libfrr.so.0(quagga_sigevent_process+0x58)[0x7f1d6d8bc6e8]
/usr/lib/libfrr.so.0(thread_fetch+0x5fd)[0x7f1d6d8c894d]
/usr/lib/libfrr.so.0(frr_run+0xc3)[0x7f1d6d8a6833]
/usr/lib/frr/zebra(main+0x2ad)[0x55e516a349fd]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f1d6ce94830]
/usr/lib/frr/zebra(_start+0x29)[0x55e516a35049]
no thread information available

zebra error output

2018/03/13 05:14:03 errors: ZEBRA: Can not enable NS 2: Operation not permitted!
2018/03/13 05:14:03 warnings: ZEBRA: Can not associate NS 2 with NETNS /run/netns/r1-cust1
2018/03/13 05:14:03 warnings: ZEBRA: NS notify : failed to create NS /run/netns/r1-cust1
% VRF r1-cust1 not active
2018/03/13 05:14:03 errors: ZEBRA: ERROR: No such command on config line 3:  ip address 10.0.1.1/24
core_handler: showing active allocations in memory group libfrr
core_handler: memstats:  Host config                   :      3 * (variably sized)
core_handler: memstats:  Command Tokens                :   3330 *         72
core_handler: memstats:  Command Token Text            :   2471 * (variably sized)
core_handler: memstats:  Command Token Help            :   2471 * (variably sized)
core_handler: memstats:  Command Argument Name         :    618 * (variably sized)
core_handler: memstats:  Graph                         :     23 *          8
core_handler: memstats:  Graph Node                    :   3912 *         32
core_handler: memstats:  Hash                          :     76 * (variably sized)
core_handler: memstats:  Hash Bucket                   :    593 *         32
core_handler: memstats:  Hash Index                    :     38 * (variably sized)
core_handler: memstats:  Hook entry                    :      4 *         48
core_handler: memstats:  Link List                     :     11 *         40
core_handler: memstats:  Link Node                     :     48 *         24
core_handler: memstats:  Logging                       :      2 * (variably sized)
core_handler: memstats:  Temporary memory              :      1 * (variably sized)
core_handler: memstats:  NetNS Context                 :      2 *         80
core_handler: memstats:  NetNS Name                    :      2 * (variably sized)
core_handler: memstats:  Priority queue                :      1 *         32
core_handler: memstats:  Priority queue data           :      1 *        256
core_handler: memstats:  Privilege information         :      2 * (variably sized)
core_handler: memstats:  Stream                        :      4 *         40
core_handler: memstats:  Stream data                   :      4 * (variably sized)
core_handler: memstats:  Route table                   :      2 *         48
core_handler: memstats:  Thread                        :     14 * (variably sized)
core_handler: memstats:  Thread master                 :      3 * (variably sized)
core_handler: memstats:  Thread stats                  :     11 *         64
core_handler: memstats:  Vector                        :   7875 *         16
core_handler: memstats:  Vector index                  :   7875 * (variably sized)
core_handler: memstats:  Work queue                    :      2 * (variably sized)
core_handler: memstats:  Work queue name string        :      1 * (variably sized)
core_handler: showing active allocations in memory group Label Manager
core_handler: showing active allocations in memory group zebra
core_handler: memstats:  Zebra Name Space              :      2 *        288

@pguibert6WIND
Copy link
Member

ok. I Get it.
both issues are related to the way frr is compiled ( user capabilities...).
#1880 should give the answer to the fixes.

and I think topotest #77 should be reverted

@pguibert6WIND
Copy link
Member

let's see what #1880 will output as test result.

@pguibert6WIND
Copy link
Member

it seems to be better.
I just remove the first and last commit from the review.
IT should be ready to be merged.

@pguibert6WIND
Copy link
Member

#1880

@rwestphal
Copy link
Member

Fixed by #1880.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants