-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BGP VRF (netns) test failures #1874
Comments
I have a pull request in to address the timing issue on AMD ... it failed
another test, and have no idea why
…On March 12, 2018 3:51:51 PM Rafael Zalamena ***@***.***> wrote:
It was noticed recently that the netns tests are failing too often. After
some investigation it seems that the bug lies in `i386` VRF with netns.
Here are some samples of failing tests (i386):
* https://ci1.netdef.org/browse/FRR-FRRPULLREQ-TOPOI386-2838/artifact
* https://ci1.netdef.org/browse/FRR-FRRPULLREQ-TOPOI386-2848/artifact
* https://ci1.netdef.org/browse/FRR-FRRPULLREQ-TOPOI386-2845/artifact
* https://ci1.netdef.org/browse/FRR-FRRPULLREQ-TOPOI386-2842/artifact
They all fail with the same `zebra` log errors:
zebra standard output:
```
2018/03/11 05:12:08 ZEBRA: VRF_GET: (NULL)(1)
2018/03/11 05:12:53 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/11 05:12:53 ZEBRA: client 18 disconnected. 0 system routes removed
from the rib
2018/03/11 05:12:53 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/11 05:12:53 ZEBRA: client 17 disconnected. 0 system routes removed
from the rib
2018/03/11 05:12:53 ZEBRA: Terminating on signal
2018/03/11 05:12:53 ZEBRA: vrf_terminate: Shutting down vrf subsystem
2018/03/11 05:12:53 ZEBRA: VRF 1 is to be deleted.
2018/03/11 05:12:53 ZEBRA: VRF 1 is to be disabled.
2018/03/11 05:12:53 ZEBRA: VRF 4294967295 is to be deleted.
ZEBRA: Received signal 11 at 1520770373 (si_addr 0x4, PC 0xb7f0af66);
aborting...
Program counter: /usr/lib/libfrr.so.0(hash_clean+0x26)[0xb7f0af66]
Backtrace for 15 stack frames:
/usr/lib/libfrr.so.0(zlog_backtrace_sigsafe+0x48)[0xb7f13e58]
/usr/lib/libfrr.so.0(zlog_signal+0x340)[0xb7f14690]
/usr/lib/libfrr.so.0(+0x46ded)[0xb7f29ded]
[0xb7f8bcfc]
/usr/lib/libfrr.so.0(hash_clean+0x26)[0xb7f0af66]
/usr/lib/frr/zebra(zebra_ns_disable+0x2f)[0x4cb4cf]
/usr/lib/frr/zebra(zebra_ns_disabled+0x49)[0x4cb5f9]
/usr/lib/libfrr.so.0(ns_walk_func+0x37)[0xb7f18047]
/usr/lib/frr/zebra(+0x1aa97)[0x4b9a97]
/usr/lib/libfrr.so.0(quagga_sigevent_process+0x50)[0xb7f29ea0]
/usr/lib/libfrr.so.0(thread_fetch+0x608)[0xb7f36348]
/usr/lib/libfrr.so.0(frr_run+0xb6)[0xb7f12726]
/usr/lib/frr/zebra(main+0x318)[0x4ac5f8]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf7)[0xb7d13637]
/usr/lib/frr/zebra(+0xdc45)[0x4acc45]
no thread information available
```
Zebra standard error output:
```
2018/03/11 05:12:05 errors: ZEBRA: Can not enable NS 4294967295: Permission
denied!
2018/03/11 05:12:05 warnings: ZEBRA: Can not associate NS 4294967295 with
NETNS /run/netns/r1-cust1
2018/03/11 05:12:05 warnings: ZEBRA: NS notify : failed to create NS
/run/netns/r1-cust1
% VRF r1-cust1 not active
2018/03/11 05:12:05 errors: ZEBRA: ERROR: No such command on config line 3:
ip address 10.0.1.1/24
core_handler: showing active allocations in memory group libfrr
core_handler: memstats: Host config : 3 * (variably
sized)
core_handler: memstats: Command Tokens : 3330 * 48
core_handler: memstats: Command Token Text : 2471 * (variably
sized)
core_handler: memstats: Command Token Help : 2471 * (variably
sized)
core_handler: memstats: Command Argument Name : 618 * (variably
sized)
core_handler: memstats: Graph : 23 * 4
core_handler: memstats: Graph Node : 3912 * 16
core_handler: memstats: Hash : 76 * (variably
sized)
core_handler: memstats: Hash Bucket : 591 * 16
core_handler: memstats: Hash Index : 38 * (variably
sized)
core_handler: memstats: Hook entry : 4 * 28
core_handler: memstats: Link List : 11 * 20
core_handler: memstats: Link Node : 48 * 12
core_handler: memstats: Logging : 2 * (variably
sized)
core_handler: memstats: Temporary memory : 1 * (variably
sized)
core_handler: memstats: NetNS Context : 2 * 40
core_handler: memstats: NetNS Name : 2 * (variably
sized)
core_handler: memstats: Priority queue : 1 * 20
core_handler: memstats: Priority queue data : 1 * 128
core_handler: memstats: Privilege information : 2 * (variably
sized)
core_handler: memstats: Route table : 2 * 24
core_handler: memstats: Thread : 13 * (variably
sized)
core_handler: memstats: Thread master : 3 * (variably
sized)
core_handler: memstats: Thread stats : 9 * 36
core_handler: memstats: Vector : 7875 * 12
core_handler: memstats: Vector index : 7875 * (variably
sized)
core_handler: memstats: Work queue : 2 * (variably
sized)
core_handler: memstats: Work queue name string : 1 * (variably
sized)
core_handler: showing active allocations in memory group Label Manager
core_handler: showing active allocations in memory group zebra
core_handler: memstats: Zebra Name Space : 2 * 252
```
This is the topotest output:
```
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
2018-03-11 05:12:49,790 ERROR: assert failed at
"bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_convergence": BGP router
network did not converge
assert json["ipv4Unicast"]["peers"]["10.0.1.101"]["state"] value is different (
--- Expected value
+++ Current value
@@ -1 +1 @@
-"Established"
+"Idle")
json["ipv4Unicast"]["peers"]["10.0.1.101"]["prefixReceivedCount"] value is
different (
--- Expected value
+++ Current value
@@ -1 +1 @@
-10
+0)
2018-03-11 05:12:53,007 ERROR: assert failed at
"bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_vrf_netns": expected routes
in "show ip bgp vrf r1-cust1 ipv4" output
assert expected key(s) ['10.201.3.0/24', '10.201.6.0/24', '10.201.5.0/24',
'10.201.8.0/24', '10.201.9.0/24', '10.201.0.0/24', '10.201.2.0/24',
'10.201.1.0/24', '10.201.4.0/24', '10.201.7.0/24'] in json["routes"] (have []):
--- Expected value
+++ Current value
@@ -1,52 +1 @@
-{
- "10.201.0.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.1.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.2.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.3.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.4.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.5.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.6.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.7.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.8.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.9.0/24": [
- {
- "valid": true
- }
- ]
-}
+{}
```
---
The `amd64` failure that happened seems to be related with timing issues.
This is the only occurance I could find:
* https://ci1.netdef.org/browse/FRR-FRRPULLREQ-TOPOU1604-2831/artifact
Zebra output log:
```
2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:37:43 ZEBRA: VRF_SWITCH: /run/netns/r1-cust1(2)
2018/03/09 15:37:43 ZEBRA: VRF_SWITCHBACK
2018/03/09 15:38:16 ZEBRA: zebra_redistribute_delete: Specified Route Type
0 does not exist
2018/03/09 15:38:16 ZEBRA: zebra_redistribute_delete: Specified Route Type
0 does not exist
2018/03/09 15:38:16 ZEBRA: zebra_redistribute_delete: Specified Route Type
0 does not exist
2018/03/09 15:38:16 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/09 15:38:16 ZEBRA: client 22 disconnected. 0 system routes removed
from the rib
2018/03/09 15:38:16 ZEBRA: release_daemon_chunks: Released 0 label chunks
2018/03/09 15:38:16 ZEBRA: client 21 disconnected. 0 system routes removed
from the rib
2018/03/09 15:38:16 ZEBRA: Terminating on signal
2018/03/09 15:38:16 ZEBRA: vrf_terminate: Shutting down vrf subsystem
2018/03/09 15:38:16 ZEBRA: VRF 1 is to be deleted.
2018/03/09 15:38:16 ZEBRA: VRF 1 is to be disabled.
2018/03/09 15:38:16 ZEBRA: VRF 2 is to be deleted.
2018/03/09 15:38:16 ZEBRA: VRF 2 is to be disabled.
```
There were no error messages or crashes.
Topotest output:
```
2018-03-09 15:38:12,968 ERROR: assert failed at
"bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_convergence": BGP router
network did not converge
assert json["ipv4Unicast"]["peers"]["10.0.1.101"]["state"] value is different (
--- Expected value
+++ Current value
@@ -1 +1 @@
-"Established"
+"Active")
json["ipv4Unicast"]["peers"]["10.0.1.101"]["prefixReceivedCount"] value is
different (
--- Expected value
+++ Current value
@@ -1 +1 @@
-10
+0)
2018-03-09 15:38:16,219 ERROR: assert failed at
"bgp_vrf_netns.test_bgp_vrf_netns_topo/test_bgp_vrf_netns": expected routes
in "show ip bgp vrf r1-cust1 ipv4" output
assert expected key(s) ['10.201.3.0/24', '10.201.6.0/24', '10.201.5.0/24',
'10.201.8.0/24', '10.201.9.0/24', '10.201.0.0/24', '10.201.2.0/24',
'10.201.1.0/24', '10.201.4.0/24', '10.201.7.0/24'] in json["routes"] (have []):
--- Expected value
+++ Current value
@@ -1,52 +1 @@
-{
- "10.201.0.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.1.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.2.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.3.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.4.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.5.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.6.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.7.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.8.0/24": [
- {
- "valid": true
- }
- ],
- "10.201.9.0/24": [
- {
- "valid": true
- }
- ]
-}
+{}
```
For the `amd64` error I think they could be fixed by simply bumping the
amount of time to wait for its convergence.
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
#1874
|
I think this problem is already affecting our normal workflow. Couldn't we temporarily disable the netns topotest until the problem is found and fixed? |
|
#1876 fixes the issue, but agree that this is probably avoidance rather than a root cause fix |
@pguibert6WIND It looks to me that an error return (-1) may be mapped into VRF_UNKNOWN... but this is just a WAG based on 2018/03/11 05:12:53 ZEBRA: VRF 4294967295 is to be deleted. |
There are 3 issues, including 2 blocking, that I can see
I have 2 pull requests to submit:
|
for timer issue: increasing timer from 90 to 120 does not solve the issue. |
New artifacts output: Seems that zebra is crashing even on
zebra error output
|
let's see what #1880 will output as test result. |
it seems to be better. |
Fixed by #1880. |
It was noticed recently that the netns tests are failing too often. After some investigation it seems that the bug lies in
i386
VRF with netns.Here are some samples of failing tests (i386):
They all fail with the same
zebra
log errors:zebra standard output:
Zebra standard error output:
This is the topotest output:
The
amd64
failure that happened seems to be related with timing issues.This is the only occurance I could find:
Zebra output log:
There were no error messages or crashes.
Topotest output:
For the
amd64
error I think they could be fixed by simply bumping the amount of time to wait for its convergence.The text was updated successfully, but these errors were encountered: