GoES status not OK on i-32 after adding 16K static routes #145

Open
sandeep-dutta opened this issue Jan 11, 2019 · 9 comments
Comments


sandeep-dutta commented Jan 11, 2019

Goes Version
root@invader29:/home/sandeep# goes vnetd -version
fe1: v1.1.3
fe1a: v1.1.0
vnet-platina-mk1: v1.0.0

Goes build checksum- 5046b7c2cdea8604d331dd7e5dd2fb9c85fa21ff

Kernel version
root@invader29:/home/sandeep# dpkg --list |grep kernel
ii linux-image-4.13-platina-mk1 4.13-165-gbf3b5fef4591 amd64 Linux kernel, version 4.13-platina-mk1

We noticed that when we add 16K static routes on invader-32 (172.17.2.32) and then restart goes, the vnetd service fails to come up. So far this issue has been observed only on this invader; the other invaders participating in regression have vnetd up and running after adding 16K routes and restarting goes.

Steps to reproduce

  • Copy the interface file containing 16K static routes (kept under /home/sandeep) to /etc/network/interfaces

root@invader32:/home/sandeep# cp 16k_static_route_interfaces /etc/network/interfaces

  • Execute the following commands to bring the interfaces down and up and restart the goes services
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart

  • Notice that vnetd fails to come up; goes status reports the following (a scripted version of these steps is sketched after the output).

root@invader32:/home/sandeep# goes status
GOES status

Mode - XETH
PCI - OK
Check daemons - OK
Check Redis - OK
Check vnet - Not OK
status: vnetd daemon not responding
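
For reference, a rough scripted version of the reproduction steps above; the 60-second poll window is an illustrative choice here, not a value taken from the regression script.

    #!/bin/sh
    # Reproduce: install the 16K-route interfaces file, bounce the vnet
    # interfaces, restart goes, then poll "goes status" until vnetd responds.
    set -e
    cp 16k_static_route_interfaces /etc/network/interfaces
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart
    i=0
    while [ "$i" -lt 60 ]; do
        if ! goes status | grep -q 'vnetd daemon not responding'; then
            echo "vnetd responded after ${i}s"
            exit 0
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "vnetd still not responding after 60s" >&2
    exit 1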

@sandeep-dutta (Author)

Attached is the journalctl output for i-32 for the last 5 minutes.
journalctl_i32.txt

stigt (Collaborator) commented Jan 15, 2019

It's well known that if we go over the TCAM limit, vnet will call panic and crash. Did you see a vnet stack trace in /var/log/syslog to confirm whether this is the known panic? Of course we need to do something better than crash when there are too many TCAM entries, but I don't think that's in place yet.

@sandeep-dutta (Author)

We have not seen any panic trace in syslog for this issue. In this case we have not gone over the TCAM limit; we have used 16035 TCAM entries to store these routes.

sandeep@invader32:~$ ip route |wc -l
16035

kgkannan commented Jan 17, 2019

A couple of inferences based on the TH spec and the current goes driver support for the L3DEFIP (TCAM) table:

  1. L3DEFIP is currently configured to support a max of 16K entries (each entry is actually 2 half-entries; a half-entry is a 32b entry), so the physical TCAM limit is 8K rows, and each row can accommodate 2 32b entries => 16K 32b entries (see the quick check after this list).
  2. The goes driver (like the SDK) internally handles LPM ordering of the various prefix lengths; there can be corner/limit cases depending on the prefix lengths of the entries being added.
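
For reference, a quick back-of-the-envelope check of the numbers in point 1, assuming 8K physical rows and 2 32b half-entries per row, against the 16035 routes counted earlier in the thread:

    rows=8192                            # physical L3DEFIP TCAM rows
    per_row=2                            # 32b half-entries per row
    echo $(( rows * per_row ))           # 16384 max 32b (/32) entries
    echo $(( rows * per_row - 16035 ))   # 349 entries of headroom for this test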

Questions for the SQA team:
Please share the sequence and types of entries, with their prefix lengths, used in the test case.

Probable next steps for dev team:

  1. Check whether there are s/w counters at the vnet/fib level and the fe1 level that can be dumped after the add; this may need an instrumented image, as logging could be disabled at compile time.
  2. Force a core and dump these counters from the core file (one generic way is sketched below).
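
One generic way to do step 2, sketched on the assumption that vnetd is the Go daemon process and that GOTRACEBACK=crash can be set in its environment; the gcore alternative only assumes gdb is installed:

    ulimit -c unlimited
    pid=$(pgrep -f vnetd | head -1)
    # With GOTRACEBACK=crash in vnetd's environment, SIGABRT dumps all
    # goroutine stacks and leaves a core file for later inspection.
    kill -ABRT "$pid"
    # Alternative that does not kill the daemon (requires gdb's gcore):
    # gcore -o /tmp/vnetd.core "$pid"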

@sandeep-dutta (Author)

Hi Govind,
Please find attached the 16K interface file, which contains 16K routes. The prefix length for all static route entries is /32 (a hypothetical generator for a file of this shape is sketched below the attachment).
16k_static_route_interfaces.txt
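
For context, a hypothetical generator for an interfaces file of this shape; the interface name, addresses, and next hop below are made up, and only the /32 pattern and the ~16K route count match the attached file:

    {
      echo "auto xeth1"
      echo "iface xeth1 inet static"
      echo "    address 10.0.0.1/24"
      # 64 * 256 = 16384 host routes, all /32, via a single next hop
      for i in $(seq 0 63); do
        for j in $(seq 0 255); do
          echo "    up ip route add 20.${i}.${j}.1/32 via 10.0.0.2"
        done
      done
    } > 16k_static_route_interfaces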

@sandeep-dutta (Author)

The issue is again reproducible on i-32. Attaching logs captured by the show_tech.py script.

Current goes version running on i-32
root@invader32:/tmp/log# goes version
v1.2.0-rc.1

root@invader32:/tmp/log# dpkg --list |grep kernel
4.13.0-170-ga4eca81e3486
20190129_014719_069.zip

rondv (Contributor) commented Jan 30, 2019

Govind is working on this (unable to update assignee)

kgkannan commented Feb 1, 2019

Quick update on debugging:

  1. The problem reported is not a crash but a timeout failure reported by the script that performs the config steps to add 16K static routes followed by a goes restart. The script expects the goes-vnetd status to be OK within 40s (10s + 30s grace time).
  2. The problem was noticed frequently on inv32 but not seen on a similar node in the regression testbed.
  3. In addition, since the show-tech logs include syslog/journalctl, we noticed vnet-fib errors in the adj path:
    adjacency.go mpAdjForAdj: index out of range adj AdjNil

Inferences:

  1. After a goes restart, in addition to other fdb events, vnetd should get 16K+ fdb (route + neighbor) events from the kernel and then program 4*16K fib entries (fib and adj for 4 pipes) in TH via DMA writes. From repro attempts, Linux top shows vnet in a tight loop as expected for the I/O, but it mostly returns status OK within 10-15s; we did not see a case where it took > 30s.
  2. We noticed adj errors too on a few attempts but need additional vnet logs enabled; when the adj errors happen, fe1 does not show 16K routes programmed - it is possible the hang is avoided because vnet bails out with the adj error.

TBD:

  1. Working with Sandeep to isolate test environment variables, if any, and recreate the problem consistently.
  2. With baseline/consistent steps, try a reduced scale to debug better (with fewer logs and a smaller dataset).
  3. Get more vnet/fdb logs to debug the adj errors and triage further; get the exact goes version/tag to build an instrumented image with build flags.

@sandeep-dutta (Author)

Hi Govind,

While executing regression with the following GoES and kernel versions, we noticed that the vnetd service on i-30 (172.17.2.30) was not OK.

root@invader30:/tmp/log# goes version
v1.2.0-rc0
root@invader30:/tmp/log# dpkg --list |grep kernel
ii kmod 18-3 amd64 tools for managing Linux kernel modules
ii libdrm2:amd64 2.4.58-2 amd64 Userspace interface to kernel DRM services -- runtime
ii linux-image-4.13.0-platina-mk1 4.13.0-178-g13e3790c8eac amd64 Linux kernel, version 4.13.0-platina-mk1
ii rsyslog 8.4.2-1+deb8u2 amd64 reliable system and kernel logging daemon

root@invader30:/tmp/log# goes status
GOES status

Mode - XETH
PCI - OK
Check daemons - OK
Check Redis - OK
Check vnet - Not OK
status: vnetd daemon not responding

The step that caused the failure was bringing the interfaces down and up along with a goes restart after executing the 16K static route test case:

ifdown -a --allow vnet
ifup -a --allow vnet
goes restart

I manually tried running the commands, but vnetd failed to come up.

Test case steps

  • Copy the custom interface file with 16K static routes to /etc/network/interfaces

  • Execute the following commands to bring the interfaces down and up along with a goes restart
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart

  • Validate that 16K routes are published under Linux (ip route sh) and FRR (show ip route); see the validation sketch after this list

  • Once the results are validated, replace the 16K custom interface file with the default interface file

  • Execute the following commands to bring the interfaces down and up along with a goes restart
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart

  • vnetd fails to come up on i-30
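
A minimal sketch of the validation step above, assuming FRR's vtysh is installed and using the 16035 route count from the earlier comment as the expected value; counting "via" lines in the FRR output is only a rough approximation:

    expected=16035
    kernel=$(ip route show | wc -l)
    frr=$(vtysh -c 'show ip route' | grep -c 'via')
    echo "expected ${expected}  kernel ${kernel}  frr ${frr}"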

However, the issue did not occur on any of the other invaders of setup-1 (i-29, i-31 & i-32). I have not rebooted the invader and have kept it as is, so you could take a look at it. I will try to see if I can reproduce this on any other invader.

Please find attached the logs generated via the show_tech.py script.
20190208_053642_837.zip

rondv removed their assignment Jul 24, 2019