GoES status not OK on i-32 after adding 16K static routes #145

Open
sandeep-dutta opened this issue Jan 11, 2019 · 9 comments
Comments


sandeep-dutta commented Jan 11, 2019

Goes Version
root@invader29:/home/sandeep# goes vnetd -version
fe1: v1.1.3
fe1a: v1.1.0
vnet-platina-mk1: v1.0.0

Goes build checksum- 5046b7c2cdea8604d331dd7e5dd2fb9c85fa21ff

Kernel version
root@invader29:/home/sandeep# dpkg --list |grep kernel
ii linux-image-4.13-platina-mk1 4.13-165-gbf3b5fef4591 amd64 Linux kernel, version 4.13-platina-mk1

We noticed that when we add 16K static routes on invader-32 (172.17.2.32) and then restart goes, the vnetd service fails to come up. So far this issue has been observed only on this invader; the other invaders participating in regression have vnetd up and running after adding 16K routes and restarting goes.

Steps to reproduce

  • Copy the interface file containing 16K static routes (kept under /home/sandeep) to /etc/network/interfaces

root@invader32:/home/sandeep# cp 16k_static_route_interfaces /etc/network/interfaces

  • Execute the following commands to bring the interfaces down and up and restart the goes services
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart

  • Notice that vnetd fails to come up; goes status reports the following (a scripted version of these steps is sketched after the output).

root@invader32:/home/sandeep# goes status
GOES status

Mode - XETH
PCI - OK
Check daemons - OK
Check Redis - OK
Check vnet - Not OK
status: vnetd daemon not responding
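
For reference, a rough scripted version of the reproduction steps above; the 60-second poll window is an illustrative choice here, not a value taken from the regression script.

    #!/bin/sh
    # Reproduce: install the 16K-route interfaces file, bounce the vnet
    # interfaces, restart goes, then poll "goes status" until vnetd responds.
    set -e
    cp 16k_static_route_interfaces /etc/network/interfaces
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart
    i=0
    while [ "$i" -lt 60 ]; do
        if ! goes status | grep -q 'vnetd daemon not responding'; then
            echo "vnetd responded after ${i}s"
            exit 0
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "vnetd still not responding after 60s" >&2
    exit 1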

@sandeep-dutta (Author)

Attached is the journalctl output for i-32 for the last 5 minutes.
journalctl_i32.txt

stigt (Collaborator) commented Jan 15, 2019

It's well known that if we go over the TCAM limit, vnet will call panic and crash. Did you see a vnet stack trace in /var/log/syslog to confirm whether this is the known panic? Of course we need to do something better than crash when there are too many TCAM entries, but I don't think that's in place yet.

@sandeep-dutta (Author)

We have not seen any panic trace in syslog for this issue. In this case we have not gone over the TCAM limit; we have used 16035 TCAM entries to store these routes.

sandeep@invader32:~$ ip route |wc -l
16035

kgkannan commented Jan 17, 2019

A couple of inferences based on the TH spec and the current goes driver support for the L3DEFIP (TCAM) table:

  1. L3DEFIP is currently configured to support a max of 16K entries (each entry is actually 2 half-entries; a half-entry is a 32b entry), so the physical TCAM limit is 8K rows, and each row can accommodate 2 32b entries => 16K 32b entries (see the quick check after this list).
  2. The goes driver (like the SDK) internally handles LPM ordering of the various prefix lengths; there can be corner/limit cases depending on the prefix lengths of the entries being added.
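
For reference, a quick back-of-the-envelope check of the numbers in point 1, assuming 8K physical rows and 2 32b half-entries per row, against the 16035 routes counted earlier in the thread:

    rows=8192                            # physical L3DEFIP TCAM rows
    per_row=2                            # 32b half-entries per row
    echo $(( rows * per_row ))           # 16384 max 32b (/32) entries
    echo $(( rows * per_row - 16035 ))   # 349 entries of headroom for this test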

Questions for the SQA team:
Please share the sequence and types of entries, with their prefix lengths, used in the test case.

Probable next steps for dev team:

  1. Check whether there are s/w counters at the vnet/fib level and the fe1 level that can be dumped after the add; this may need an instrumented image, as logging could be disabled at compile time.
  2. Force a core and dump these counters from the core file (one generic way is sketched below).
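
One generic way to do step 2, sketched on the assumption that vnetd is the Go daemon process and that GOTRACEBACK=crash can be set in its environment; the gcore alternative only assumes gdb is installed:

    ulimit -c unlimited
    pid=$(pgrep -f vnetd | head -1)
    # With GOTRACEBACK=crash in vnetd's environment, SIGABRT dumps all
    # goroutine stacks and leaves a core file for later inspection.
    kill -ABRT "$pid"
    # Alternative that does not kill the daemon (requires gdb's gcore):
    # gcore -o /tmp/vnetd.core "$pid"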

@sandeep-dutta (Author)

Hi Govind,
Please find attached the 16K interface file, which contains 16K routes. The prefix length for all static route entries is /32 (a hypothetical generator for a file of this shape is sketched below the attachment).
16k_static_route_interfaces.txt
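
For context, a hypothetical generator for an interfaces file of this shape; the interface name, addresses, and next hop below are made up, and only the /32 pattern and the ~16K route count match the attached file:

    {
      echo "auto xeth1"
      echo "iface xeth1 inet static"
      echo "    address 10.0.0.1/24"
      # 64 * 256 = 16384 host routes, all /32, via a single next hop
      for i in $(seq 0 63); do
        for j in $(seq 0 255); do
          echo "    up ip route add 20.${i}.${j}.1/32 via 10.0.0.2"
        done
      done
    } > 16k_static_route_interfaces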

@sandeep-dutta (Author)

The issue is again reproducible on i-32. Attaching logs captured by the show_tech.py script.

Current goes version running on i-32
root@invader32:/tmp/log# goes version
v1.2.0-rc.1

root@invader32:/tmp/log# dpkg --list |grep kernel
4.13.0-170-ga4eca81e3486
20190129_014719_069.zip

rondv (Contributor) commented Jan 30, 2019

Govind is working on this (unable to update assignee)

kgkannan commented Feb 1, 2019

Quick update on debugging:

  1. The problem reported is not a crash but a timeout failure reported by the script that performs the config steps to add 16K static routes followed by a goes restart. The script expects the goes-vnetd status to be OK within 40s (10s + 30s grace time).
  2. The problem was noticed frequently on inv32 but not seen on a similar node in the regression testbed.
  3. In addition, since the show-tech logs include syslog/journalctl, we noticed vnet-fib errors in the adj path:
    adjacency.go mpAdjForAdj: index out of range adj AdjNil

Inferences:

  1. After a goes restart, in addition to other fdb events, vnetd should get 16K+ fdb (route + neighbor) events from the kernel and then program 4*16K fib entries (fib and adj for 4 pipes) in TH via DMA writes. From repro attempts, Linux top shows vnet in a tight loop as expected for the I/O, but it mostly returns status OK within 10-15s; we did not see a case where it took > 30s.
  2. We noticed adj errors too on a few attempts but need additional vnet logs enabled; when the adj errors happen, fe1 does not show 16K routes programmed - it is possible the hang is avoided because vnet bails out with the adj error.

TBD:

  1. Working with Sandeep to isolate test environment variables, if any, and recreate the problem consistently.
  2. With baseline/consistent steps, try a reduced scale to debug better (with fewer logs and a smaller dataset).
  3. Get more vnet/fdb logs to debug the adj errors and triage further; get the exact goes version/tag to build an instrumented image with build flags.

@sandeep-dutta (Author)

Hi Govind,

While executing regression with the following GoES and kernel versions, we noticed that the vnetd service on i-30 (172.17.2.30) was not OK.

root@invader30:/tmp/log# goes version
v1.2.0-rc0
root@invader30:/tmp/log# dpkg --list |grep kernel
ii kmod 18-3 amd64 tools for managing Linux kernel modules
ii libdrm2:amd64 2.4.58-2 amd64 Userspace interface to kernel DRM services -- runtime
ii linux-image-4.13.0-platina-mk1 4.13.0-178-g13e3790c8eac amd64 Linux kernel, version 4.13.0-platina-mk1
ii rsyslog 8.4.2-1+deb8u2 amd64 reliable system and kernel logging daemon

root@invader30:/tmp/log# goes status
GOES status

Mode - XETH
PCI - OK
Check daemons - OK
Check Redis - OK
Check vnet - Not OK
status: vnetd daemon not responding

The step that caused the failure was bringing the interfaces down and up along with a goes restart after executing the 16K static route test case:

ifdown -a --allow vnet
ifup -a --allow vnet
goes restart

I manually tried running the commands, but vnetd failed to come up.

Test case steps

  • Copy the custom interface file with 16K static routes to /etc/network/interfaces

  • Execute the following commands to bring the interfaces down and up along with a goes restart
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart

  • Validate that 16K routes are published under Linux (ip route sh) and FRR (show ip route); see the validation sketch after this list

  • Once the results are validated, replace the 16K custom interface file with the default interface file

  • Execute the following commands to bring the interfaces down and up along with a goes restart
    ifdown -a --allow vnet
    ifup -a --allow vnet
    goes restart

  • vnetd fails to come up on i-30
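
A minimal sketch of the validation step above, assuming FRR's vtysh is installed and using the 16035 route count from the earlier comment as the expected value; counting "via" lines in the FRR output is only a rough approximation:

    expected=16035
    kernel=$(ip route show | wc -l)
    frr=$(vtysh -c 'show ip route' | grep -c 'via')
    echo "expected ${expected}  kernel ${kernel}  frr ${frr}"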

However, the issue did not occur on any of the other invaders of setup-1 (i-29, i-31 & i-32). I have not rebooted the invader and have kept it as is, so you could take a look at it. I will try to see if I can reproduce this on any other invader.

Please find attached the logs generated via the show_tech.py script.
20190208_053642_837.zip

rondv removed their assignment Jul 24, 2019