
[Scale 10k arp] Traffic drop while arp is being learned #21645

Open · DavidZagury opened this issue Feb 6, 2025 · 6 comments
Labels: Triaged (this issue has been triaged)

@DavidZagury (Contributor)

A recent PR (#20211) enabled Receive Packet Steering (RPS) by default. It caused a degradation that can be seen between the 202405 and 202411 branches. When reverting that PR, or disabling RPS by removing the Linux configuration it sets (there is no user-friendly way to do so), the test passes.
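(For anyone trying to reproduce the workaround: the kernel exposes RPS per receive queue in sysfs, so it can be disabled by clearing the CPU mask. A rough sketch, with the interface name as an example:)

```
# Clear the RPS CPU mask on every rx queue of one front-panel port
# (a mask of 0 disables RPS for that queue; repeat per port under test)
for q in /sys/class/net/Ethernet176/queues/rx-*/rps_cpus; do
    echo 0 > "$q"
done
```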

Steps to reproduce

Configure L2 and L3 interfaces:

```
config vlan add 363
config vlan member add -u 363 Ethernet176
config interface ip add Vlan363 101.1.0.1/24
config interface ip add Ethernet160 100.1.0.1/16
config save -y
```
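(A quick sanity check that the configuration above took effect, using standard SONiC show commands:)

```
show vlan brief      # Vlan363 should list Ethernet176 as an untagged member
show ip interfaces   # expect 101.1.0.1/24 on Vlan363 and 100.1.0.1/16 on Ethernet160
```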

Modify the CoPP policy to raise the rate limits, then reboot:

```
sed -i '/"default": {/,/}/{s/"cir": "[0-9]\+"/"cir": "60000"/; s/"cbs": "[0-9]\+"/"cbs": "60000"/}' /usr/share/sonic/templates/copp_cfg.j2
sed -i '/"queue4_group2": {/,/}/{s/"cir": "[0-9]\+"/"cir": "60000"/; s/"cbs": "[0-9]\+"/"cbs": "60000"/}' /usr/share/sonic/templates/copp_cfg.j2
reboot
```
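(It may be worth confirming the sed edits actually landed in the template before rebooting; a simple check:)

```
# The cir/cbs values for both trap groups should now read 60000
grep -A 6 '"default"' /usr/share/sonic/templates/copp_cfg.j2
grep -A 6 '"queue4_group2"' /usr/share/sonic/templates/copp_cfg.j2
```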

Raise the neighbor-table gc_thresh values:

```
sysctl -w net.ipv4.neigh.default.gc_thresh1=65535
sysctl -w net.ipv4.neigh.default.gc_thresh2=65535
sysctl -w net.ipv4.neigh.default.gc_thresh3=65535
```
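(These are runtime settings and reset on reboot, which is why they are applied after the reboot in the previous step; reading them back confirms they took effect:)

```
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3
```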

Clear the ARP table, then start traffic at 10000 pps with 10000 end hosts:

```
sonic-clear arp
```

Send bidirectional L3 traffic (simulating a single host pinging 10k hosts on the other side).
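(One way to watch learning progress from the DUT while traffic runs is to poll the kernel neighbor table; a rough sketch:)

```
# Count IPv4 neighbors learned on the receiving VLAN, refreshed every second
watch -n 1 'ip -4 neigh show dev Vlan363 | wc -l'
```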

Observed behavior

A small percentage of traffic (<1%) is lost.

Expected behavior

There should not be any losses.

@bingwang-ms (Contributor)

@prabhataravind Can you help take a look?

prabhataravind self-assigned this Feb 10, 2025
@prabhataravind (Contributor)

@DavidZagury why are we updating the default copp policer limits? Do you see issues with the default sonic policer config as well?

Also, do you notice any kernel errors in syslog related to RPS? Do you mind sharing the output of `cat /proc/net/softnet_stat` in the good and bad cases?

Also, do you see any ksoftirqd kernel threads consuming a lot of CPU during ARP packet processing?
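(For reference, a couple of generic ways to check this; `pidstat` comes from the sysstat package and may not be preinstalled on the switch:)

```
# Per-second CPU usage of all ksoftirqd threads
pidstat -p "$(pgrep -d, ksoftirqd)" 1

# Or sample the NET_RX softirq counters one second apart
grep NET_RX /proc/softirqs; sleep 1; grep NET_RX /proc/softirqs
```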

prabhataravind added the Triaged label Feb 11, 2025
@DavidZagury (Contributor, Author)

> @DavidZagury why are we updating the default copp policer limits? Do you see issues with the default sonic policer config as well?

Will check with the test owner.

> Also, do you notice any kernel errors in syslog related to RPS? Do you mind sharing the output of `cat /proc/net/softnet_stat` in the good and bad cases?

After drops have been seen:

```
# cat /proc/net/softnet_stat
0007d899 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00006e27 00000000 00000000 00000000
0008ac2c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000715c 00000000 00000000 00000001
00088a35 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00006e8f 00000000 00000000 00000002
00083a14 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003
0008059a 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00006e35 00000000 00000000 00000004
00084732 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000698b 00000000 00000000 00000005
00098618 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00014bdf 00000000 00000000 00000006
00095a68 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00006eac 00000000 00000000 00000007
```
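(The counters are hex and one row is printed per CPU. A quick GNU awk one-liner to decode the columns of interest on this kernel; field 1 is packets processed, field 2 is drops, and field 10 is packets received via RPS:)

```
awk '{ printf "cpu%d: processed=%d dropped=%d rps=%d\n",
       NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$10) }' /proc/net/softnet_stat
```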

> Also, do you see any ksoftirqd kernel threads consuming a lot of CPU during ARP packet processing?

I haven't been able to see such an increase in consumption, but the ARP learning process is quite fast; I'm not sure regular tools would catch a brief spike that disappears once all the needed ARPs have been learned.

@dprital (Collaborator) commented Feb 19, 2025

> @DavidZagury why are we updating the default copp policer limits? Do you see issues with the default sonic policer config as well?
>
> Also, do you notice any kernel errors in syslog related to RPS? Do you mind sharing the output of `cat /proc/net/softnet_stat` in the good and bad cases?
>
> Also, do you see any ksoftirqd kernel threads consuming a lot of CPU during ARP packet processing?

Hi @prabhataravind,

Regarding the question you raised, "why are we updating the default copp policer limits? Do you see issues with the default sonic policer config as well?":

By default, CoPP limits traffic to the CPU to 1000 fps. We increased that limit in order to measure the real ARP learning rate, targeting 10K ARPs per second; without the CoPP change we would be capped at 1K ARP/sec. The test is about ARP learning rate (performance).

@prabhataravind (Contributor)

@DavidZagury I don't see drops on any core in the output you shared above (the second column of `/proc/net/softnet_stat` is the per-core drop counter).

@prabhataravind (Contributor)

@dprital OK, but in production use cases we will never encounter this situation unless someone changes the default policing rate for ARP. Could you please repeat the test with the default rate and check whether there are drops when learning 1K ARP/sec?
