Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[201911] Increase SAI Sync operation timeout from 1 min to 6 min. #828

Merged
merged 1 commit into from
May 5, 2021

Conversation

abdosi
Copy link
Contributor

@abdosi abdosi commented May 4, 2021

What I did:
Increase SAI Sync operation timeout from 1 min to 6 min by reverting #472

Why I did:
Issue as mention sonic-net/sonic-buildimage#3832 is seen when there is link/port-channel flaps towards T2 which results in sequence of this operation on T1 topology:-

Link Going down

  • Next Hop Member removal from Nexthop Group
  • Creation of New Nexthop Group
  • All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes)

Link Going up

  • Creation of New Nexthop Group
  • All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes)

Above Sequence with flaps create many SAI operation of CREATE/SET and if SAI is slow to process them if there in any intermediate Get operation (common GET case being CRM polling at default 5 mins) will result timeout if there is no response in 1 min which will cause next Get operation to fail because of attribute mismatch as it make Request and Reponse are out of sync.

To fix this:-

  • Have increase the timeout to 6 min. Even 201811 image/branch is having same timeout value.

  • Hard to predict what is good timeout value but based on below experiment and since 201811 is having 6 min using that value.

How I verify:

  • Ran the below script for 800 iteration and orchagent did not got timeout error and no crash. Without fix OA used to crash consistently with most of the time less than 100 iteration of flaps.
  • Based on sairedis.rec of the first CRM GET operation between Create/Set operation in these 800 iteration worst time for GET response was see 4+ min.
#!/bin/bash

j=0
while [ True ]; do
        echo "iteration $j"
        config interface shutdown PortChannel0002
        sleep 2
        config interface startup PortChannel0002
        sleep 2
        j=$((j+1))
done
admin@str-xxxx-06:/var/log/swss$ sudo zgrep -A 1 "|g|.*AVAILABLE_IPV4_ROUTE" sairedis.rec*
sairedis.rec:2021-05-04.17:40:13.431643|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:18.132115|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec:--
sairedis.rec:2021-05-04.17:46:21.128261|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:31.258628|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec.1:2021-05-04.17:30:13.900135|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:34:47.448092|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.1:--
sairedis.rec.1:2021-05-04.17:35:12.827247|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:35:32.939798|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52533
sairedis.rec.2.gz:2021-05-04.17:20:13.103322|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:24:47.796720|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.2.gz:--
sairedis.rec.2.gz:2021-05-04.17:25:15.073981|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:26:08.525235|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52532

@abdosi abdosi added the Request for 201911 Branch Label for changes in master that applies for 201911 Branch also label May 4, 2021
@abdosi abdosi requested review from kcudnik, lguohan and rlhui May 4, 2021 18:57
@lguohan
Copy link
Contributor

lguohan commented May 4, 2021

we should not need this for master branch. only for 201911 branch.

@qiluo-msft , let's plan for limiting the batch size on 202012 and master branch where sync calls are executed.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
@abdosi abdosi changed the base branch from master to 201911 May 4, 2021 23:07
@abdosi
Copy link
Contributor Author

abdosi commented May 4, 2021

we should not need this for master branch. only for 201911 branch.

@qiluo-msft , let's plan for limiting the batch size on 202012 and master branch where sync calls are executed.

@lguohan Updated PR target to 201911 branch.

@abdosi abdosi removed the Request for 201911 Branch Label for changes in master that applies for 201911 Branch also label May 4, 2021
@abdosi abdosi changed the title Increase SAI Sync operation timeout from 1 min to 6 min. [201911] Increase SAI Sync operation timeout from 1 min to 6 min. May 4, 2021
@qiluo-msft
Copy link
Contributor

ACK


In reply to: 832297988

@lguohan lguohan merged commit e438b0d into sonic-net:201911 May 5, 2021
@abdosi
Copy link
Contributor Author

abdosi commented May 18, 2021

@liat-grozovik

Please use this PR description to validate SAI improvement

kktheballer pushed a commit to kktheballer/sonic-sairedis that referenced this pull request Jul 21, 2021
What I did:
Increase SAI Sync operation timeout from 1 min to 6 min by reverting sonic-net#472

Why I did:
Issue as mention sonic-net/sonic-buildimage#3832 is seen when there is link/port-channel flaps towards T2  which results in sequence of this operation on T1 topology:-

Link Going down
- Next Hop Member removal from Nexthop Group
- Creation of New Nexthop Group
- All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes)

Link Going up
- Creation of New Nexthop Group
- All the Routes towards T2 are given SET Operation with new Nexthop Group (6000+ routes)

Above Sequence with flaps create many SAI operation of CREATE/SET and if SAI is slow to process them if there in any intermediate Get operation (common GET case being CRM polling at default 5 mins) will result timeout if there is no response in 1 min which will cause next Get operation to fail because of attribute mismatch as it make Request and Reponse are out of sync. 

To fix this:-

- Have increase the timeout to 6 min.  Even 201811 image/branch is having same timeout value. 

- Hard to predict what is good timeout value but based on  below experiment and since 201811 is having 6 min using that value.

How I verify:

- Ran the below script for 800 iteration and orchagent did not got timeout error and no crash. Without fix OA used to crash consistently with most of the time less than 100 iteration of flaps.
- Based on sairedis.rec of the first CRM `GET` operation between `Create/Set` operation in these 800 iteration worst time for GET response was see 4+ min. 
```
#!/bin/bash

j=0
while [ True ]; do
        echo "iteration $j"
        config interface shutdown PortChannel0002
        sleep 2
        config interface startup PortChannel0002
        sleep 2
        j=$((j+1))
done
```


```
admin@str-xxxx-06:/var/log/swss$ sudo zgrep -A 1 "|g|.*AVAILABLE_IPV4_ROUTE" sairedis.rec*
sairedis.rec:2021-05-04.17:40:13.431643|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:18.132115|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec:--
sairedis.rec:2021-05-04.17:46:21.128261|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:31.258628|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec.1:2021-05-04.17:30:13.900135|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:34:47.448092|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.1:--
sairedis.rec.1:2021-05-04.17:35:12.827247|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:35:32.939798|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52533
sairedis.rec.2.gz:2021-05-04.17:20:13.103322|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:24:47.796720|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.2.gz:--
sairedis.rec.2.gz:2021-05-04.17:25:15.073981|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:26:08.525235|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52532
```


Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants