orchagent crashes due to transfer_attributes: src vs dst attr id don't match GET mismatch #3832
Comments
The SAI header has been updated to 1.5.1. Is that the reason?
I checked the log and found that the first occurrence of this error was at the beginning of November. When was the SAI header update merged?
From sairedis.rec.3: since 2019-11-14.12:05:11.162217 every GET started to time out. We can see that in syslog, and also that the FAILURE happens exactly 60 seconds after the request, so syncd was not responding to those queries. transfer_attributes only happens when the response is SUCCESS or BUFFER_OVERFLOW; it does not happen on FAILURES. In syslog we got a failure at Nov 14 12:15:04.356776, which is less than 60 seconds after 14.12:14:11.608540, so it seems that some response with SUCCESS arrived. That is one thing to fix.

If we look at syslog we will see a lot of entries "threadFunction: new span =", from which we can see how many seconds each call to a SAI API takes, and some of them exceed 30 seconds: syncd: :- timerWatchdogCallback: main loop execution exceeded 30634836 ms (this is actually usec, not ms).

My guess here is that some previous response arrived when OA was already listening for one of the next ones. The select timed out after 60 seconds, but that doesn't mean the response will never arrive; it can still arrive, because syncd doesn't know that OA timed out, so it sends the response it was already processing.
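To make the unit mix-up concrete, here is a minimal, self-contained sketch (not the actual syncd watchdog code; the measured call is a hypothetical stand-in) showing that a value computed in microseconds, such as 30634836, corresponds to roughly 30.6 seconds:

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    // Hypothetical stand-in for one SAI API call measured by the watchdog.
    auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    auto end = std::chrono::steady_clock::now();

    // The watchdog value is computed in microseconds even though the log
    // message says "ms"; 30634836 us is about 30.6 s, which is why it trips
    // the 30-second main-loop budget.
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::printf("main loop execution took %lld us (%.3f s)\n",
                static_cast<long long>(us), static_cast<double>(us) / 1e6);
    return 0;
}
```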
Got it.
One question: does sairedis handle OA's requests in one thread or in multiple threads? If it is handled in one thread, is it possible to log all the requests as well as the time spent handling each one? By doing so we can probably catch the chief culprit.
Every GET message is synchronous and is processed in a single thread; each API is under a mutex so only one access happens at a given time. Messages don't have any IDs, so if any message in the communication times out and then arrives later, every subsequent message's IDs will be mismatched and throw an exception.
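To illustrate why a late reply gets misattributed, here is a minimal sketch of that flow (the shared reply queue and helper are hypothetical, not the actual sairedis code, which uses a redis notification channel and select with a timeout):

```cpp
#include <deque>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical shared reply queue standing in for the GETRESPONSE channel;
// there is no request/response ID, only arrival order.
std::deque<std::string> g_replies;

std::optional<std::string> waitForReply(bool timedOut)
{
    // In the real code this is a select() with a 60-second timeout; here the
    // timeout outcome is passed in just for illustration.
    if (timedOut || g_replies.empty())
        return std::nullopt;

    std::string reply = g_replies.front();
    g_replies.pop_front();
    return reply;
}

int main()
{
    // GET #1: syncd is busy, orchagent times out and gives up.
    auto r1 = waitForReply(/*timedOut=*/true);
    std::cout << "GET #1: " << (r1 ? *r1 : "timeout") << "\n";

    // syncd finishes GET #1 *after* the timeout and still publishes it.
    g_replies.push_back("EGRESS_BUFFER_POOL_NUM reply (for GET #1)");

    // GET #2 (e.g. a CRM counter query) now reads the stale reply and tries
    // to deserialize it as its own answer: attribute ids don't match.
    auto r2 = waitForReply(/*timedOut=*/false);
    std::cout << "GET #2 received: " << (r2 ? *r2 : "timeout") << "\n";
    return 0;
}
```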
Is it possible to turn on logging of all the messages?
You can enable INFO or DEBUG logging for sairedis/orchagent, but it will NOT log a message that arrived late. Take a look here: https://github.com/Azure/sonic-sairedis/blob/master/lib/src/sai_redis_generic_get.cpp#L216. This was done on purpose, to log correct data only after deserialization: if a message arrives late, you could for example try to deserialize an int when the message is actually a list of OIDs, depending on what was queried. Moving that log line earlier would potentially log invalid data (messages that arrived late), but it could be done as a debug message.
I reproduced this issue recently. In my test, no SAI API takes more than 30 seconds (I don't see the log "main loop execution exceeded"). Instead, my test contains a lot of short SAI API calls, and those calls are async operations, such as remove/create route entry. My test flow is like below:
IMO, there are two things to fix:
To handle synchronous requests in a manner where a late response1 arrives while OA is at that time expecting response2, we need to add an ID to each request/response pair and only accept the expected one. Currently no ID is required, since all messages are processed in a single thread, so the order of processing messages is guaranteed. Processing many route entries on the OA side is being addressed by Qi, by adding bulk request/create for routes and next hop groups.
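A minimal sketch of the ID idea (class and function names are hypothetical, not an existing sairedis API): tag every request with a monotonically increasing sequence number, echo it in the reply, and drop any reply whose number does not match the one currently awaited:

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <utility>

// Hypothetical reply: sequence number echoed by syncd plus the payload.
using Reply = std::pair<uint64_t, std::string>;

class GetClient
{
public:
    uint64_t nextRequestId() { return ++m_lastSentId; }

    // Accept a reply only if it answers the request we are waiting for;
    // anything older is a late response from a timed-out GET and is dropped.
    std::optional<std::string> acceptReply(uint64_t awaitedId, const Reply &reply)
    {
        if (reply.first != awaitedId)
        {
            std::cerr << "dropping stale reply id " << reply.first
                      << " (awaiting " << awaitedId << ")\n";
            return std::nullopt;
        }
        return reply.second;
    }

private:
    uint64_t m_lastSentId = 0;
};

int main()
{
    GetClient client;
    uint64_t id1 = client.nextRequestId();  // this GET times out on the caller side
    uint64_t id2 = client.nextRequestId();  // the GET currently in flight

    // The stale reply for id1 arrives while we wait for id2: it is discarded
    // instead of being mis-parsed, so no attribute id mismatch is possible.
    client.acceptReply(id2, {id1, "late EGRESS_BUFFER_POOL_NUM reply"});
    auto ok = client.acceptReply(id2, {id2, "CRM counter reply"});
    std::cout << "GET #2 got: " << (ok ? *ok : "nothing") << "\n";
    return 0;
}
```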
Thanks for your quick reply. I agree that the ASIC might need to improve its performance for certain operations. Adding bulk request/create will definitely help the route entries issue. But consider the following case:
Adding entries to redis is relatively fast, but processing such entries in syncd is probably not that fast. In this situation, step 2 is likely to time out. And since syncd doesn't know that orchagent timed out, it will probably still send the response and crash orchagent. So I am not sure the current design can cover all cases, now and in the future.
@volodymyrsamotiy will sync with @Junchao-Mellanox. Thanks.
@kcudnik, if it is a timeout, then why don't we get a GET timeout message?
now that we know almost all the reproductions of the
If you still see this issue from time to time in regression, please attach the sairedis.rec.* logs and syslog.* logs from a recent failure, so I can investigate and compare.
We have monitored this issue for quite some time and found it is not reproducible any more.
I reproduced this issue again. I have saved the dump file with the command
But it is too big to upload (40M), so I only uploaded part of it. If you need the whole dump file for debugging, I can send it via email.
FW version: 13.2008.1310
SONiC version:
I hit the same issue on 201811 when I used gdb to attach to syncd for debugging for a few minutes and exited gdb later.
This might be useful for reproducing the issue easily. I don't expect we should hit an orchagent crash; a more graceful way to handle it would be better.
@kcudnik probably the fastest way of avoiding this issue is:
No, that's the wrong approach. If the attribute id is wrong, then something is wrong with the reply; that's why the exception is thrown. There is no way to make it skip and continue, since something went wrong and we don't know why.
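For reference, the check that fires here is essentially an attribute-id comparison between what was requested and what came back in the deserialized reply. A simplified sketch (not the exact sonic-sairedis source; types are reduced to just the id) looks like this:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Simplified stand-in for sai_attribute_t: only the id matters here.
struct Attr { int32_t id; };

// Sketch of the consistency check in transfer_attributes: the deserialized
// reply (src) must line up one-to-one with the attribute ids the caller
// requested (dst). A late reply to a different GET breaks this.
void transferAttributes(const std::vector<Attr> &src, std::vector<Attr> &dst)
{
    if (src.size() != dst.size())
        throw std::runtime_error("attr count mismatch");

    for (size_t i = 0; i < src.size(); ++i)
    {
        if (src[i].id != dst[i].id)
            throw std::runtime_error("src vs dst attr id don't match GET mismatch");

        // Value copy would happen here; skipping it when the ids differ is
        // unsafe because the value types may not even match.
        dst[i] = src[i];
    }
}

int main()
{
    std::vector<Attr> requested = {{42}};  // e.g. a CRM counter attribute
    std::vector<Attr> lateReply = {{7}};   // e.g. EGRESS_BUFFER_POOL_NUM

    // This throws; left uncaught in orchagent it becomes the what() message
    // seen in syslog and the process aborts.
    transferAttributes(lateReply, requested);
    return 0;
}
```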
@kcudnik Another case with syslog
The sairedis record was shared separately. Also, if you want to live debug, use gdb to attach to syncd, keep it idle for a few minutes and then exit; it should be able to reproduce the issue.
@zhenggen-xu I analysed this case (thanks for the sairedis recording). We see internal_redis_generic_get: generic get failed due to SELECT operation result: TIMEOUT, and then 22:58:49.210643 lnos-x1-a-csw03 INFO swss#supervisord: orchagent what(): :- transfer_attributes: src vs dst attr id don't match GET mismatch, but this is 22:58:49, 10 minutes later. So I think I know what happened: the select operation timed out while the GET was still in progress in syncd; orchagent then sent another GET with a new attribute and expected an answer; at that time syncd finished processing the previous request and sent the response for the previous query, which is why the attribute ids don't match. Those syslogs are from orchagent only; do you also have the syslog from syncd? There should be an information log when API execution takes too long.
I correlated the attached syslog.2 and sairedis.3 and this is what I found:
and here:
So in that time window syncd is processing those 50k create/set operations and is very busy, so before it even starts to take the GET from 2019-11-14.12:00:54 off the redis queue, that GET has already timed out. There are a lot of "new span" entries, which time each API execution, and some individual API calls take even more than 5 seconds. So the conclusion here is that the vendor SAI is not fast enough to apply those changes.
In general, we would need a transaction ID for GET operations so we could discard the mismatched ones caused by timeouts, but that probably requires quite some work. For the time being, I feel we should actually increase this 1 minute timeout in sairedis to a bigger value (it was reduced by sonic-net/sonic-sairedis#472) on branches that do not support synchronous calls yet.
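As a rough illustration of that mitigation (the constant name, location, and representation in sonic-sairedis may differ; this is only a sketch of the idea, not the actual change), it amounts to bumping a single response-timeout value used by the synchronous GET path:

```cpp
#include <chrono>

// Hypothetical names; the real sonic-sairedis code keeps this as a macro or
// constant in the GET/response path. The point is only the value change.
namespace sairedis_sketch
{
    // Before: replies later than 1 minute are treated as timeouts, so a slow
    // but otherwise successful GET turns into a stale reply for the next GET.
    // constexpr auto GET_RESPONSE_TIMEOUT = std::chrono::minutes(1);

    // After: allow up to 6 minutes, matching the 201811 branch, so CRM polls
    // issued during heavy route churn can still complete.
    constexpr auto GET_RESPONSE_TIMEOUT = std::chrono::minutes(6);
}

int main()
{
    return std::chrono::duration_cast<std::chrono::seconds>(
               sairedis_sketch::GET_RESPONSE_TIMEOUT).count() == 360 ? 0 : 1;
}
```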
What I did: Increased the SAI sync operation timeout from 1 min to 6 min by reverting #472.

Why I did it: The issue mentioned in sonic-net/sonic-buildimage#3832 is seen when there are link/port-channel flaps towards T2, which results in the following sequence of operations on a T1 topology:

Link going down
- Next hop member removal from the next hop group
- Creation of a new next hop group
- All routes towards T2 get a SET operation with the new next hop group (6000+ routes)

Link going up
- Creation of a new next hop group
- All routes towards T2 get a SET operation with the new next hop group (6000+ routes)

The above sequence with flaps creates many SAI CREATE/SET operations, and if SAI is slow to process them, any intermediate GET operation (a common GET case being CRM polling at the default 5 minutes) will time out if there is no response within 1 min, which causes the next GET operation to fail with an attribute mismatch because request and response get out of sync.

To fix this:
- Increased the timeout to 6 min. The 201811 image/branch has the same timeout value.
- It is hard to predict a good timeout value, but based on the experiment below, and since 201811 uses 6 min, that value is used.

How I verified it:
- Ran the script below for 800 iterations; orchagent did not get a timeout error and did not crash. Without the fix, OA used to crash consistently, most of the time in fewer than 100 iterations of flaps.
- Based on sairedis.rec, for the first CRM `GET` operation between `Create/Set` operations in these 800 iterations, the worst observed GET response time was 4+ min.

```
#!/bin/bash
j=0
while [ True ]; do
    echo "iteration $j"
    config interface shutdown PortChannel0002
    sleep 2
    config interface startup PortChannel0002
    sleep 2
    j=$((j+1))
done
```

```
admin@str-xxxx-06:/var/log/swss$ sudo zgrep -A 1 "|g|.*AVAILABLE_IPV4_ROUTE" sairedis.rec*
sairedis.rec:2021-05-04.17:40:13.431643|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:18.132115|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec:--
sairedis.rec:2021-05-04.17:46:21.128261|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec:2021-05-04.17:46:31.258628|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52530
sairedis.rec.1:2021-05-04.17:30:13.900135|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:34:47.448092|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.1:--
sairedis.rec.1:2021-05-04.17:35:12.827247|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.1:2021-05-04.17:35:32.939798|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52533
sairedis.rec.2.gz:2021-05-04.17:20:13.103322|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:24:47.796720|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52531
sairedis.rec.2.gz:--
sairedis.rec.2.gz:2021-05-04.17:25:15.073981|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=1538995696
sairedis.rec.2.gz:2021-05-04.17:26:08.525235|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY=52532
```

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Description
On SONiC, orchagent crashes sporadically with the error message "orchagent failed due to transfer_attributes: src vs dst attr id don't match GET mismatch".
It looks like a sai-redis issue. I guess the flow was: someone tried fetching attribute SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM from sai-redis, but sai-redis was busy handling some other request or got stuck handling this one, blocking all other requests; later, when crm tried fetching some counters, the SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM response returned, causing the mismatch error.
Steps to reproduce the issue:
With crm and counters enabled, this issue can take place sporadically.
Describe the results you received:
Error logs
Coredump analysis
Most of the coredumps share the same characteristics:
Assumptions
The following is my assumption (which is not proven).
Someone was trying to fetch attr SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM, but somehow it took sairedis a very long time to get the result, causing the fetching procedure to time out and return.
Just after that, CrmOrch tried fetching counters via redis. The first few fetches still failed, and then the previously issued SAI_SWITCH_ATTR_EGRESS_BUFFER_POOL_NUM response suddenly returned while CrmOrch was fetching some other counter, causing it to fail.
I think it's something like the following sequence chart
Coredump details
I analyzed 3 coredumps. They're similar:
1.
Describe the results you expected:
No errors in log and no crash in orchagent.
Additional information you deem important (e.g. issue happens only occasionally):
SONiC Software Version: SONiC.HEAD.135-5e6f8adb
Distribution: Debian 9.11
Kernel: 4.9.0-9-2-amd64
Build commit: 5e6f8ad
Build date: Wed Nov 27 08:28:41 UTC 2019
Built by: johnar@jenkins-worker-4
Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
Serial Number: MT1848K10623
Uptime: 17:50:24 up 6 min, 0 users, load average: 4.90, 4.84, 2.44
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-syncd-mlnx HEAD.135-5e6f8adb d366b0df8d28 373MB
docker-syncd-mlnx latest d366b0df8d28 373MB
docker-fpm-frr HEAD.135-5e6f8adb b33bc6d11daa 321MB
docker-fpm-frr latest b33bc6d11daa 321MB
docker-sflow HEAD.135-5e6f8adb 1b7f547ef006 305MB
docker-sflow latest 1b7f547ef006 305MB
docker-lldp-sv2 HEAD.135-5e6f8adb 3336bf112187 299MB
docker-lldp-sv2 latest 3336bf112187 299MB
docker-dhcp-relay HEAD.135-5e6f8adb 509dccee71ec 289MB
docker-dhcp-relay latest 509dccee71ec 289MB
docker-database HEAD.135-5e6f8adb 8fa2a47bab7a 281MB
docker-database latest 8fa2a47bab7a 281MB
docker-snmp-sv2 HEAD.135-5e6f8adb 86b9afca830b 335MB
docker-snmp-sv2 latest 86b9afca830b 335MB
docker-orchagent HEAD.135-5e6f8adb 6d6a3c9344a0 322MB
docker-orchagent latest 6d6a3c9344a0 322MB
docker-teamd HEAD.135-5e6f8adb 1ec23d161745 304MB
docker-teamd latest 1ec23d161745 304MB
docker-sonic-telemetry HEAD.135-5e6f8adb c1d93fb4f5c6 304MB
docker-sonic-telemetry latest c1d93fb4f5c6 304MB
docker-router-advertiser HEAD.135-5e6f8adb 1cbfbdce9acd 281MB
docker-router-advertiser latest 1cbfbdce9acd 281MB
docker-platform-monitor HEAD.135-5e6f8adb 3fb6f774b03c 565MB
docker-platform-monitor latest 3fb6f774b03c 565MB
sairedis.rec.3.gz
sairedis.rec.2.gz
logs.tar.gz
syslog.2.gz