[warmboot] Recovery path failed when syncd hits "Runtime error: :- translateVidToRid" #8722
Comments
The earliest issue I see is this one.
this test. the image is
Fix #8722

retreat two commits which cause warm reboot regression

* eb79ca4 2021-09-01 | [pbh]: Add PBH OA (#1782) [Nazarii Hnydyn]
* 3d6b1f0 2021-08-31 | [buffer orch] Bugfix: Don't query counter SAI_BUFFER_POOL_STAT_XOFF_ROOM_WATERMARK_BYTES on a pool where it is not supported (#1857) [Stephen Sun]

Signed-off-by: Guohan Lu <lguohan@gmail.com>
@stephenxs FYI
Is this the whole log, uncut? This seems terribly wrong: if only buffer pools are discovered, there should be a whole list of other objects as well.
This is caused by clear_stats on buffer profile during init view mode.
This PR: sonic-net/sonic-swss#1857 introduced the call Kamil mentioned above.
Hi @yxieca
I added a PR that should print a clear message and fail in OA on those APIs: sonic-net/sonic-sairedis#930
…T_XOFF_ROOM_WATERMARK_BYTES on a pool where it is not supported (#1857)" (#1945)

This reverts commit 3d6b1f0.
Fix sonic-net/sonic-buildimage#8893

What I did
This commit had earlier caused issue on master image warmboot - sonic-net/sonic-buildimage#8722
To fix this issue, this PR was created to retreat sonic-swss head on buildimage - sonic-net/sonic-buildimage#8732
Now, this commit was again pulled into sonic-buildimage as part of sonic-swss submodule advance: sonic-net/sonic-buildimage#8839
And, warm-reboot again broke for the same reason.
This change is so that any other swss submodule update on buildimage will not fail warmboot again.
What I did
Don't handle buffer pool watermark during warm reboot reconciling.

Why I did it
This is to fix the community issues sonic-net/sonic-sairedis#862 and sonic-net/sonic-buildimage#8722.

How I verified it
Perform a warm reboot. Check whether buffer pool watermark handling is skipped during reconciling and handled after it. Other watermark handling is handled during reconciling as it was before.

Details if related
The warm reboot flow is like this:
- System starts.
- Orchagent fetches the items from database stored before warm reboot and pushes them into m_toSync of all orchagents. This is done by bake, which can be overridden by sub orchagent.
- All sub orchagents handle the items in m_toSync. At this point, any notification from redis-db is blocked.
- Warm reboot converges.
- Orchagent starts to handle notifications from redis-db.

The fix is like this: in FlexCounterOrch::bake, the buffer pool watermark handling is skipped.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
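The warm reboot flow described in the commit message above can be illustrated with a toy model. This is only a sketch in Python, not the actual sonic-swss C++ code: the class names `OrchSketch` and `FlexCounterOrchSketch`, the `on_notifications_unblocked` hook, the `warm_reboot` driver, and the table key strings are all assumptions made for illustration. It shows the idea the commit describes: a sub orchagent overrides bake so that the buffer pool watermark item is not queued during reconciling and is only handled once notifications are unblocked.

```python
from collections import deque

# Assumed key name for illustration; not taken from the issue text.
BUFFER_POOL_WATERMARK = "BUFFER_POOL_WATERMARK"


class OrchSketch:
    """Toy stand-in for an orchagent with an m_toSync-style queue."""

    def __init__(self):
        self.to_sync = deque()  # plays the role of m_toSync
        self.handled = []       # keys processed so far, in order

    def bake(self, saved_items):
        # Default behaviour: queue every item saved before the warm reboot.
        for key, value in saved_items:
            self.to_sync.append((key, value))

    def drain(self):
        # Handle everything currently queued.
        while self.to_sync:
            key, _value = self.to_sync.popleft()
            self.handled.append(key)

    def on_notifications_unblocked(self):
        # Hypothetical hook: nothing to do in the base class.
        pass


class FlexCounterOrchSketch(OrchSketch):
    """Overrides bake to skip buffer pool watermark during reconciling."""

    def __init__(self):
        super().__init__()
        self.deferred = []

    def bake(self, saved_items):
        for key, value in saved_items:
            if key == BUFFER_POOL_WATERMARK:
                # Skipped during reconciling; kept for later handling.
                self.deferred.append((key, value))
            else:
                self.to_sync.append((key, value))

    def on_notifications_unblocked(self):
        # After convergence the deferred item arrives like a normal update.
        self.to_sync.extend(self.deferred)
        self.deferred.clear()


def warm_reboot(orch, saved_items):
    """Return (keys handled while reconciling, keys handled after converging)."""
    orch.bake(saved_items)             # restore saved state into the queue
    orch.drain()                       # reconcile; notifications are blocked
    during_reconcile = list(orch.handled)
    orch.on_notifications_unblocked()  # warm reboot has converged
    orch.drain()
    after_converge = orch.handled[len(during_reconcile):]
    return during_reconcile, after_converge
```

Calling `warm_reboot(FlexCounterOrchSketch(), saved)` with a saved list that contains the buffer pool watermark key returns that key only in the second list, which mirrors the verification step in the commit message: the buffer pool watermark is skipped during reconciling and handled after it, while other items are handled during reconciling as before.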
Hi
Description
Syncd sends a switch shutdown request to OA after "Runtime error: :- translateVidToRid: unable to get RID for VID".
Orchagent cannot process this request gracefully while warm recovery is in progress, which leads to a SIGSEGV with a core dump.
Steps to reproduce the issue:
Describe the results you received:
SAI redis:
Describe the results you expected:
Output of show version:
SAI version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):