Warm reboot: How the FlexCounter was handled after init view? #862
Comments
Because there is no RID in INIT_VIEW, for objects whose VID may change between boots, does it mean the orchagent can't do the following for them?
Hey, I don't know exactly how flex counters are handled in orchagent, but subscribing to flex counters should be done after apply view, regardless of whether this is a cold or warm boot.
But currently, in OA warmboot process,
This is a design problem in OA and should be addressed on the OA side. sairedis/syncd cannot know which RID corresponds to which VID before apply view in warm boot mode.
@lguohan could you please help check this issue?
@stephenxs can you please have a look?
Any update for this issue?
I'm checking and will share any findings once I have them.
Any progress?
There is no guarantee that the OIDs of buffer pools keep unchanged across warm reboots. It will always regenerate the counter ID list for buffer pools by calling The issue is that the orchagent should NOT clear counters before When the system starts, From the attached
Even though there is no guarantee that Can you check in your platform
In case the logic exists but
@stephenxs You mentioned
Actually it is possible. The orchagent warm-reboot includes below steps. (check the code in
So you could take advantage of Step 1 and Step 4 to make sure the guarantee holds.
- What I did
Don't handle buffer pool watermark during warm reboot reconciling.
- Why I did it
This is to fix the community issues sonic-net/sonic-sairedis#862 and sonic-net/sonic-buildimage#8722.
- How I verified it
Perform a warm reboot. Check whether buffer pool watermark handling is skipped during reconciling and handled after it. Other watermark handling is handled during reconciling as it was before.
- Details if related
The warm reboot flow is like this:
  System starts.
  Orchagent fetches the items from database stored before warm reboot and pushes them into m_toSync of all orchagents. This is done by bake, which can be overridden by sub orchagent.
  All sub orchagents handle the items in m_toSync. At this point, any notification from redis-db is blocked.
  Warm reboot converges.
  Orchagent starts to handle notifications from redis-db.
The fix is like this: in FlexCounterOrch::bake, the buffer pool watermark handling is skipped.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
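The flow in this commit message can be illustrated with a small Python model. This is only a sketch under assumptions, not the swss implementation: `ToyFlexCounterOrch`, `DEFERRED_GROUPS`, `reconcile`, and `on_warm_reboot_converged` are hypothetical names, the `PORT` and `QUEUE_WATERMARK` entries are sample data, and the `deferred` queue is an assumed stand-in for however the skipped item is picked up after convergence. Only `bake`, `m_toSync`, and the buffer pool watermark handling come from the discussion.

```python
from collections import deque

# Sample group names for illustration; only the buffer pool watermark
# group is named in the thread.
BUFFER_POOL_WATERMARK = "BUFFER_POOL_WATERMARK"
DEFERRED_GROUPS = {BUFFER_POOL_WATERMARK}


class ToyFlexCounterOrch:
    """Toy model of an orch whose bake() skips selected groups."""

    def __init__(self, stored_items):
        self.stored_items = dict(stored_items)  # state saved before warm reboot
        self.to_sync = deque()    # stands in for m_toSync
        self.deferred = deque()   # assumption: items postponed past reconciling
        self.handled = []         # (phase, group) in processing order

    def bake(self):
        # Push stored items into to_sync, except the deferred groups.
        for group, value in self.stored_items.items():
            if group in DEFERRED_GROUPS:
                self.deferred.append((group, value))
            else:
                self.to_sync.append((group, value))

    def reconcile(self):
        # Drain to_sync; in the described flow, redis-db notifications
        # are still blocked at this point.
        while self.to_sync:
            group, _ = self.to_sync.popleft()
            self.handled.append(("reconcile", group))

    def on_warm_reboot_converged(self):
        # After convergence, the skipped items are handled like
        # ordinary notifications.
        while self.deferred:
            group, _ = self.deferred.popleft()
            self.handled.append(("post-converge", group))


orch = ToyFlexCounterOrch({
    "PORT": "enable",
    "QUEUE_WATERMARK": "enable",
    BUFFER_POOL_WATERMARK: "enable",
})
orch.bake()
orch.reconcile()
orch.on_warm_reboot_converged()
print(orch.handled)
```

In this model the buffer pool watermark group is the only one processed in the post-converge phase, which mirrors the verification step described in the commit message.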
Guys, can we close this bug as it has been resolved by sonic-net/sonic-swss#1987?
Description
With a buffer pool defined in config db, after a warm reboot syncd can't handle the FlexCounter with the new VID in "processFlexCounterEvent".
Even after apply view, syncd would still not be able to get the counter for this new VID.
Steps to reproduce the issue:
config db:
"BUFFER_POOL": {
    "egress_pool": {
        "mode": "dynamic",
        "size": "100000",
        "type": "egress"
    }
},
Describe the results you received:
syslog after warmboot:
Jul 15 07:40:46.978278 sonic ERR syncd#syncd: :- translateVidToRid: unable to get RID for VID oid:0x18000000000647
Jul 15 07:40:46.978385 sonic WARNING syncd#syncd: :- processFlexCounterEvent: port VID oid:0x18000000000647, was not found (probably port was removed/splitted) and will remove from counters now
Jul 15 07:40:46.978474 sonic NOTICE syncd#syncd: :- removeBufferPool: Trying to remove nonexisting buffer pool 0x18000000000647 from flex counter BUFFER_POOL_WATERMARK_STAT_COUNTER
Describe the results you expected:
Error free warmboot, and buffer pool counter should be collected correctly by Flexcounter after warmboot.
Additional information you deem important (e.g. issue happens only occasionally):
Attach sai rec
sairedis.rec.zip
The buffer pool VID before warm reboot is 0x180000000005cd.
After warm reboot, in init view, the buffer pool VID is 0x18000000000647. The orchagent attaches the flex counter to the new VID, but syncd fails because there is no RID yet.
Even after apply view, syncd would still not be able to get the counter for this new VID 0x18000000000647,
because processFlexCounterEvent will not be called again for this new VID after apply view.
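The failure mode described here can be reproduced with a toy Python model. It is a sketch under assumptions, not syncd code: `ToySyncd`, its methods, and the RID value are hypothetical, and the real VID-to-RID matching at apply view is far more involved. The point it illustrates is that a subscription attempted before apply view is dropped, and nothing retries it afterwards unless the subscriber asks again.

```python
NEW_VID = 0x18000000000647   # VID seen in the syslog after warm reboot
HYPOTHETICAL_RID = 0x1234    # made-up RID, for illustration only


class ToySyncd:
    """Toy model: VIDs only get a RID once the view is applied."""

    def __init__(self):
        self.vid_to_rid = {}
        self.pending_vids = set()   # created in init view, no RID yet
        self.counters = set()       # VIDs currently being polled

    def create_in_init_view(self, vid):
        self.pending_vids.add(vid)

    def apply_view(self, rid_for_vid):
        # Every pending VID is matched to a RID when the view is applied.
        for vid in self.pending_vids:
            self.vid_to_rid[vid] = rid_for_vid[vid]
        self.pending_vids.clear()

    def process_flex_counter_event(self, vid):
        # Without a RID the event is dropped and is not retried later.
        if vid not in self.vid_to_rid:
            return False
        self.counters.add(vid)
        return True


syncd = ToySyncd()
syncd.create_in_init_view(NEW_VID)

early = syncd.process_flex_counter_event(NEW_VID)   # before apply view
syncd.apply_view({NEW_VID: HYPOTHETICAL_RID})
still_missing = NEW_VID not in syncd.counters       # nobody retried the event
late = syncd.process_flex_counter_event(NEW_VID)    # subscribe after apply view

print(early, still_missing, late)
```

The early call fails, the counter remains unsubscribed after apply view, and only a second call made after apply view succeeds, which is why the thread concludes that the subscription should happen after apply view.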