-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[portsyncd] fix default select timeout #1287
Conversation
Default select timeout was set to 1 ms, causing 3-5% CPU usage when idles. Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
Needed for 201811, 201904, 201911 and master |
lgtm, @zhenggen-xu to check since you had some changes in portsyncd for dpb? |
cout << "Get port configuration from ConfigDB..." << endl; | ||
SWSS_LOG_ENTER(); | ||
|
||
SWSS_LOG_NOTICE("Getting port configuration from ConfigDB..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this part? This is one time cout
at the beginning, no harm to cout. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SWSS_LOG_* is a usual way to do logging in SONiC, why leave it to be cout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is always good to improve the code, but I feel we should keep consistency. If we are changing the style in this function that is not touched by your core logic, we should make other places in this file to be consistent. like line 112 etc. Or we could leave all these improvements in a separate PR, and address the whole file. This way it would be easier to maintain commits across branches etc.
|
||
if (ret == Select::TIMEOUT) | ||
if (temps == static_cast<Selectable*>(&netlink)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (temps == static_cast<Selectable*>(&netlink)) [](start = 12, length = 48)
The behavior is changed and suspicious? Could you focus on "fix default select timeout" in this PR? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is suspicious here? On netlink message, check wether we need to send PortInitDone. Previously this check was done every 1ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It not necessarily mean your change is wrong. I mean the purpose of this PR is to "fix default select timeout".
Let's assume if you don't change the behavior here, this check will be done every 1000ms. Anything wrong? If there is a bug, I prefer fix it in a separate PR.
In reply to: 420307407 [](ancestors = 420307407)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is kind of another problem. That's why I assume the select timeout was set to very small value. If we increase select timeout to 1 sec, the check will be performed every 1 sec. This means there might be 1 sec delay in notifying system about PortInitDone. To avoid this unneccesary delay, we check if select was unblocked due to netlink message and do the PortInitDone check. So, I think that issue is strongly related to select timeout and don't see a reason to fix it in a seperate PR. I hope that sounds reasonable to fix in one PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g_portSet
is decreased by LinkSync::onMsg(). Not sure the relationship between the timing of that and selecting netlink
. Is it possible to have time risk?
In reply to: 421040960 [](ancestors = 421040960,420307407)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LinkSync::onMsg is called in NetLink::readData() in select(), so g_portSet will be updated after select()
@@ -20,6 +20,8 @@ | |||
using namespace std; | |||
using namespace swss; | |||
|
|||
#define DEFAULT_SELECT_TIMEOUT 1000 /* ms */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000 [](start = 31, length = 4)
Is it too slow? How about CPU usage @200 ms? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any number of ms should be good here, I don't see a problem with 1000 ms, and this is default for other daemons. Why it is to slow? Which tasks portsyncd needs to do every DEFAULT_SELECT_TIMEOUT ms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the code, this number is not really determining the frequency of checking the events, since whenever you timeout, you call select again without doing anything, so it is almost always checking the events, thus no slowness concerns. The code in the review will have similar behavior with bigger timeout even with -1. But I think setting to 1 is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As comments
Overall, LGTM, just had two comments. |
This change is an enhancement. I'll leave it out for 201811. |
Not taking for 201911 at this stage. |
Default select timeout was set to 1 ms, causing 3-5% CPU usage when
idles.
Signed-off-by: Stepan Blyschak stepanb@mellanox.com
What I did
I changed default select timeout from 1ms to 1000ms.
Why I did it
portsyncd is causing 3-5% CPU usage when system idles.
How I verified it
Basic sanity verification on T0 topology
Details if related