Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[portsyncd] fix default select timeout #1287

Merged
merged 1 commit into from
May 13, 2020

Conversation

stepanblyschak
Copy link
Contributor

Default select timeout was set to 1 ms, causing 3-5% CPU usage when
idles.

Signed-off-by: Stepan Blyschak stepanb@mellanox.com

What I did

I changed default select timeout from 1ms to 1000ms.

Why I did it

portsyncd is causing 3-5% CPU usage when system idles.

How I verified it

Basic sanity verification on T0 topology

Details if related

Default select timeout was set to 1 ms, causing 3-5% CPU usage when
idles.

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
@stepanblyschak
Copy link
Contributor Author

Needed for 201811, 201904, 201911 and master

@prsunny
Copy link
Collaborator

prsunny commented May 4, 2020

lgtm, @zhenggen-xu to check since you had some changes in portsyncd for dpb?

cout << "Get port configuration from ConfigDB..." << endl;
SWSS_LOG_ENTER();

SWSS_LOG_NOTICE("Getting port configuration from ConfigDB...");
Copy link
Contributor

@qiluo-msft qiluo-msft May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this part? This is one time cout at the beginning, no harm to cout. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SWSS_LOG_* is a usual way to do logging in SONiC, why leave it to be cout?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always good to improve the code, but I feel we should keep consistency. If we are changing the style in this function that is not touched by your core logic, we should make other places in this file to be consistent. like line 112 etc. Or we could leave all these improvements in a separate PR, and address the whole file. This way it would be easier to maintain commits across branches etc.


if (ret == Select::TIMEOUT)
if (temps == static_cast<Selectable*>(&netlink))
Copy link
Contributor

@qiluo-msft qiluo-msft May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (temps == static_cast<Selectable*>(&netlink)) [](start = 12, length = 48)

The behavior is changed and suspicious? Could you focus on "fix default select timeout" in this PR? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is suspicious here? On netlink message, check wether we need to send PortInitDone. Previously this check was done every 1ms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It not necessarily mean your change is wrong. I mean the purpose of this PR is to "fix default select timeout".

Let's assume if you don't change the behavior here, this check will be done every 1000ms. Anything wrong? If there is a bug, I prefer fix it in a separate PR.


In reply to: 420307407 [](ancestors = 420307407)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is kind of another problem. That's why I assume the select timeout was set to very small value. If we increase select timeout to 1 sec, the check will be performed every 1 sec. This means there might be 1 sec delay in notifying system about PortInitDone. To avoid this unneccesary delay, we check if select was unblocked due to netlink message and do the PortInitDone check. So, I think that issue is strongly related to select timeout and don't see a reason to fix it in a seperate PR. I hope that sounds reasonable to fix in one PR

Copy link
Contributor

@qiluo-msft qiluo-msft May 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g_portSet is decreased by LinkSync::onMsg(). Not sure the relationship between the timing of that and selecting netlink. Is it possible to have time risk?


In reply to: 421040960 [](ancestors = 421040960,420307407)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LinkSync::onMsg is called in NetLink::readData() in select(), so g_portSet will be updated after select()

@@ -20,6 +20,8 @@
using namespace std;
using namespace swss;

#define DEFAULT_SELECT_TIMEOUT 1000 /* ms */
Copy link
Contributor

@qiluo-msft qiluo-msft May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1000 [](start = 31, length = 4)

Is it too slow? How about CPU usage @200 ms? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any number of ms should be good here, I don't see a problem with 1000 ms, and this is default for other daemons. Why it is to slow? Which tasks portsyncd needs to do every DEFAULT_SELECT_TIMEOUT ms?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the code, this number is not really determining the frequency of checking the events, since whenever you timeout, you call select again without doing anything, so it is almost always checking the events, thus no slowness concerns. The code in the review will have similar behavior with bigger timeout even with -1. But I think setting to 1 is fine.

@qiluo-msft qiluo-msft requested a review from zhenggen-xu May 5, 2020 18:09
Copy link
Contributor

@qiluo-msft qiluo-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As comments

@zhenggen-xu
Copy link
Collaborator

Overall, LGTM, just had two comments.

@qiluo-msft qiluo-msft merged commit 727a518 into sonic-net:master May 13, 2020
@abdosi
Copy link
Contributor

abdosi commented May 28, 2020

@lguohan @rlhui

Do we need this for 201911 ?

@yxieca
Copy link
Contributor

yxieca commented Jul 9, 2020

This change is an enhancement. I'll leave it out for 201811.

@abdosi
Copy link
Contributor

abdosi commented Aug 9, 2020

Not taking for 201911 at this stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants