Updated network retry delay strategy to scale #3306
Conversation
Force-pushed 77eb466 to e5dbedd
src/network/linkers_socket.cpp (outdated)
- const int connect_fail_retry_cnt = 20;
+ const int connect_fail_retries_factor_machine = 25;
+ const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / connect_fail_retries_factor_machine);
+ const int connect_fail_retry_cnt = std::max(20, connect_fail_retries_scale_factor);
So it is 20 when num_machines_ = 500. Is this too small?
It's 20 for num_machines_ less than 500, correct. It's too small if we're spinning up, say, more than 1000 machines, in which case spinning them up and getting them running itself takes a while (especially when we launch them as containers, say in a Kubernetes cluster).
Force-pushed e5dbedd to 00b881a
@StrikerRUS and @guolinke can you PTAL at the PR again when you get a chance? Thanks :)
I'm not a cpp reviewer, but the concept of this PR looks OK to me!
Just wondering, why do we need two different values, 20 and 25?
@StrikerRUS maybe @aakarshg can come up with better naming for these variables.
@StrikerRUS I put in 2 different values, 20 and 25, for the following reasons:
But if we have more than 525 machines, then the number of retries will be 21, and it scales up accordingly as we increase num_machines. Hope that explains things. About the number of nodes usually being a power of 2, that's not true tbh. There are cases (can't quite go into detail) where num_machines is more tightly tied to the number of files the training data is split across. I was looking at around 900 or so machines :)
@aakarshg
Why can't the number of machines be divided by 20?
I can do that, but then the current behavior will only hold up to 400 machines, as 400/20 gives the current count of 20 retries. If that's okay, then I'll update the PR to just divide by 20 and make the code more accessible.
I find both 400 and 500 machines in a cluster very big numbers, IMHO. So I don't see any difference between 400 and 500. Is 500 some special threshold?
Force-pushed 00b881a to 615d265
Nothing really, it's more of an opinion.
Agreed, updated the PR. PTAL again, thanks :)
Force-pushed 615d265 to 8389be8
src/network/linkers_socket.cpp (outdated)
const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / 20);
const int connect_fail_retry_cnt = std::max(20, connect_fail_retries_scale_factor);
@aakarshg Thank you for addressing the comments! I'm not sure, but maybe it would be better to store 20 in a constant? If we ever need to update this value, a constant will let us do it in only one place.
ping @aakarshg
gently ping @aakarshg
Yes, that'll be absolutely okay, I'll update the PR :) Apologies for the late replies.
Thank you!
Force-pushed 8389be8 to 4963ec2
This allows network retries to scale well with the number of machines, while still retaining the existing behavior for cases with smaller num_machines (< 500). Fixes microsoft#3301
Force-pushed 4963ec2 to 0b002e1
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.