Updated network retry delay strategy to scale #3306
Conversation
Force-pushed 77eb466 to e5dbedd
src/network/linkers_socket.cpp (outdated)
- const int connect_fail_retry_cnt = 20;
+ const int connect_fail_retries_factor_machine = 25;
+ const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / connect_fail_retries_factor_machine);
+ const int connect_fail_retry_cnt = std::max(20, connect_fail_retries_scale_factor);
So it is 20 when num_machines_ = 500. Is this too small?
It's 20 for num_machines_ less than 500, correct. It's too small if we're spinning up, say, more than 1000 machines, in which case spinning them up and getting them running itself takes a while (especially when we launch them as containers, say in a Kubernetes cluster).
Force-pushed e5dbedd to 00b881a
@StrikerRUS and @guolinke can you PTAL at the PR again when you get a chance? Thanks :)
I'm not a cpp reviewer, but the concept of this PR looks OK to me!
Just wondering, why do we need two different values, 20 and 25?
@StrikerRUS maybe @aakarshg can come up with better naming for these variables.
@StrikerRUS I put in 2 different values, 20 and 25, for the following reasons:
But if we have more than 525 machines, then the number of retries will be 21, and it scales up accordingly as we increase num_machines. Hope that explains things. About the number of nodes usually being a power of 2, that's not true tbh. There are cases (can't quite go into detail) where num_machines is more tightly tied to the number of files the training data is split across. I was looking at around 900 or so machines :)
@aakarshg
Why can't the number of machines be divided by 20?
I can do that, but then the current behavior will only hold up to 400 machines, as 400/20 gives the current count of 20 retries. If that's okay, then I'll update the PR to just divide by 20 and make the code more accessible.
I find both 400 and 500 machines in a cluster very big numbers, IMHO. So I don't see any difference between 400 and 500. Is 500 some special threshold?
Force-pushed 00b881a to 615d265
Nothing really, it's more of an opinion.
Agreed, updated the PR. PTAL again, thanks :)
Force-pushed 615d265 to 8389be8
src/network/linkers_socket.cpp (outdated)
const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / 20);
const int connect_fail_retry_cnt = std::max(20, connect_fail_retries_scale_factor);
@aakarshg Thank you for addressing the comments! I'm not sure, but maybe it would be better to store 20 in a constant? If we ever need to update this value, a constant will let us do it in only one place.
ping @aakarshg
gently ping @aakarshg
Yes, that'll be absolutely okay, I'll update the PR :) Apologies for the late replies.
Thank you!
Force-pushed 8389be8 to 4963ec2
This allows network retries to scale well with the number of machines, while still retaining the existing behavior for cases with smaller num_machines (< 500). Fixes microsoft#3301
Force-pushed 4963ec2 to 0b002e1
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.