Updated network retry delay strategy to scale
This allows the number of network connection retries to scale with the
number of machines, while retaining the existing behavior
for smaller values of num_machines (up to roughly 500).

Fixes #3301
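
The scaling rule described above can be sketched in isolation. This is a minimal sketch rather than LightGBM code: the function name `ConnectRetryCount` and the constant names are hypothetical, and the values 20 and 25 are taken from the diff in this commit.

```cpp
#include <algorithm>

// Hypothetical names; the values 20 and 25 come from this commit's diff.
constexpr int kMinRetries = 20;        // retry count used before this change
constexpr int kMachinesPerRetry = 25;  // scale factor: machines per retry

// Returns the number of connection attempts for a cluster of num_machines.
// Integer division truncates, so the floor of 20 applies up to 524 machines.
constexpr int ConnectRetryCount(int num_machines) {
  return std::max(kMinRetries, num_machines / kMachinesPerRetry);
}
```

Under these assumptions, `ConnectRetryCount(100)` and `ConnectRetryCount(500)` both give 20, while `ConnectRetryCount(1000)` gives 40.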
Aakarsh Gopi committed Aug 14, 2020
1 parent 69a2691 commit 77eb466
Showing 2 changed files with 5 additions and 2 deletions.
2 changes: 1 addition & 1 deletion include/LightGBM/c_api.h
@@ -47,7 +47,7 @@ typedef void* FastConfigHandle; /*!< \brief Handle of FastConfig. */
LIGHTGBM_C_EXPORT const char* LGBM_GetLastError();

/*!
- * \brief Register a callback function for log redirecting.
+ * \brief Register a callback function for log redirecting.
* \param callback The callback function to register
* \return 0 when succeed, -1 when failure happens
*/
5 changes: 4 additions & 1 deletion src/network/linkers_socket.cpp
@@ -15,6 +15,7 @@
#include <unordered_map>
#include <unordered_set>
#include <vector>
+ #include <algorithm>

#include "linkers.h"

@@ -186,7 +187,9 @@ void Linkers::Construct() {
listener_->SetTimeout(socket_timeout_);
listener_->Listen(incoming_cnt);
std::thread listen_thread(&Linkers::ListenThread, this, incoming_cnt);
- const int connect_fail_retry_cnt = 20;
+ const int connect_fail_retries_factor_machine = 25;
+ const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / connect_fail_retries_factor_machine);
+ const int connect_fail_retry_cnt = max(20, connect_fail_retries_scale_factor);
const int connect_fail_retry_first_delay_interval = 200; // 0.2 s
const float connect_fail_retry_delay_factor = 1.3f;
// start connect
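
The two delay constants in the hunk above suggest a geometric backoff between connection attempts. The sketch below assumes that each failed attempt multiplies the previous delay by the factor. The hunk does not show the retry loop itself, so this is an illustration only, and `RetryDelayMs` is a hypothetical helper that does not exist in LightGBM.

```cpp
// Hypothetical helper; the constants mirror the values in the hunk above.
constexpr float kFirstDelayMs = 200.0f;  // 0.2 s
constexpr float kDelayFactor = 1.3f;

// Delay in milliseconds before retry number `attempt` (0-based), assuming
// each failure multiplies the previous delay by kDelayFactor.
constexpr float RetryDelayMs(int attempt) {
  float delay = kFirstDelayMs;
  for (int i = 0; i < attempt; ++i) {
    delay *= kDelayFactor;
  }
  return delay;
}
```

Under this assumption the first delay is 200 ms and the second is about 260 ms, with each later delay growing by a further 30%.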