fix distributed initialization #285
Conversation
```diff
@@ -221,8 +221,6 @@ def _init_params(self):
                 param_arrays = param._check_and_get(param._data, list)
                 idx = self._param2idx[param.name]

-                if rank() != self.root_rank:
-                    param_arrays[0].__imul__(0)
                 byteps_push_pull(param_arrays[0], version=0, priority=0,
                                  name="parameter_" + str(idx), is_average=False)
```
Maybe we should use `is_average=True` here?
I think there is no need to average, since the server just selects one received tensor, copies it to the stored buffer, and then sends it to all workers.
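For concreteness, here is a minimal sketch of the distinction being discussed, using a toy stand-in for a first-round `byteps_push_pull` on a new key (which pushed tensor the server selects is an implementation detail; the point is only "average of all pushes" vs. "one push broadcast to everyone"):

```python
import numpy as np

def toy_push_pull(worker_tensors, is_average):
    """Toy stand-in for a first-round push_pull on a new key.

    is_average=True : the result is the element-wise average of all pushes.
    is_average=False: the server keeps one pushed tensor (no averaging) and
                      every worker pulls that same copy.
    """
    if is_average:
        result = sum(worker_tensors) / len(worker_tensors)
    else:
        result = worker_tensors[0].copy()  # one push is selected; which one is up to the server
    return [result.copy() for _ in worker_tensors]  # all workers receive the same value

pushes = [np.full(3, float(rank)) for rank in range(4)]  # ranks push 0, 1, 2, 3
print(toy_push_pull(pushes, is_average=True)[0])   # [1.5 1.5 1.5]
print(toy_push_pull(pushes, is_average=False)[0])  # the selected push, unaveraged
```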
You're right.
For sync mode, this initialization is also necessary since it makes sure that the parameters on each worker are the same at the beginning. The current implementation is problematic since all the workers may actually initialize the parameters to 0 (except when the root node sends the last init push), no matter what initializer the user chooses.
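To make that concrete, this is what each worker pushes under the implementation being discussed (the zeroing that this diff removes); whether the user-chosen initialization survives then depends entirely on whose push the server ends up keeping:

```python
import numpy as np

ROOT_RANK = 0

def init_push_payload(rank, initializer):
    """Model of the original _init_params: every worker runs the user's
    initializer, but non-root workers zero their copy before the first
    push_pull (the param_arrays[0].__imul__(0) line)."""
    param = initializer((4,))
    if rank != ROOT_RANK:
        param *= 0
    return param

uniform_init = lambda shape: np.random.uniform(-0.1, 0.1, size=shape)
for rank in range(3):
    print(rank, init_push_payload(rank, uniform_init))
# Only rank 0's payload carries the chosen initialization; ranks 1 and 2 push
# zeros, so if the server keeps a non-root push, the stored parameter is all zeros.
```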
This problem also exists for the distributed gradient accumulation in the first round... For the ...
I don't think there is such a problem in the first round of push_pull of gradients. ... BTW, there is already an API called ...
@vycezhong The gradient and the weight for the same parameter are two different buffers in the server since they have different keys. See https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L203-L206. The problem is that, for a new key, in the first round of push_pull the stored buffer may be initialized from an arbitrary worker's push.
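A tiny illustration of this point; `"parameter_" + idx` matches the diff above, while the gradient key name below is only a placeholder for whatever key the trainer declares for gradients:

```python
import numpy as np

# One server-side buffer per declared key: the weight and the gradient of the
# same parameter live under different keys, so how one buffer gets initialized
# says nothing about the other.
server_store = {
    "parameter_0": np.zeros(4),  # initialized by the first-round parameter push
    "gradient_0": np.zeros(4),   # placeholder gradient key, initialized separately
}
assert server_store["parameter_0"] is not server_store["gradient_0"]
```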
I see, you're right. I am sorry that I didn't notice the parameter and the gradient do not share the same buffer... I think maybe we can reuse the parameter buffer? For example:

```python
for i, param in enumerate(self._params):
    byteps_declare_tensor("tensor_" + str(i))
    ...

def _init_params(self):
    ...
    byteps_push_pull(param_arrays[0], version=0, priority=0,
                     name="tensor_" + str(idx), is_average=False)
```

In this way, there is only one collection of buffers. And ...
This might be a solution for training the model, but it is still weird that the first round of ... Furthermore, it may not be necessary to maintain two buffers ...
Init tensor comes from the root rank only: https://github.com/bytedance/byteps/blob/master/byteps/common/operations.cc#L331-L341
Is it causing any real problems? The initialized value is not used later, so we did not care to make it deterministic.
Sorry for my carelessness. We found the logic is right.
This reverts commit a692fea.
According to this line https://github.com/bytedance/byteps/blob/master/byteps/server/server.cc#L197, the server will initialize the stored parameter from the last init push which could be sent from an arbitrary worker.
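In other words, a toy model of that server-side behaviour (the real handler is C++; this only mirrors the "each init push overwrites the stored value" logic implied by the linked line):

```python
import numpy as np

stored = {}

def handle_init_push(key, tensor):
    """Toy init path: every first-round push for a key overwrites the buffer,
    so the final content depends on arrival order, i.e. on an arbitrary worker."""
    stored[key] = tensor.copy()  # last writer wins

for sender_rank in [2, 0, 1]:  # nondeterministic arrival order
    handle_init_push("parameter_0", np.full(4, float(sender_rank)))
print(stored["parameter_0"])   # value from rank 1 in this run, not necessarily the root
```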
cc @eric-haibin-lin