
fix distributed initialization #285

Closed
wants to merge 1 commit

Conversation

@ZiyueHuang (Contributor) commented Aug 5, 2020

According to this line https://github.com/bytedance/byteps/blob/master/byteps/server/server.cc#L197, the server will initialize the stored parameter from the last init push, which could be sent from an arbitrary worker.
cc @eric-haibin-lin

```diff
@@ -221,8 +221,6 @@ def _init_params(self):
                 param_arrays = param._check_and_get(param._data, list)
                 idx = self._param2idx[param.name]

-                if rank() != self.root_rank:
-                    param_arrays[0].__imul__(0)
                 byteps_push_pull(param_arrays[0], version=0, priority=0,
                                  name="parameter_" + str(idx), is_average=False)
```
Contributor

Maybe we should use is_average=True here?

@ZiyueHuang (Contributor, Author) Aug 5, 2020

I think there is no need to average, since the server just selects one received tensor, copies it to the stored buffer, and then sends it to all workers.

Contributor

You're right.

@ZiyueHuang (Contributor, Author) commented Aug 5, 2020

For sync mode, this initialization is also necessary since it makes sure that the parameters on each worker are the same at the beginning. The current implementation is problematic since all the workers may actually initialize the parameters to 0 (except when the root node happens to send the last init push), no matter what initializer the user chooses.
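To make the failure mode concrete, here is a small standalone simulation (plain Python/NumPy, not BytePS code; the server is modeled only as "the last init push wins", as described above):

```python
import numpy as np

# Toy model of the pre-fix behavior (illustration only, not actual BytePS
# code). The server overwrites its stored buffer with every init push, so
# the last push to arrive wins.
def init_round(init_values, root_rank=0, arrival_order=None):
    arrival_order = arrival_order or list(range(len(init_values)))
    stored = None
    for rank in arrival_order:
        value = init_values[rank].copy()
        if rank != root_rank:
            value *= 0          # non-root workers zero their parameters first
        stored = value          # server keeps only the last init push
    return stored               # this is what every worker then pulls

inits = [np.full(3, r + 1.0) for r in range(4)]   # each worker's own init

# If a non-root worker's push happens to arrive last, everyone gets zeros.
print(init_round(inits, arrival_order=[1, 2, 0, 3]))   # -> [0. 0. 0.]
# Only if the root's push arrives last do workers get the root's values.
print(init_round(inits, arrival_order=[1, 2, 3, 0]))   # -> [1. 1. 1.]
```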

@ZiyueHuang (Contributor, Author)

This problem also exists for the distributed gradient accumulation in the first round...

For the push phase in the first round of byteps_push_pull, the server will discard all the messages (except the last one) sent from all the workers. We should separate this step from byteps_push_pull into a new API, say, byteps_init, and let byteps_push_pull only do the sum operation, similar to InitImpl https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_dist.h#L193.
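A rough sketch of the semantics such a split could have, modeled as a toy in-memory server (purely hypothetical: byteps_init is only a proposed name, the key strings are examples, and the real BytePS server is C++, not Python):

```python
import numpy as np

# Hypothetical semantics for the proposed split (illustration only):
# an explicit init call installs the initial value once, and push_pull
# always reduces, even in the first round for a new key.
class ToyServer:
    def __init__(self):
        self.stored = {}

    def init(self, key, tensor):
        # explicit initialization; only this call may overwrite the buffer
        self.stored[key] = tensor.copy()

    def push_pull(self, key, tensors, average=False):
        # always sum the pushed tensors -- never "last push wins"
        reduced = np.sum(tensors, axis=0)
        if average:
            reduced /= len(tensors)
        self.stored[key] = reduced
        return self.stored[key]

server = ToyServer()
server.init("parameter_0", np.ones(3))            # root sends the initial value
grads = [np.full(3, 0.5), np.full(3, 1.5)]        # one gradient per worker
print(server.push_pull("gradient_0", grads))      # -> [2. 2. 2.]
```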

@jasperzhong (Contributor)

> This problem also exists for the distributed gradient accumulation in the first round...
>
> For the push phase in the first round of byteps_push_pull, the server will discard all the messages (except the last one) sent from all the workers. We should separate this step from byteps_push_pull into a new API, say, byteps_init, and let byteps_push_pull only do the sum operation, similar to InitImpl https://github.com/apache/incubator-mxnet/blob/master/src/kvstore/kvstore_dist.h#L193.

I don't think there is such a problem in the first round of push_pull for gradients. The stored buffer has been allocated and initialized during the push_pull of the parameters.

BTW, there is already an API called bps.init().

jasperzhong pushed a commit to jasperzhong/byteps that referenced this pull request Aug 5, 2020
@ZiyueHuang (Contributor, Author) commented Aug 5, 2020

@vycezhong The gradient and the weight for the same parameter are two different buffers in the server, since they have different keys. See https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L203-L206. The problem is that for a new key, in the first round of byteps_push_pull, the server does not perform the sum operation, and this is weird.

bps.init() is not for a particular parameter; see the kvstore init API in MXNet.
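A tiny illustration of the key point (toy dictionary, not BytePS code; the "parameter_" prefix matches the diff above, while the "gradient_" prefix is an assumption about the naming in byteps/mxnet/__init__.py):

```python
import numpy as np

# Toy view of the server's key space (illustration only): the weight and
# the gradient of the same parameter are registered under different keys,
# so they occupy two independent buffers on the server.
server_store = {}

idx = 0
weight = np.ones(4)              # pushed once by _init_params
gradient = np.full(4, 0.25)      # pushed every iteration by the trainer

server_store["parameter_" + str(idx)] = weight.copy()    # weight buffer
server_store["gradient_" + str(idx)] = gradient.copy()   # gradient buffer

assert len(server_store) == 2    # two separate buffers for one parameter
```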

@jasperzhong (Contributor) commented Aug 5, 2020

I see, you're right. I'm sorry that I didn't notice that the parameter and the gradient do not share the same buffer...

I think maybe we can reuse the parameter buffer? For example:

```python
        for i, param in enumerate(self._params):
            # declare one key per parameter and reuse it for both init and gradients
            byteps_declare_tensor("tensor_" + str(i))
        ...

    def _init_params(self):
        ...
        # pushing the initial value under the same "tensor_<i>" key allocates
        # and initializes the buffer that the gradients will use later
        byteps_push_pull(param_arrays[0], version=0, priority=0,
                         name="tensor_" + str(idx), is_average=False)
```

In this way, there is only one collection of buffers, and byteps_push_pull in _init_params will trigger initialization of the buffer. Then this buffer is used for the gradients. I think it may be a solution.

@ZiyueHuang (Contributor, Author)

This might be a solution for training the model, but it is still weird that the first round of byteps_push_pull performs a random selection instead of the sum...

Furthermore, it may not be necessary to maintain two buffers, update_buf_ and store_, in https://github.com/bytedance/byteps/blob/master/byteps/server/server.h, since BytePS transmits gradients bidirectionally (both from worker to server and from server to worker). MXNet KVStore maintains these two buffers since MXNet pushes the gradient from worker to server, then updates the weight on the server with the optimizer, then pulls the weight from server to worker.
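A toy comparison of the two designs (illustration only; the function names are made up, and the real servers are far more involved than this):

```python
import numpy as np

# MXNet KVStore style: the server owns the weights, so it needs both an
# aggregation buffer (update_buf_) and a weight store (store_).
def kvstore_server_step(store, update_buf, key, pushed_grads, lr=0.1):
    update_buf[key] = np.sum(pushed_grads, axis=0)   # aggregate pushed gradients
    store[key] -= lr * update_buf[key]               # server-side optimizer update
    return store[key]                                # workers pull the new weight

# BytePS style: gradients flow worker -> server -> worker and the optimizer
# runs on the workers, so a single buffer for the reduced gradient suffices.
def byteps_server_step(store, key, pushed_grads):
    store[key] = np.sum(pushed_grads, axis=0)        # aggregate pushed gradients
    return store[key]                                # workers pull the summed gradient

grads = [np.full(2, 0.5), np.full(2, 1.5)]
print(byteps_server_step({}, "gradient_0", grads))               # -> [2. 2.]
print(kvstore_server_step({"w0": np.ones(2)}, {}, "w0", grads))  # -> [0.8 0.8]
```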

@ZiyueHuang ZiyueHuang closed this Aug 5, 2020

@bobzhuyb (Member) commented Aug 5, 2020

Is it causing any real problems? The initialized value is not used later, so we did not care to make it deterministic.

jasperzhong added a commit to jasperzhong/byteps that referenced this pull request Aug 5, 2020
@jasperzhong (Contributor)

> Is it causing any real problems? The initialized value is not used later, so we did not care to make it deterministic.

Sorry for my carelessness. We found the logic is right.

jasperzhong added a commit to jasperzhong/byteps that referenced this pull request Aug 6, 2020