fix distributed initialization #285
Conversation
```diff
@@ -221,8 +221,6 @@ def _init_params(self):
                 param_arrays = param._check_and_get(param._data, list)
                 idx = self._param2idx[param.name]

-                if rank() != self.root_rank:
-                    param_arrays[0].__imul__(0)
                 byteps_push_pull(param_arrays[0], version=0, priority=0,
                                  name="parameter_" + str(idx), is_average=False)
```
Maybe we should use `is_average=True` here?
I think there is no need to average, since the server just selects one received tensor, copies it to the stored buffer, and then sends it to all workers.
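For concreteness, here is a minimal sketch of the distinction being discussed, using a toy stand-in for a first-round `byteps_push_pull` on a new key (which pushed tensor the server selects is an implementation detail; the point is only "average of all pushes" vs. "one push broadcast to everyone"):

```python
import numpy as np

def toy_push_pull(worker_tensors, is_average):
    """Toy stand-in for a first-round push_pull on a new key.

    is_average=True : the result is the element-wise average of all pushes.
    is_average=False: the server keeps one pushed tensor (no averaging) and
                      every worker pulls that same copy.
    """
    if is_average:
        result = sum(worker_tensors) / len(worker_tensors)
    else:
        result = worker_tensors[0].copy()  # one push is selected; which one is up to the server
    return [result.copy() for _ in worker_tensors]  # all workers receive the same value

pushes = [np.full(3, float(rank)) for rank in range(4)]  # ranks push 0, 1, 2, 3
print(toy_push_pull(pushes, is_average=True)[0])   # [1.5 1.5 1.5]
print(toy_push_pull(pushes, is_average=False)[0])  # the selected push, unaveraged
```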
You're right.
For sync mode, this initialization is also necessary since it makes sure that the parameters on each worker are the same at the beginning. The current implementation is problematic since all the workers may actually initialize the parameters to 0 (except when the root node sends the last init push), no matter what initializer the user chooses.
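To make that concrete, this is what each worker pushes under the implementation being discussed (the zeroing that this diff removes); whether the user-chosen initialization survives then depends entirely on whose push the server ends up keeping:

```python
import numpy as np

ROOT_RANK = 0

def init_push_payload(rank, initializer):
    """Model of the original _init_params: every worker runs the user's
    initializer, but non-root workers zero their copy before the first
    push_pull (the param_arrays[0].__imul__(0) line)."""
    param = initializer((4,))
    if rank != ROOT_RANK:
        param *= 0
    return param

uniform_init = lambda shape: np.random.uniform(-0.1, 0.1, size=shape)
for rank in range(3):
    print(rank, init_push_payload(rank, uniform_init))
# Only rank 0's payload carries the chosen initialization; ranks 1 and 2 push
# zeros, so if the server keeps a non-root push, the stored parameter is all zeros.
```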
This problem also exists for the distributed gradient accumulation in the first round... For the ...
I don't think there is such a problem in the first round of push_pull of gradients. ... BTW, there is already an API called ...
@vycezhong The gradient and the weight for the same parameter are two different buffers in the server since they have different keys. See https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L203-L206. The problem is that, for a new key, in the first round of push_pull the stored buffer may be initialized from an arbitrary worker's push.
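A tiny illustration of this point; `"parameter_" + idx` matches the diff above, while the gradient key name below is only a placeholder for whatever key the trainer declares for gradients:

```python
import numpy as np

# One server-side buffer per declared key: the weight and the gradient of the
# same parameter live under different keys, so how one buffer gets initialized
# says nothing about the other.
server_store = {
    "parameter_0": np.zeros(4),  # initialized by the first-round parameter push
    "gradient_0": np.zeros(4),   # placeholder gradient key, initialized separately
}
assert server_store["parameter_0"] is not server_store["gradient_0"]
```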
I see, you're right. I am sorry that I didn't notice the parameter and the gradient do not share the same buffer... I think maybe we can reuse the parameter buffer? For example:

```python
for i, param in enumerate(self._params):
    byteps_declare_tensor("tensor_" + str(i))
    ...

def _init_params(self):
    ...
    byteps_push_pull(param_arrays[0], version=0, priority=0,
                     name="tensor_" + str(idx), is_average=False)
```

In this way, there is only one collection of buffers. And ...
This might be a solution for training the model, but it is still weird that the first round of ... Furthermore, it may not be necessary to maintain two buffers ...
Init tensor comes from the root rank only: https://github.com/bytedance/byteps/blob/master/byteps/common/operations.cc#L331-L341
Is it causing any real problems? The initialized value is not used later, so we did not care to make it deterministic.
Sorry for my carelessness. We found the logic is right.
This reverts commit a692fea.
According to this line https://github.com/bytedance/byteps/blob/master/byteps/server/server.cc#L197, the server will initialize the stored parameter from the last init push which could be sent from an arbitrary worker.
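In other words, a toy model of that server-side behaviour (the real handler is C++; this only mirrors the "each init push overwrites the stored value" logic implied by the linked line):

```python
import numpy as np

stored = {}

def handle_init_push(key, tensor):
    """Toy init path: every first-round push for a key overwrites the buffer,
    so the final content depends on arrival order, i.e. on an arbitrary worker."""
    stored[key] = tensor.copy()  # last writer wins

for sender_rank in [2, 0, 1]:  # nondeterministic arrival order
    handle_init_push("parameter_0", np.full(4, float(sender_rank)))
print(stored["parameter_0"])   # value from rank 1 in this run, not necessarily the root
```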
cc @eric-haibin-lin