
difference between sync and async distributed training. #550

Closed
shawnwang18 opened this issue Jan 31, 2018 · 6 comments

@shawnwang18

Hi, I am trying to understand distributed training performance on GPU clusters, and I am a little confused about the following two distributed parallel modes:

#1 Sync mode runs the worker job on the PS GPU devices, and shards the variables across GPU0 of the PS replicas.
#2 Async mode shards the variables across the PS servers (GPU0 of each PS replica), but runs the worker job on the worker replicas.

right?
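
For context, this is the kind of TF 1.x cluster definition I have in mind (hostnames and task counts are just placeholders):

```python
import tensorflow as tf  # TF 1.x

# Placeholder 2-PS / 2-worker cluster.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own job and task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```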

What do sync and async mean here? Why are they called sync and async, and what is the motivation for each mode?

Could someone explain?

Thanks.

@WencongXiao

I observe the same behavior and would like to know the intuition behind this implementation. Also, what work does the worker actually do in sync mode? Should I assign GPUs to the workers?

@zhypku

zhypku commented Feb 3, 2018

According to my understanding of the code, the program calls the model function on the "datashard_device".
In single-machine training (and async mode), the datashard devices are those assigned to the worker process.
However, in sync mode, the datashard devices are those assigned to the ps process, so the training actually runs in the ps process.
This is somewhat counter-intuitive, since the common practice is to use the ps only for model synchronization. I wonder why this design was chosen and what its benefit is.
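
For concreteness, here is a simplified sketch of the difference I mean (this is not the actual T2T code; the `sync` flag and device strings are just my assumptions):

```python
# Hypothetical helper illustrating where the datashard devices point.
def datashard_devices(sync, num_ps_gpus, num_worker_gpus):
    if sync:
        # Sync mode: the model towers are placed on the ps job's GPUs,
        # so the training actually runs inside the ps processes.
        return ["/job:ps/task:%d/gpu:0" % i for i in range(num_ps_gpus)]
    # Single-machine / async mode: towers run on the worker's own GPUs.
    return ["/job:worker/gpu:%d" % i for i in range(num_worker_gpus)]
```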
Thanks.

@rsepassi
Contributor

rsepassi commented Feb 9, 2018

The naming makes things a bit confusing. Let me try to clarify things.

In sync mode, the worker is actually quite dumb and isn't doing much other than calling Session.run, which launches the actual work on the parameter servers (that's where the naming gets confusing: the parameter servers in this case are the "workers", and the worker is really just a master). Because all the work happens synchronously across the PS workers, noise is reduced; it's equivalent to multiplying your batch size by the number of PS workers you have.
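
Here is a rough sketch of how I picture the sync setup (the model, batch sizes, and device strings are placeholders, not T2T's actual code):

```python
import tensorflow as tf  # TF 1.x

NUM_PS = 2  # assumed number of ps tasks, one GPU each

def model_fn(x):
    # Stand-in model: a single linear layer with a squared loss.
    w = tf.get_variable("w", [10, 1])
    return tf.reduce_mean(tf.square(tf.matmul(x, w)))

# The single worker ("master") builds one graph that places each data shard
# on a different ps device, then averages the gradients.
data = tf.random_normal([NUM_PS * 32, 10])
shards = tf.split(data, NUM_PS)

tower_grads = []
for i in range(NUM_PS):
    with tf.device("/job:ps/task:%d/gpu:0" % i), \
         tf.variable_scope("model", reuse=(i > 0)):
        loss = model_fn(shards[i])
        tower_grads.append(tf.gradients(loss, tf.trainable_variables()))

# Averaging the per-shard gradients is what makes the effective batch size
# (per-shard batch) * NUM_PS.
avg_grads = [tf.reduce_mean(tf.stack(g), axis=0) for g in zip(*tower_grads)]
```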

In async mode, the workers each run independently and do the training themselves; the parameter servers are used only to hold the parameters. This is noisier because there is no synchronization between the workers.
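
And a corresponding sketch of async mode, where each worker runs this independently and only the variables live on the ps job (hostnames and the toy model are placeholders):

```python
import tensorflow as tf  # TF 1.x

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})
task_index = 0  # this worker's index

# replica_device_setter places variables on /job:ps and ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
    w = tf.get_variable("w", [10, 1])
    x = tf.random_normal([32, 10])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    # Each worker applies its own updates without waiting for the others,
    # which is why async training is noisier.
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```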

@lukaszkaiser
Contributor

I think the ps vs. worker naming is indeed reversed in T2T, but it seems tedious to change.

@zhypku

zhypku commented Feb 11, 2018

@rsepassi
You said "it's equivalent to multiplying your batch size by the number of PS workers you have."
The "number of PS workers" is the number of GPUs allocated to PS, right? Say, if I have two PS processes, each of which has 2 GPUs, then the global batch size is hparams.batch_size * 4. Is that right?

@rsepassi
Contributor

rsepassi commented Feb 11, 2018 via email
