
difference between sync and async distributed training. #550

Closed
shawnwang18 opened this issue Jan 31, 2018 · 6 comments

@shawnwang18

Hi, I am trying to understand distributed training performance on GPU clusters, and I am a little confused about the following two distributed parallel modes:

#1 Sync mode runs the worker job on the PS GPU devices, and shards the variables across GPU0 of the PS replicas.
#2 Async mode shards the variables across the PS servers (GPU0 of each PS replica), but runs the worker job on the worker replicas.

right?
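
For context, this is the kind of TF 1.x cluster definition I have in mind (hostnames and task counts are just placeholders):

```python
import tensorflow as tf  # TF 1.x

# Placeholder 2-PS / 2-worker cluster.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own job and task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```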

What do sync and async mean here? Why are they called sync and async, and what is the motivation for each mode?

Could someone explain?

Thanks.

@WencongXiao

I observe the same behavior and would like to know the intuition behind this implementation. Also, what work does the worker actually do in sync mode? Should I assign GPUs to the workers?

@zhypku

zhypku commented Feb 3, 2018

According to my understanding of the code, the program calls the model function on the "datashard_device".
In single-machine training (and async mode), the datashard devices are those assigned to the worker process.
However, in sync mode, the datashard devices are those assigned to the ps process, so the training actually runs in the ps process.
This is somewhat counter-intuitive, since the common practice is to use the ps only for model synchronization. I wonder why this design was chosen and what its benefit is.
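
For concreteness, here is a simplified sketch of the difference I mean (this is not the actual T2T code; the `sync` flag and device strings are just my assumptions):

```python
# Hypothetical helper illustrating where the datashard devices point.
def datashard_devices(sync, num_ps_gpus, num_worker_gpus):
    if sync:
        # Sync mode: the model towers are placed on the ps job's GPUs,
        # so the training actually runs inside the ps processes.
        return ["/job:ps/task:%d/gpu:0" % i for i in range(num_ps_gpus)]
    # Single-machine / async mode: towers run on the worker's own GPUs.
    return ["/job:worker/gpu:%d" % i for i in range(num_worker_gpus)]
```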
Thanks.

@rsepassi
Contributor

rsepassi commented Feb 9, 2018

The naming makes things a bit confusing. Let me try to clarify things.

In sync mode, the worker is actually quite dumb and isn't doing much other than calling Session.run, which launches the actual work on the parameter servers (that's where the naming gets confusing: the parameter servers in this case are the "workers", and the worker is really just a master). Because all the work happens synchronously across the PS workers, noise is reduced; it's equivalent to multiplying your batch size by the number of PS workers you have.
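
Here is a rough sketch of how I picture the sync setup (the model, batch sizes, and device strings are placeholders, not T2T's actual code):

```python
import tensorflow as tf  # TF 1.x

NUM_PS = 2  # assumed number of ps tasks, one GPU each

def model_fn(x):
    # Stand-in model: a single linear layer with a squared loss.
    w = tf.get_variable("w", [10, 1])
    return tf.reduce_mean(tf.square(tf.matmul(x, w)))

# The single worker ("master") builds one graph that places each data shard
# on a different ps device, then averages the gradients.
data = tf.random_normal([NUM_PS * 32, 10])
shards = tf.split(data, NUM_PS)

tower_grads = []
for i in range(NUM_PS):
    with tf.device("/job:ps/task:%d/gpu:0" % i), \
         tf.variable_scope("model", reuse=(i > 0)):
        loss = model_fn(shards[i])
        tower_grads.append(tf.gradients(loss, tf.trainable_variables()))

# Averaging the per-shard gradients is what makes the effective batch size
# (per-shard batch) * NUM_PS.
avg_grads = [tf.reduce_mean(tf.stack(g), axis=0) for g in zip(*tower_grads)]
```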

In async mode, the workers each run independently and do the training themselves; the parameter servers are used only to hold the parameters. This is noisier because there is no synchronization between the workers.
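
And a corresponding sketch of async mode, where each worker runs this independently and only the variables live on the ps job (hostnames and the toy model are placeholders):

```python
import tensorflow as tf  # TF 1.x

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})
task_index = 0  # this worker's index

# replica_device_setter places variables on /job:ps and ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
    w = tf.get_variable("w", [10, 1])
    x = tf.random_normal([32, 10])
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    # Each worker applies its own updates without waiting for the others,
    # which is why async training is noisier.
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```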

@lukaszkaiser
Contributor

I think the ps vs. worker naming is indeed reversed in T2T, but it seems tedious to change.

@zhypku

zhypku commented Feb 11, 2018

@rsepassi
You said "it's equivalent to multiplying your batch size by the number of PS workers you have."
The "number of PS workers" is the number of GPUs allocated to PS, right? Say, if I have two PS processes, each of which has 2 GPUs, then the global batch size is hparams.batch_size * 4. Is that right?

@rsepassi
Contributor

rsepassi commented Feb 11, 2018 via email
