difference between sync and async distributed training. #550
Comments
I observe the same behavior and would like to understand the intuition behind this implementation. Also, what work does a worker actually do in sync mode? Should I assign GPUs to the workers? |
According to my understanding of the code, the program will call the model function on the "datashard_device". |
The naming makes things a bit confusing. Let me try to clarify.

In sync mode, the worker device is actually quite dumb and isn't doing much other than calling

In async mode, the workers each run independently and do the training themselves. The parameter servers are used only for parameters. This is noisier because there is no synchronization between the workers. |
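To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy model, not T2T or TensorFlow code: the function names (`grad`, `sync_step`, `async_round`), the one-parameter loss, and the data are all assumptions made for illustration. The async part is also simplified: workers take turns instead of racing, so it only shows that each worker updates the shared parameter on its own, with no averaging across workers.

```python
def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def sync_step(w, shards, lr):
    """Synchronous update: every replica computes a gradient at the same w,
    the gradients are averaged, and the parameter is updated once."""
    grads = [grad(w, shard) for shard in shards]
    return w - lr * sum(grads) / len(grads)


def async_round(w, shards, lr):
    """Asynchronous update (simplified): each worker reads the current w,
    computes its own gradient, and applies it immediately, so w changes
    once per worker instead of once per round."""
    for shard in shards:
        w = w - lr * grad(w, shard)
    return w


# Hypothetical toy data: two equal-size shards drawn from y = 2 * x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
combined = shards[0] + shards[1]

w_sync = sync_step(0.0, shards, lr=0.01)
w_big_batch = 0.0 - 0.01 * grad(0.0, combined)
w_async = async_round(0.0, shards, lr=0.01)

print(w_sync, w_big_batch, w_async)
```

In this toy model, with equal-size shards, the synchronous step lands on the same value as a single step over the combined batch, while the asynchronous round ends somewhere else because the second worker sees a parameter the first worker has already moved.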
I think the PS vs. worker naming is indeed reversed in T2T, but it seems tedious to change. |
@rsepassi |
Correct.
On Sun, Feb 11, 2018 at 12:14 AM Hanyu Zhao wrote:
@rsepassi <https://github.com/rsepassi>
You said "it's equivalent to multiplying your batch size by the number of
PS workers you have."
The "number of PS workers" is the number of GPUs allocated to PS, right?
Say, if I have two PS processes, each of which has 2 GPUs, then the global
batch size is hparams.batch_size * 4. Is that right?
|
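The arithmetic confirmed in the exchange above can be written out as a short sketch. The helper name `global_batch_size` and the example value 4096 are hypothetical; only the rule itself (per-GPU `hparams.batch_size` times the total number of GPUs allocated to PS processes in sync mode) comes from the thread.

```python
def global_batch_size(batch_size, num_ps_processes, gpus_per_ps):
    """Effective batch size in sync mode: the per-GPU batch size times
    the total number of GPUs across all PS processes."""
    return batch_size * num_ps_processes * gpus_per_ps


# Two PS processes with 2 GPUs each, as in the question above.
# 4096 is an assumed stand-in for hparams.batch_size.
print(global_batch_size(4096, num_ps_processes=2, gpus_per_ps=2))  # 16384
```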
Hi, I am trying to understand distributed training performance on GPU clusters, and I am a little confused about the following two distributed parallel modes:
#1 Sync mode runs the worker jobs on the PS GPU devices and shards the variables across GPU0 of the PS replicas.
#2 Async mode shards the variables across the PS services (GPU0 of the PS replicas) but runs the worker jobs on the worker replicas.
Is that right?
What do sync and async mean here? Why are the modes called sync and async, and what is the motivation for each?
Could someone explain?
Thanks.