[Model] Support Multi-GPU for Transformer model #356
Conversation
@yzh119 Let's work together on this PR. Currently, there seems to be a bug in this PR: if you use the grad_accum argument to accumulate gradients over multiple batches, the loss drops very slowly. Closing for now; will reopen when we finish.
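For context, the comment above concerns gradient accumulation. Below is a minimal sketch of how a `grad_accum` option is typically wired into a PyTorch training loop; apart from `grad_accum`, the function and argument names are hypothetical and the PR's actual loop may be organized differently. The key detail is dividing the loss by `grad_accum` so that the accumulated gradient matches what a single large batch would produce.

```python
import torch

def train_epoch(model, optimizer, criterion, data_loader, grad_accum=1):
    # Hypothetical loop: accumulate gradients over `grad_accum` mini-batches
    # before taking one optimizer step, so the effective batch size grows
    # without extra memory.
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = criterion(model(inputs), targets)
        # Scale the loss so the accumulated gradients average over the
        # mini-batches instead of summing.
        (loss / grad_accum).backward()
        if (step + 1) % grad_accum == 0:
            optimizer.step()
            optimizer.zero_grad()
```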
I have no write access to your repo, so I put the updated version here.
@yzh119 just added you as a collaborator. Can you move your changes here?
@yzh119 I think I have finished this PR and fixed the synchronization bug. Training on 4 GPUs now behaves like single-GPU training. Can you review again so we can merge?
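For reference, one common way to make N-GPU training match single-GPU behavior is to average gradients across workers after `backward()` and before the optimizer step. The sketch below uses `torch.distributed.all_reduce` for that averaging; it illustrates the general technique and is not necessarily the exact mechanism used in this PR.

```python
import torch.distributed as dist

def synchronize_gradients(model, world_size):
    # Sum each parameter's gradient across all workers, then divide by the
    # number of workers so every process holds the averaged gradient.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```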
Description
This PR uses multi-process training to parallelize the Transformer model. With batch size 128, 4 GPUs give a 3.3x speedup; with batch size 4096, 4 GPUs give a 3.8x speedup.
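A minimal sketch of the multi-process setup described above, using `torch.multiprocessing` to spawn one worker per GPU and `torch.distributed` with the NCCL backend for communication. The entry-point name, address, and port are placeholders, and the real example script may be structured differently.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # Hypothetical per-GPU worker: one process per device.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",  # placeholder address/port
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    # Build the model and this rank's data shard here, then run the training
    # loop with gradient synchronization after each backward pass.

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(run_worker, args=(n_gpus,), nprocs=n_gpus)
```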
Checklist
Please feel free to remove inapplicable items for your PR.
Examples are either not affected by this change, or have been fixed to be compatible with this change