Initial support for Pipeline Parallelism #279

michaelbenayoun · 2023-10-30T18:36:49Z

This PR adds support for Pipeline Parallelism (PP) on a single node for the Llama architecture.
Multi-node training and support for other relevant architectures will be added in later PRs.
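
For readers new to PP, here is a minimal sketch of the core idea: the model's layers are split into contiguous stages, and each stage runs on a different device. This is not the code from this PR. `partition_layers` is a hypothetical helper, and the even, contiguous split is an assumption made for illustration only.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal pipeline stages.

    Hypothetical helper for illustration; the actual partitioning used by
    the library may differ.
    """
    if num_stages < 1 or num_stages > num_layers:
        raise ValueError("num_stages must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Example: 32 decoder layers spread over 4 pipeline stages.
stages = partition_layers(32, 4)
```

With 32 layers and 4 stages, each stage holds 8 consecutive layers. When the split is uneven, the earlier stages receive the extra layers.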

Training

PP works

Variants

These variants are only covered by the tests in this PR (they will be tested more thoroughly in the PR for multi-node training).

TP + PP works
DP + PP works (without ZeRO-1)
DP + PP works (with ZeRO-1)
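
To make the combinations above concrete, the following sketch shows one way a global rank could map to data-, pipeline- and tensor-parallel coordinates. The ordering (TP varies fastest, then PP, then DP) is an assumption for illustration. `rank_to_coords` and `ParallelCoords` are hypothetical names, and the layout used by the underlying library may differ.

```python
from typing import NamedTuple


class ParallelCoords(NamedTuple):
    dp_rank: int
    pp_rank: int
    tp_rank: int


def rank_to_coords(rank: int, world_size: int, tp_size: int, pp_size: int) -> ParallelCoords:
    """Map a global rank to (dp, pp, tp) coordinates.

    Assumed layout: tensor-parallel rank varies fastest, then pipeline,
    then data parallel. This is an illustrative convention only.
    """
    if world_size % (tp_size * pp_size) != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    tp_rank = rank % tp_size
    pp_rank = (rank // tp_size) % pp_size
    dp_rank = rank // (tp_size * pp_size)
    return ParallelCoords(dp_rank, pp_rank, tp_rank)


# Example: 8 workers with tp=2 and pp=2 leave a data-parallel size of 2.
coords = [rank_to_coords(r, world_size=8, tp_size=2, pp_size=2) for r in range(8)]
```

Under this layout, ranks 0 to 3 form the first data-parallel replica and ranks 4 to 7 the second. Within each replica, consecutive pairs of ranks share a pipeline stage.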

Checkpointing

Can save sharded checkpoints
Can resume from sharded checkpoint
Update the consolidation function / command?
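
As a rough illustration of what sharded checkpointing and consolidation involve, the sketch below has each (TP rank, PP rank) pair write its own shard file, and a consolidation step merge them back together. Everything here is hypothetical: the file naming, the JSON format, and the use of plain lists in place of tensors are assumptions chosen to keep the example self-contained, not the format used by this PR.

```python
import json
import tempfile
from pathlib import Path


def shard_path(ckpt_dir: Path, tp_rank: int, pp_rank: int) -> Path:
    # Hypothetical naming scheme: one file per (tp_rank, pp_rank) pair.
    return ckpt_dir / f"tp_rank_{tp_rank:02d}_pp_rank_{pp_rank:02d}.json"


def save_shard(ckpt_dir: Path, tp_rank: int, pp_rank: int, state: dict) -> None:
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    shard_path(ckpt_dir, tp_rank, pp_rank).write_text(json.dumps(state))


def load_shard(ckpt_dir: Path, tp_rank: int, pp_rank: int) -> dict:
    return json.loads(shard_path(ckpt_dir, tp_rank, pp_rank).read_text())


def consolidate(ckpt_dir: Path, tp_size: int, pp_size: int) -> dict:
    """Merge all shards into one state dict.

    Pipeline stages own disjoint parameter names, so they are unioned.
    Tensor-parallel shards hold slices of the same parameter, so they are
    concatenated in tp_rank order (lists stand in for tensors here).
    """
    full: dict = {}
    for pp_rank in range(pp_size):
        for tp_rank in range(tp_size):
            for name, values in load_shard(ckpt_dir, tp_rank, pp_rank).items():
                full.setdefault(name, []).extend(values)
    return full


with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp) / "checkpoint"
    save_shard(ckpt_dir, 0, 0, {"layers.0.weight": [1, 2]})
    save_shard(ckpt_dir, 1, 0, {"layers.0.weight": [3, 4]})
    save_shard(ckpt_dir, 0, 1, {"layers.1.weight": [5, 6]})
    save_shard(ckpt_dir, 1, 1, {"layers.1.weight": [7, 8]})
    merged = consolidate(ckpt_dir, tp_size=2, pp_size=2)
```

A real consolidation step also needs to know along which dimension each parameter was split. That detail is omitted from this sketch.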

Tests

Tests in tests/distributed/test_model_parallelization.py
Tests in tests/test_examples.py
New extensive test suite tests/distributed/test_common.py

Other

The cache system can download and upload compilation files from a training run that uses PP
Update the examples

HuggingFaceDocBuilderDev · 2023-10-30T18:41:28Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

dacorvo

Just a few nits. LGTM, thanks!

tests/conftest.py

optimum/neuron/accelerate/accelerator.py

optimum/neuron/distributed/utils.py

optimum/neuron/utils/cache_utils.py

JingyaHuang

🔥 Awesome work enabling PP and improving the training tests!! Just left some small nits.

JingyaHuang · 2024-01-18T13:48:48Z

examples/image-classification/run_image_classification.py

@@ -177,6 +195,15 @@ def main():
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

+    if model_args.use_auth_token is not None:
+        warnings.warn(
+            "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.",


which package? transformers?

michaelbenayoun · Jan 22, 2024

Yes. But I'm not in favor of adding that. These files are updated automatically by cloning the examples from Transformers. This is a bad side effect, but I think that's OK.
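
For context, the deprecation pattern that the truncated hunk above belongs to can be sketched as follows. This is a self-contained illustration, not the example script itself: the `ModelArguments` dataclass, its field types, and the `resolve_token` helper are hypothetical stand-ins, and the `ValueError` message is an assumption.

```python
import warnings
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ModelArguments:
    # Hypothetical stand-in for the script's argument dataclass.
    token: Optional[Union[str, bool]] = None
    use_auth_token: Optional[bool] = None


def resolve_token(model_args: ModelArguments) -> ModelArguments:
    """Warn about the deprecated argument and forward its value to `token`."""
    if model_args.use_auth_token is not None:
        warnings.warn(
            "The `use_auth_token` argument is deprecated and will be removed in v4.34. "
            "Please use `token` instead.",
            FutureWarning,
        )
        if model_args.token is not None:
            # Assumed behavior: refuse ambiguous input rather than pick one.
            raise ValueError("Set only `token`, not both `token` and `use_auth_token`.")
        model_args.token = model_args.use_auth_token
    return model_args
```

Calling `resolve_token(ModelArguments(use_auth_token=True))` emits a `FutureWarning` and leaves `token` set to `True`. Passing only `token` triggers no warning.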

optimum/neuron/accelerate/accelerator.py

optimum/neuron/accelerate/utils/dataclasses.py

optimum/neuron/distributed/decoder_models.py

optimum/neuron/training_args.py

tests/distributed/test_model_parallelization.py

michaelbenayoun · 2024-01-23T09:41:10Z

I addressed the comments. @dacorvo @JingyaHuang wdyt?

dacorvo

LGTM, thanks!

JingyaHuang

LGTM! Let's get it merged!

michaelbenayoun added 3 commits October 30, 2023 11:48

Refactor and creation of PipelineParallelismSpecs

4bdf600

Refactoring

92b8253

[WIP] initial support for pp

e394ec5

michaelbenayoun added 26 commits October 31, 2023 19:14

[WIP] initial support for pp

2920df7

[WIP] initial support for pp

1b82fbc

[WIP] initial support for pp

4712e95

[WIP] initial support for pp

0c55877

[WIP] initial support for pp

0acf510

[WIP] initial support for pp

3ea12dd

Update examples

2fd6abf

[WIP] add tests

4fb51ee

Add PP to test_examples.py

c74b724

Merge branch 'main' into initial_pp

6aac412

[WIP] fix TP + PP training

d0df211

Merge branch 'main' into initial_pp

a4cc66c

Style

959b3b0

[WIP]

1ef90b8

Refactor Mistral for sequence parallelism

cbdf51f

Add DistributedTest class

0571524

[WIP] tests

f57a210

Refactor

017bbbd

[WIP] tests

ce6e4ac

[WIP] tests

3e6586f

DistributedTest works

01cf4cd

[WIP] tests

ef25839

[WIP] tests

43550ba

[WIP] tests

db939b0

[WIP] tests

650771e

test_common almost done

2ad63a0

Cleanup

4e3e7ab

michaelbenayoun requested review from JingyaHuang and dacorvo January 10, 2024 16:04

Disable dp=4,tp=pp=2 for test_common for now

a82e44a

michaelbenayoun mentioned this pull request Jan 11, 2024

Bump hf libraries versions #403

Merged

michaelbenayoun added 13 commits January 11, 2024 12:10

Fix tests in test_common.py

533ffce

Merge branch 'main' into initial_pp

109aa67

Fix tests in test_common.py

f1b18d7

Fix

cfa5288

Fix test

d94057f

Fix test

dce046c

Fix

189bea9

Update workflow

51f0a65

Merge branch 'main' into initial_pp

bce46b5

Skip GPTNeo tests

7bdad6a

Move model to device by default

410a77b

Fix test

d7e85fb

Test without test_training

95499cf

dacorvo approved these changes Jan 22, 2024

JingyaHuang reviewed Jan 22, 2024

michaelbenayoun added 3 commits January 22, 2024 17:56

Apply David's suggestions

0adbab6

Apply Jingya's suggestion

840ea9d

Move distributed test conftest

e6fa03a

michaelbenayoun requested review from dacorvo and JingyaHuang January 23, 2024 09:41

dacorvo approved these changes Jan 23, 2024

JingyaHuang approved these changes Jan 23, 2024

michaelbenayoun merged commit ca6c4ff into main Jan 23, 2024
8 checks passed

michaelbenayoun deleted the initial_pp branch January 23, 2024 16:05
