Remove assumption that padding only occurs on last rank #6974

xylian86 · 2025-01-26T16:39:46Z

As discussed in PR-6918, padding can occur on multiple ranks with large DP degrees.

For example, with:

- Flattened tensor size: 266240
- DP degree: 768
- Alignment: 1536
- Required padding: 1024 (1536 * 174 - 266240)
- Per-rank partition size: 348 (1536 * 174 / 768)

The padding occurs on the last three ranks, since 1024 padded elements do not fit in two 348-element partitions.

This PR removes the assumption that padding is confined to the last rank, so that padding spanning multiple ranks is handled as well.
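The arithmetic in the example can be checked with a short sketch. This is only an illustration of how the padding is distributed, not the code in `stage_1_and_2.py`; the function name `padding_per_rank` and its parameters are hypothetical, and it assumes the padded size divides evenly by the DP degree.

```python
def padding_per_rank(numel, dp_degree, alignment):
    """Hypothetical helper: report how many padded elements land on each rank.

    Assumes the flattened tensor is padded up to a multiple of `alignment`
    and then split into `dp_degree` equal contiguous partitions.
    """
    # Round the flattened size up to the next multiple of the alignment.
    padded_size = -(-numel // alignment) * alignment
    assert padded_size % dp_degree == 0, "padded size must split evenly across ranks"
    partition_size = padded_size // dp_degree

    padding = {}
    for rank in range(dp_degree):
        start = rank * partition_size
        end = start + partition_size
        # Elements of this partition that lie beyond the real data are padding.
        pad = end - max(start, numel)
        if pad > 0:
            padding[rank] = pad
    return padded_size, partition_size, padding


padded_size, partition_size, padding = padding_per_rank(266240, 768, 1536)
print(padded_size - 266240)  # 1024
print(partition_size)        # 348
print(padding)               # {765: 328, 766: 348, 767: 348}
```

With the numbers above, ranks 766 and 767 hold only padding and rank 765 holds 328 padded elements, so any logic that trims padding from the last rank alone would miss two ranks.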

deepspeed/runtime/zero/stage_1_and_2.py

tjruwase · 2025-01-27T13:35:29Z

@xylian86, thanks for the quick solution!

@saforem2, can you please test this PR?

saforem2 · 2025-01-28T12:01:28Z

Yes, I will work on testing this today, thanks!

As discussed in [PR-6918](#6918), padding can occur on multiple ranks with large DP degrees.

For example, with:

- Flattened tensor size: 266240
- DP degree: 768
- Alignment: 1536
- Required padding: 1024 (1536 * 174 - 266240)
- Per-rank partition size: 348 (1536 * 174 / 768)
- The padding occurs on last three ranks.

This PR removes the single-rank padding assumption for more general cases.

---------

Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>

saforem2 · 2025-02-07T15:28:56Z

This appears to be fixed.

I've added a new comment with details in the original PR.

fix: remove assumption that padding only occurs on last rank

1c42fa6

xylian86 requested review from tjruwase and tohtana as code owners January 26, 2025 16:39

tjruwase reviewed Jan 27, 2025

deepspeed/runtime/zero/stage_1_and_2.py (outdated)

fix issue

0be1151

tjruwase approved these changes Jan 27, 2025

saforem2 and others added 4 commits January 28, 2025 10:30

Merge branch 'master' into zero12_padding_issue

cd072dd

Merge branch 'master' into zero12_padding_issue

2c1a44d

Merge branch 'master' into zero12_padding_issue

12cdd8f

Merge branch 'master' into zero12_padding_issue

03fe18f

tjruwase added this pull request to the merge queue Jan 31, 2025

Merged via the queue into deepspeedai:master with commit 4fea41f Jan 31, 2025
12 checks passed
