[Inference] Fix inference latency issue when weights/neff are separated #584

JingyaHuang · 2024-04-30T13:11:54Z

What does this PR do?

As reported in #576, inference latency degrades heavily when the weights and NEFF are not inlined. In that case the weights are not automatically loaded onto the Neuron devices, and unless we load them explicitly we pay a large host-device communication overhead.

This PR is meant to fix that.
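For illustration only, the load-then-place pattern described above can be sketched as follows. `load_and_place` is a hypothetical helper, not code from this PR; the loader and mover are passed in so the sketch runs without Neuron hardware. On a Neuron instance one would presumably pass `torch.jit.load` and `torch_neuronx.move_trace_to_device`, but treat that pairing as an assumption rather than the exact change made here.

```python
from typing import Any, Callable


def load_and_place(
    path: str,
    load_fn: Callable[[str], Any],
    move_fn: Callable[[Any, int], None],
    device_id: int = 0,
) -> Any:
    """Load a traced model, then explicitly place it on a device.

    Hypothetical helper. Assumed real-world arguments:
    load_fn=torch.jit.load, move_fn=torch_neuronx.move_trace_to_device.
    """
    model = load_fn(path)
    # With weights and NEFF separated, loading alone does not put the
    # weights on the device, so the move is done as an explicit step.
    move_fn(model, device_id)
    return model
```

Injecting the two callables keeps the ordering (load first, then move) visible and lets the helper be exercised with stand-in functions on a machine without Neuron devices.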

Caveat: The current data parallel API doesn't handle the case where weights and NEFF are not inlined. Here we use the class WeightSeparatedDataParallel as a temporary workaround. Support for this case will be included in Neuron SDK 2.20, at which point this class will be removed from Optimum Neuron. Also, non-inlined models still show 1.5x the latency of inlined models in several small, quick experiments.
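A quick latency comparison like the one mentioned above could be run with a small timing helper along these lines. This is a minimal sketch using only the standard library; `median_latency_ms` and its parameters are hypothetical names, and it is not the script used for the experiments in this PR.

```python
import statistics
import time
from typing import Any, Callable


def median_latency_ms(fn: Callable[[], Any], warmup: int = 3, runs: int = 20) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    # Warm-up calls are excluded so one-time setup cost does not skew the result.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

To compare two models, wrap each forward call in a zero-argument callable (for example `lambda: model(**inputs)`, where `model` and `inputs` are placeholders) and divide the two medians.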

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

HuggingFaceDocBuilderDev · 2024-04-30T13:15:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

optimum/neuron/modeling_base.py

optimum/exporters/neuron/convert.py

optimum/neuron/modeling_decoder.py

dacorvo

LGTM, thanks !

something not working with data parallel

5bc3516

JingyaHuang mentioned this pull request May 7, 2024

Change inline weights to Neff default value to True #590

Merged

JingyaHuang added 5 commits May 7, 2024 13:06

Merge branch 'main' into fix-non-inlined-perf

469a58c

add workaround

1e12227

fix

0c63ce8

fix style

7cc8288

remove comments

532b2a2

JingyaHuang marked this pull request as ready for review May 7, 2024 15:10

JingyaHuang added 9 commits May 7, 2024 15:13

fix doc build

96144d8

fix doc build

7416320

fix doc build

353858b

Merge branch 'main' into fix-non-inlined-perf

604ba9a

bump dev version

04b2e14

lazy loading

237e159

move custom dp class under sd modeling

83acdfe

fix?

1a54150

fix naming conflict on importing

45f2a4f

JingyaHuang requested review from michaelbenayoun and dacorvo May 9, 2024 13:48

michaelbenayoun reviewed May 13, 2024

optimum/neuron/modeling_base.py (outdated, resolved)

JingyaHuang added 2 commits May 20, 2024 09:39

Merge branch 'main' into fix-non-inlined-perf

d15d22d

add docstring

af35486

JingyaHuang requested a review from michaelbenayoun May 20, 2024 09:46

JingyaHuang added 2 commits May 20, 2024 09:59

fix import

b47967e

fix tests

052447e

JingyaHuang closed this May 20, 2024

JingyaHuang reopened this May 20, 2024

JingyaHuang added 2 commits May 21, 2024 08:56

fix test

2425138

fix test

4f3377a

JingyaHuang added 7 commits May 21, 2024 14:44

fix for decoder as well

1e924b0

try fix

a9345d9

try fix

00d1d5d

try fix

e211d41

try fix

fda3303

try fix

a17a3e8

try fix

abf45ce

dacorvo reviewed May 22, 2024

optimum/exporters/neuron/convert.py (resolved)

optimum/neuron/modeling_decoder.py (outdated, resolved)

JingyaHuang added 8 commits May 24, 2024 14:29

Merge branch 'main' into fix-non-inlined-perf

c37c9d1

fix style

00a48e8

fix typo

8107d83

add back previous fix

87c3902

add back test with subprocess to pass ddp

0ad907a

Merge branch 'main' into fix-non-inlined-perf

1a91d1a

leave test not on ddp until the cleanup on neuron device fixed in neu…

30c6736

…ron sdk.

for sdxl as well

068cf90

JingyaHuang requested a review from dacorvo May 27, 2024 16:29

dacorvo approved these changes May 28, 2024

JingyaHuang merged commit 7d840f3 into main May 28, 2024
13 checks passed

JingyaHuang deleted the fix-non-inlined-perf branch May 28, 2024 07:45

JingyaHuang mentioned this pull request Jun 12, 2024

Quite largely increased latency with weights/neff separated aws-neuron/aws-neuron-sdk#905

Open
