
[HF][Optimum] Compiling unet in stable diffusion XL pipeline failed since Neuron SDK 2.18 #859

Open
JingyaHuang opened this issue Apr 3, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@JingyaHuang

Hi team, while trying to bump Optimum Neuron to the latest Neuron SDK 2.18 release, we noticed that the compilation of the UNet of the SDXL pipeline fails with the latest compiler. Here are more details about the regression:

  • System information
OS: Ubuntu 20.04.5 LTS
  • Neuron driver
aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]
  • Pip installed
aws-neuronx-runtime-discovery 2.9
diffusers                     0.27.2
libneuronxla                  0.5.971
neuronx-cc                    2.13.66.0+6dfecc895
numpy                         1.24.4
optimum                       1.18.0
optimum-neuron                0.0.21.dev0
torch                         1.13.1
torch-neuronx                 1.13.1.1.14.0
torch-xla                     1.13.1+torchneurone
torchvision                   0.14.1
transformers                  4.36.2
  • Error log
=== BIR verification failed ===
Reason: Pattern accesses 48 (> 32) partitions starting at partition 32
Instruction: I-36948
Opcode: GenericCopy
Output index: 0
Argument AP:
Access Pattern: [[1,48],[1,1],[1,1]]
SymbolicAP
Memory Location: {concatenate.3_set}@SB
2024-04-03T09:11:19Z 
2024-04-03T09:11:19Z Diagnostic information:
2024-04-03T09:11:19Z   NeuronX Compiler version 2.13.66.0+6dfecc895
2024-04-03T09:11:19Z   
2024-04-03T09:11:19Z   Python version 3.8.10
2024-04-03T09:11:19Z   HWM version 2.13.66.0+6dfecc895
2024-04-03T09:11:19Z   NumPy version 1.24.4
2024-04-03T09:11:19Z   
2024-04-03T09:11:19Z   Running on AMI ami-09cd747c78a9add63
2024-04-03T09:11:19Z   Running in region use1-az6
2024-04-03T09:11:19Z 
2024-04-03T09:11:19Z Diagnostic logs stored in /home/ubuntu/optimum-neuron/log-neuron-cc.txt
An error occured when trying to trace unet with the error message: neuronx-cc failed with 70.
The export is failed and unet neuron model won't be stored.
An error occured when trying to trace unet with the error message: neuronx-cc failed with 70.
  • Reproduction
from optimum.neuron import NeuronStableDiffusionXLPipeline


# [Export]
model_id = "echarlaix/tiny-random-stable-diffusion-xl"
num_images_per_prompt = 1
input_shapes = {"batch_size": 1, "height": 64, "width": 64, "num_images_per_prompt": num_images_per_prompt}
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}

# Compile and save
stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id, export=True, **compiler_args, **input_shapes
)

save_directory = "tiny_sdxl_neuronx/"
stable_diffusion.save_pretrained(save_directory)

The test above works as expected with Neuron SDK 2.17.1.
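For reference, here is a minimal sketch of reloading the compiled artifacts and running inference once the export succeeds (as it does on 2.17.1); the prompt and output filename are just placeholders:

from optimum.neuron import NeuronStableDiffusionXLPipeline

# Reload the precompiled pipeline from the save directory used above
stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained("tiny_sdxl_neuronx/")

# Run a single generation to sanity-check the compiled UNet / VAE
image = stable_diffusion(prompt="a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")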

@JingyaHuang
Author

I also tried with the PyTorch 2.1.2 setup; it doesn't work either.

@aws-bhegedus
Contributor

Hi Jingya, I'm trying to reproduce the problem. I installed optimum and optimum-neuron with
pip install "optimum[neuronx, diffusers]"
based on https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion.
However, this seems to install v0.0.3, which doesn't include NeuronStableDiffusionXLPipeline. I also tried downgrading to 0.0.2, which has another problem. Is this expected with these versions, and is there a way to get 0.0.21? Thanks.

@JingyaHuang
Author

JingyaHuang commented Apr 3, 2024

The installation with the neuronx extra is what we are going to fix with the 0.0.21 optimum-neuron release. For now, to install the latest optimum-neuron release (0.0.20), could you try with:

pip install optimum==1.18.0
pip install optimum-neuron==0.0.20

Or the 0.0.21 dev version can be installed from source:

pip install git+https://github.com/huggingface/optimum-neuron

Then you can install diffusers with pip install diffusers.
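Once everything is installed, a quick sketch to double-check which versions actually ended up in the environment (plain importlib.metadata, nothing optimum-specific):

from importlib.metadata import version

# Print the pip distribution versions relevant to this issue
for pkg in ("optimum", "optimum-neuron", "diffusers", "neuronx-cc", "torch-neuronx"):
    print(pkg, version(pkg))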

@aws-bhegedus
Contributor

Thanks Jingya, I updated optimum-neuron and diffusers and now I can reproduce the issue.

@aws-bhegedus
Contributor

Hi Jingya, I found that the issue can be prevented if we set inline_weights_to_neff=True when tracing the UNet. Would that be a sufficient workaround for now? I will also look into the root cause but that may take some time.
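In case it helps others hitting this, a hedged sketch of what the workaround could look like from the Optimum Neuron side, assuming from_pretrained forwards inline_weights_to_neff to the tracer:

from optimum.neuron import NeuronStableDiffusionXLPipeline

# Same export as in the reproduction, but with weights inlined into the NEFF
stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained(
    "echarlaix/tiny-random-stable-diffusion-xl",
    export=True,
    inline_weights_to_neff=True,  # workaround: gives up the NEFF/weights separation
    auto_cast="matmul",
    auto_cast_type="bf16",
    batch_size=1,
    height=64,
    width=64,
    num_images_per_prompt=1,
)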

@JingyaHuang
Author

Hi @aws-bhegedus, thanks for investigating it!

Optimum Neuron could force setting inline_weights_to_neff=True for SDXL models for now. But given that our caching mechanism relies on the NEFF/weights separation, we won't be able to cache and reload SDXL models (whose compilation takes a long time).

@aws-bhegedus
Contributor

Thanks Jingya, we will have a fix in a future release to allow enabling the caching.
Does this problem occur only with the tiny random SDXL model? I'm curious about SDXL-base, which I believe is larger and takes longer to compile, so it may be a bigger problem.

@JingyaHuang
Author

Thanks @aws-bhegedus, that will be awesome!

tiny-random-stable-diffusion-xl is a smaller version (fewer layers, random weights) of the SDXL models in the pipeline that we built to shorten testing time. If the compilation fails for the tiny version, it's very unlikely to work for the larger pretrained checkpoint. And since the compilation of all SDXL components takes more than an hour, the lack of caching could be a bit discouraging for first-time users.
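For context, a rough sketch of how such a tiny random UNet can be put together with diffusers for test-only checkpoints (the exact config of echarlaix/tiny-random-stable-diffusion-xl may differ, and a real SDXL test checkpoint also needs the SDXL-specific added time/text embeddings):

from diffusers import UNet2DConditionModel

# Hypothetical tiny config: far fewer channels and layers than SDXL-base,
# randomly initialized, just to keep compilation in tests fast
tiny_unet = UNet2DConditionModel(
    sample_size=8,
    block_out_channels=(32, 64),
    layers_per_block=1,
    cross_attention_dim=32,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
)
tiny_unet.save_pretrained("tiny_unet_for_tests/")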

@JingyaHuang
Author

The issue still exists with the latest Neuron SDK 2.19.1:

***** Compiling unet *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16
...
[NLA001]  Unhandled exception with message: === BIR verification failed ===
Reason: Pattern accesses 48 (> 32) partitions starting at partition 32
Instruction: I-29178
Opcode: GenericCopy
Instruction Source: (|V2<48 x 1> $29178:29178)0:
Output index: 0
Argument AP:
Access Pattern: [[1,48],[1,1],[1,1]]
SymbolicAP
Memory Location: {concatenate.3_set}@SB - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
An error occured when trying to trace unet with the error message: neuronx-cc failed with 70.
The export is failed and unet neuron model won't be stored.
***** Compiling vae_encoder *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16
.
Compiler status PASS
[Compilation Time] 8.55 seconds.
***** Compiling vae_decoder *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16
.
Compiler status PASS
[Compilation Time] 8.27 seconds.
[Total compilation Time] 38.63 seconds.
Traceback (most recent call last):
  File "test_non_inline.py", line 11, in <module>
    stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained(
  File "/home/ubuntu/pyvenv/aws_neuron_venv_2.19.1/lib/python3.8/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/home/ubuntu/optimum-neuron/optimum/neuron/utils/require_utils.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 714, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/utils/require_utils.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 954, in _export
    return cls._from_pretrained(
  File "/home/ubuntu/optimum-neuron/optimum/neuron/utils/require_utils.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 670, in _from_pretrained
    data_parallel_mode = cls.set_default_dp_mode(configs["unet"])
KeyError: 'unet'

@aws-bhegedus
Contributor

Thanks for testing @JingyaHuang. I was able to reproduce the issue again and we are looking into it.
