
[HF][Optimum] Compiling unet in stable diffusion XL pipeline failed since Neuron SDK 2.18 #859

Open
JingyaHuang opened this issue Apr 3, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@JingyaHuang

Hi team, while trying to bump Optimum Neuron to the latest Neuron SDK 2.18 release, we noticed that the compilation of the UNet of the SDXL pipeline fails with the latest compiler. Here are more details about the regression:

  • System information
OS: Ubuntu 20.04.5 LTS
  • Neuron driver
aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]
  • Pip installed
aws-neuronx-runtime-discovery 2.9
diffusers                     0.27.2
libneuronxla                  0.5.971
neuronx-cc                    2.13.66.0+6dfecc895
numpy                         1.24.4
optimum                       1.18.0
optimum-neuron                0.0.21.dev0
torch                         1.13.1
torch-neuronx                 1.13.1.1.14.0
torch-xla                     1.13.1+torchneurone
torchvision                   0.14.1
transformers                  4.36.2
  • Error log
=== BIR verification failed ===
Reason: Pattern accesses 48 (> 32) partitions starting at partition 32
Instruction: I-36948
Opcode: GenericCopy
Output index: 0
Argument AP:
Access Pattern: [[1,48],[1,1],[1,1]]
SymbolicAP
Memory Location: {concatenate.3_set}@SB
2024-04-03T09:11:19Z 
2024-04-03T09:11:19Z Diagnostic information:
2024-04-03T09:11:19Z   NeuronX Compiler version 2.13.66.0+6dfecc895
2024-04-03T09:11:19Z   
2024-04-03T09:11:19Z   Python version 3.8.10
2024-04-03T09:11:19Z   HWM version 2.13.66.0+6dfecc895
2024-04-03T09:11:19Z   NumPy version 1.24.4
2024-04-03T09:11:19Z   
2024-04-03T09:11:19Z   Running on AMI ami-09cd747c78a9add63
2024-04-03T09:11:19Z   Running in region use1-az6
2024-04-03T09:11:19Z 
2024-04-03T09:11:19Z Diagnostic logs stored in /home/ubuntu/optimum-neuron/log-neuron-cc.txt
An error occured when trying to trace unet with the error message: neuronx-cc failed with 70.
The export is failed and unet neuron model won't be stored.
An error occured when trying to trace unet with the error message: neuronx-cc failed with 70.
  • Reproduction
from optimum.neuron import NeuronStableDiffusionXLPipeline


# [Export]
model_id = "echarlaix/tiny-random-stable-diffusion-xl"
num_images_per_prompt = 1
input_shapes = {"batch_size": 1, "height": 64, "width": 64, "num_images_per_prompt": num_images_per_prompt}
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}

# Compile and save
stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained(
    model_id, export=True, **compiler_args, **input_shapes
)

save_directory = "tiny_sdxl_neuronx/"
stable_diffusion.save_pretrained(save_directory)

The test above works as expected with Neuron SDK 2.17.1.
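For reference, here is a minimal sketch of reloading the compiled artifacts and running inference once the export succeeds (as it does on 2.17.1); the prompt and output filename are just placeholders:

from optimum.neuron import NeuronStableDiffusionXLPipeline

# Reload the precompiled pipeline from the save directory used above
stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained("tiny_sdxl_neuronx/")

# Run a single generation to sanity-check the compiled UNet / VAE
image = stable_diffusion(prompt="a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")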

@JingyaHuang
Author

I also tried with the PyTorch 2.1.2 setup; it doesn't work either.

@aws-bhegedus
Contributor

Hi Jingya, I'm trying to reproduce the problem. I installed optimum and optimum-neuron with
pip install "optimum[neuronx, diffusers]"
based on https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion.
However, this seems to install v0.0.3, which doesn't include NeuronStableDiffusionXLPipeline. I also tried downgrading to 0.0.2, which has another problem. Is this expected with these versions, and is there a way to get 0.0.21? Thanks.

@JingyaHuang
Author

JingyaHuang commented Apr 3, 2024

The installation with the neuronx extra is what we are going to fix with the 0.0.21 optimum-neuron release. For now, to install the latest optimum-neuron release (0.0.20), could you try with:

pip install optimum==1.18.0
pip install optimum-neuron==0.0.20

Or the 0.0.21 dev version can be installed from source:

pip install git+https://github.com/huggingface/optimum-neuron

Then you can install diffusers with pip install diffusers.
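Once everything is installed, a quick sketch to double-check which versions actually ended up in the environment (plain importlib.metadata, nothing optimum-specific):

from importlib.metadata import version

# Print the pip distribution versions relevant to this issue
for pkg in ("optimum", "optimum-neuron", "diffusers", "neuronx-cc", "torch-neuronx"):
    print(pkg, version(pkg))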

@aws-bhegedus
Contributor

Thanks Jingya, I updated optimum-neuron and diffusers and now I can reproduce the issue.

@aws-bhegedus
Contributor

Hi Jingya, I found that the issue can be prevented if we set inline_weights_to_neff=True when tracing the UNet. Would that be a sufficient workaround for now? I will also look into the root cause but that may take some time.
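In case it helps others hitting this, a hedged sketch of what the workaround could look like from the Optimum Neuron side, assuming from_pretrained forwards inline_weights_to_neff to the tracer:

from optimum.neuron import NeuronStableDiffusionXLPipeline

# Same export as in the reproduction, but with weights inlined into the NEFF
stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained(
    "echarlaix/tiny-random-stable-diffusion-xl",
    export=True,
    inline_weights_to_neff=True,  # workaround: gives up the NEFF/weights separation
    auto_cast="matmul",
    auto_cast_type="bf16",
    batch_size=1,
    height=64,
    width=64,
    num_images_per_prompt=1,
)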

@JingyaHuang
Author

Hi @aws-bhegedus, thanks for investigating it!

Optimum Neuron could force setting inline_weights_to_neff=True for SDXL models for now. But given that our caching mechanism relies on the NEFF/weights separation, we won't be able to cache and reload SDXL models (whose compilation takes a long time).

@aws-bhegedus
Contributor

Thanks Jingya, we will have a fix in a future release to allow enabling the caching.
Does this problem occur only with the tiny random SDXL model? I'm curious about SDXL-base, which I believe is larger and takes longer to compile, so it may be a bigger problem.

@JingyaHuang
Author

Thanks @aws-bhegedus, that will be awesome!

tiny-random-stable-diffusion-xl is a smaller version (fewer layers, random weights) of the SDXL models in the pipeline that we built to shorten testing time. If the compilation fails for the tiny version, it's very unlikely to work for the larger pretrained checkpoint. And since the compilation of all SDXL components takes more than an hour, the lack of caching could be a bit discouraging for first-time users.
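For context, a rough sketch of how such a tiny random UNet can be put together with diffusers for test-only checkpoints (the exact config of echarlaix/tiny-random-stable-diffusion-xl may differ, and a real SDXL test checkpoint also needs the SDXL-specific added time/text embeddings):

from diffusers import UNet2DConditionModel

# Hypothetical tiny config: far fewer channels and layers than SDXL-base,
# randomly initialized, just to keep compilation in tests fast
tiny_unet = UNet2DConditionModel(
    sample_size=8,
    block_out_channels=(32, 64),
    layers_per_block=1,
    cross_attention_dim=32,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
)
tiny_unet.save_pretrained("tiny_unet_for_tests/")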

@JingyaHuang
Author

The issue still exists with the latest Neuron SDK 2.19.1:

***** Compiling unet *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16
...
[NLA001]  Unhandled exception with message: === BIR verification failed ===
Reason: Pattern accesses 48 (> 32) partitions starting at partition 32
Instruction: I-29178
Opcode: GenericCopy
Instruction Source: (|V2<48 x 1> $29178:29178)0:
Output index: 0
Argument AP:
Access Pattern: [[1,48],[1,1],[1,1]]
SymbolicAP
Memory Location: {concatenate.3_set}@SB - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
An error occured when trying to trace unet with the error message: neuronx-cc failed with 70.
The export is failed and unet neuron model won't be stored.
***** Compiling vae_encoder *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16
.
Compiler status PASS
[Compilation Time] 8.55 seconds.
***** Compiling vae_decoder *****
Using Neuron: --auto-cast matmul
Using Neuron: --auto-cast-type bf16
.
Compiler status PASS
[Compilation Time] 8.27 seconds.
[Total compilation Time] 38.63 seconds.
Traceback (most recent call last):
  File "test_non_inline.py", line 11, in <module>
    stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained(
  File "/home/ubuntu/pyvenv/aws_neuron_venv_2.19.1/lib/python3.8/site-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/home/ubuntu/optimum-neuron/optimum/neuron/utils/require_utils.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 714, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/utils/require_utils.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 954, in _export
    return cls._from_pretrained(
  File "/home/ubuntu/optimum-neuron/optimum/neuron/utils/require_utils.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 670, in _from_pretrained
    data_parallel_mode = cls.set_default_dp_mode(configs["unet"])
KeyError: 'unet'

@aws-bhegedus
Contributor

Thanks for testing @JingyaHuang. I was able to reproduce the issue again and we are looking into it.
