[Bug]: 7900XTX with rocm/pytorch 5.6 docker image: Segmentation fault #11712

Closed
1 task done
MHBZHY opened this issue Jul 10, 2023 · 6 comments
Labels
bug-report: Report of a bug, yet to be confirmed
platform:amd: Issues that apply to AMD manufactured cards

Comments


MHBZHY commented Jul 10, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

Running launch.py results in a segmentation fault, with both the conda-provided torch and a manually installed torch in the venv.

Steps to reproduce the problem

  1. Activate the docker container, then run /dockerx/stable-diffusion-webui/launch.py --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test; it ends with a segmentation fault.
  2. Use export PYTORCH_ROCM_ARCH="gfx1100" to manually install torch & torchvision in a new venv (a rough sketch of this step is shown after the list).
  3. Run launch.py with the new venv; the segmentation fault happens again.
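
For reference, step 2 presumably looked roughly like the sketch below. The exact commands are an assumption (the issue only gives the PYTORCH_ROCM_ARCH value), and the /dockerx/pytorch source path is inferred from the backtrace further down:

```bash
# Sketch of the manual torch build, not the reporter's exact commands
source /dockerx/stable-diffusion-webui/venv/bin/activate
export PYTORCH_ROCM_ARCH="gfx1100"   # target the 7900 XTX (RDNA3 / gfx1100)
cd /dockerx/pytorch                  # source tree inferred from the libtorch_hip.so paths in the backtrace
python tools/amd_build/build_amd.py  # hipify the CUDA sources for ROCm
USE_ROCM=1 python setup.py install   # build and install torch into the active venv
# torchvision would be built and installed from its own source tree the same way
```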

What should have happened?

It should launch normally; it ran successfully with the rocm5.5 docker image.

Version or Commit where the problem happens

1.4.0

What Python version are you running on?

Python 3.9.x (below, not recommended)

What platforms do you use to access the UI?

Linux

What device are you running WebUI on?

AMD GPUs (RX 6000 and above)

Cross attention optimization

Automatic

What browsers do you use to access the UI?

Google Chrome

Command Line Arguments

--no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test

List of extensions

None

Console logs

(gdb) run /dockerx/stable-diffusion-webui/launch.py --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test
Starting program: /dockerx/stable-diffusion-webui/venv/bin/python /dockerx/stable-diffusion-webui/launch.py --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 533]
[Detaching after fork from child process 534]
fatal: not a git repository (or any of the parent directories): .git
[Detaching after fork from child process 535]
fatal: not a git repository (or any of the parent directories): .git
Python 3.8.16 (default, Jun 12 2023, 18:09:05) 
[GCC 11.2.0]
Version: ## 1.4.0
Commit hash: <none>
[Detaching after fork from child process 536]
[Detaching after fork from child process 538]
[Detaching after fork from child process 540]
[Detaching after fork from child process 542]
Installing requirements
[Detaching after fork from child process 544]
Launching Web UI with arguments: --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test
[New Thread 0x7ffee73ff700 (LWP 550)]
[New Thread 0x7ffee6bfe700 (LWP 551)]
[New Thread 0x7ffee23fd700 (LWP 552)]
[New Thread 0x7ffedfbfc700 (LWP 553)]
[New Thread 0x7ffedd3fb700 (LWP 554)]
[New Thread 0x7ffedabfa700 (LWP 555)]
[New Thread 0x7ffed83f9700 (LWP 556)]
[New Thread 0x7ffed5bf8700 (LWP 557)]
[New Thread 0x7ffed33f7700 (LWP 558)]
[New Thread 0x7ffed0bf6700 (LWP 559)]
[New Thread 0x7ffece3f5700 (LWP 560)]
[New Thread 0x7ffecbbf4700 (LWP 561)]
[New Thread 0x7ffec93f3700 (LWP 562)]
[New Thread 0x7ffec6bf2700 (LWP 563)]
[New Thread 0x7ffec43f1700 (LWP 564)]
[New Thread 0x7ffebe9af700 (LWP 565)]
[New Thread 0x7ffdbdfff700 (LWP 566)]
[Thread 0x7ffdbdfff700 (LWP 566) exited]
warning: Loadable section ".note.gnu.property" outside of ELF segments
No module 'xformers'. Proceeding without it.
[New Thread 0x7ffda9146700 (LWP 567)]
[New Thread 0x7ffd9c1ff700 (LWP 568)]
[New Thread 0x7ffd9b9fe700 (LWP 569)]
[New Thread 0x7ffd971fd700 (LWP 570)]
[New Thread 0x7ffd949fc700 (LWP 571)]
[New Thread 0x7ffd921fb700 (LWP 572)]
[New Thread 0x7ffd8f9fa700 (LWP 573)]
[New Thread 0x7ffd8f1f9700 (LWP 574)]
[New Thread 0x7ffd8a9f8700 (LWP 575)]
[New Thread 0x7ffd881f7700 (LWP 576)]
[New Thread 0x7ffd859f6700 (LWP 577)]
[New Thread 0x7ffd831f5700 (LWP 578)]
[New Thread 0x7ffd809f4700 (LWP 579)]
[New Thread 0x7ffd7e1f3700 (LWP 580)]
[New Thread 0x7ffd7b9f2700 (LWP 581)]
[New Thread 0x7ffd7b1f1700 (LWP 582)]
[Thread 0x7ffd7b1f1700 (LWP 582) exited]
[Thread 0x7ffd7b9f2700 (LWP 581) exited]
[Thread 0x7ffd7e1f3700 (LWP 580) exited]
[Thread 0x7ffd809f4700 (LWP 579) exited]
[Thread 0x7ffd831f5700 (LWP 578) exited]
[Thread 0x7ffd859f6700 (LWP 577) exited]
[Thread 0x7ffd881f7700 (LWP 576) exited]
[Thread 0x7ffd8a9f8700 (LWP 575) exited]
[Thread 0x7ffd8f1f9700 (LWP 574) exited]
[Thread 0x7ffd8f9fa700 (LWP 573) exited]
[Thread 0x7ffd921fb700 (LWP 572) exited]
[Thread 0x7ffd949fc700 (LWP 571) exited]
[Thread 0x7ffd971fd700 (LWP 570) exited]
[Thread 0x7ffd9b9fe700 (LWP 569) exited]
[Thread 0x7ffd9c1ff700 (LWP 568) exited]
[Thread 0x7ffec43f1700 (LWP 564) exited]
[Thread 0x7ffec6bf2700 (LWP 563) exited]
[Thread 0x7ffec93f3700 (LWP 562) exited]
[Thread 0x7ffecbbf4700 (LWP 561) exited]
[Thread 0x7ffece3f5700 (LWP 560) exited]
[Thread 0x7ffed0bf6700 (LWP 559) exited]
[Thread 0x7ffed33f7700 (LWP 558) exited]
[Thread 0x7ffed5bf8700 (LWP 557) exited]
[Thread 0x7ffed83f9700 (LWP 556) exited]
[Thread 0x7ffedabfa700 (LWP 555) exited]
[Thread 0x7ffedd3fb700 (LWP 554) exited]
[Thread 0x7ffedfbfc700 (LWP 553) exited]
[Thread 0x7ffee23fd700 (LWP 552) exited]
[Thread 0x7ffee6bfe700 (LWP 551) exited]
[Thread 0x7ffee73ff700 (LWP 550) exited]
[Detaching after fork from child process 583]
[Detaching after fork from child process 584]
[New Thread 0x7ffec43f1700 (LWP 585)]
Loading weights [e714ee20aa] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/abyssorangemix2_Hard.safetensors
[New Thread 0x7ffec6bf2700 (LWP 586)]
loading settings: JSONDecodeError
[New Thread 0x7ffec93f3700 (LWP 587)]
[Thread 0x7ffec93f3700 (LWP 587) exited]
Traceback (most recent call last):
  File "/dockerx/stable-diffusion-webui/modules/ui_loadsave.py", line 26, in __init__
    self.ui_settings = self.read_from_file()
  File "/dockerx/stable-diffusion-webui/modules/ui_loadsave.py", line 117, in read_from_file
    return json.load(file)
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

[New Thread 0x7ffec93f3700 (LWP 588)]
preload_extensions_git_metadata for 7 extensions took 0.00s
--Type <RET> for more, q to quit, c to continue without paging--c

Thread 36 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffec6bf2700 (LWP 586)]
0x00007fffdd01c519 in ?? () from /opt/rocm/lib/libamdhip64.so.5
(gdb) bt
#0  0x00007fffdd01c519 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#1  0x00007fffdcfd091c in ?? () from /opt/rocm/lib/libamdhip64.so.5
#2  0x00007fffdd19362e in ?? () from /opt/rocm/lib/libamdhip64.so.5
#3  0x00007fffdd15e186 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#4  0x00007fffdd163abc in hipLaunchKernel () from /opt/rocm/lib/libamdhip64.so.5
#5  0x00007fffdeecdae0 in void at::native::gpu_kernel_impl<at::native::CUDAFunctor_add<c10::Half> >(at::TensorIteratorBase&, at::native::CUDAFunctor_add<c10::Half> const&) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#6  0x00007fffdeea98b1 in at::native::add_kernel(at::TensorIteratorBase&, c10::Scalar const&) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#7  0x00007fffe0231083 in at::(anonymous namespace)::wrapper_CUDA_add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar const&) ()
   from /dockerx/pytorch/torch/lib/libtorch_hip.so
#8  0x00007fffebae07fb in at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) () from /dockerx/pytorch/torch/lib/libtorch_cpu.so
#9  0x00007fffe0076353 in at::native::miopen_convolution_add_bias_(char const*, at::TensorArg const&, at::TensorArg const&) ()
   from /dockerx/pytorch/torch/lib/libtorch_hip.so
#10 0x00007fffe0077715 in at::native::miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#11 0x00007fffe01eb538 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#12 0x00007fffe01eb645 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__miopen_convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) ()
--Type <RET> for more, q to quit, c to continue without paging--
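
As a side note, the JSONDecodeError in the log above is unrelated to the crash; it usually just means the saved UI settings file is empty or truncated. Assuming the default ui-config.json path, it can be checked and reset like this:

```bash
# Hypothetical cleanup for the secondary JSONDecodeError (not related to the segfault)
cd /dockerx/stable-diffusion-webui
python -m json.tool ui-config.json || rm -f ui-config.json  # if it does not parse, delete it and let the webui regenerate it
```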

Additional information

Using this docker image:
https://hub.docker.com/layers/rocm/pytorch/latest-release/images/sha256-9cd6cb2aea005f706799d26d43503e0096b01aa723b19f210dc98d2d466b683b?context=explore
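
The issue does not include the exact docker run command; a typical invocation for this image, using the usual ROCm passthrough flags (the paths and flags here are assumptions), looks roughly like:

```bash
# Typical launch of the rocm/pytorch container with GPU passthrough (sketch)
docker run -it --network=host \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx \
    rocm/pytorch:latest-release
# SYS_PTRACE is what allows running gdb inside the container, as in the log above
```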

MHBZHY added the bug-report label on Jul 10, 2023

MHBZHY commented Jul 10, 2023

Host OS: Fedora Linux, kernel 6.3.8-200.fc38.x86_64

@senchpimy

Same segfault on Arch Linux with a 7900 XT.


jonhoo commented Jul 12, 2023

In my case I found a solution via this comment — if you also have a fancy AMD CPU with a built-in iGPU (like the Ryzen 7000 series), then you need to add export ROCR_VISIBLE_DEVICES=0 to your webui-user.sh. May be a good candidate to add to the AMD on Linux wiki if it's updated (#11754).

I also used the dev branch in case that matters to others, so that I would get ed85578 (which came via #11228). If you already have a venv set up, I think you need to manually run the TORCH_COMMAND from that diff if you need that change as well.
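
For clarity, the fix described above is a one-line addition to webui-user.sh; the device index 0 is an assumption and should be checked against the rocminfo agent order on your system:

```bash
# webui-user.sh: keep only the discrete GPU visible to ROCm so the iGPU cannot be picked up
export ROCR_VISIBLE_DEVICES=0  # index 0 assumed to be the dGPU; verify with rocminfo
```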


MHBZHY commented Jul 13, 2023

> In my case I found a solution via this comment — if you also have a fancy AMD CPU with a built-in iGPU (like the Ryzen 7000 series), then you need to add export ROCR_VISIBLE_DEVICES=0 to your webui-user.sh. May be a good candidate to add to the AMD on Linux wiki if it's updated (#11754).
>
> I also used the dev branch in case that matters to others, so that I would get ed85578 (which came via #11228). If you already have a venv set up, I think you need to manually run the TORCH_COMMAND from that diff if you need that change as well.

Well, mine is a 5800X3D, which has no iGPU. I also tested the rocm5.5 docker image and it works well; only the rocm5.6 docker image has this issue.
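
For anyone unsure whether an iGPU agent is in play, something like this (assuming rocminfo is available inside the container) lists the GPU agents ROCm sees:

```bash
# List ROCm agents; on a 5800X3D + 7900 XTX system only one GPU agent should appear
rocminfo | grep -E "Marketing Name|Device Type"
```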

@YourSandwich

> In my case I found a solution via this comment — if you also have a fancy AMD CPU with a built-in iGPU (like the Ryzen 7000 series), then you need to add export ROCR_VISIBLE_DEVICES=0 to your webui-user.sh. May be a good candidate to add to the AMD on Linux wiki if it's updated (#11754).
>
> I also used the dev branch in case that matters to others, so that I would get ed85578 (which came via #11228). If you already have a venv set up, I think you need to manually run the TORCH_COMMAND from that diff if you need that change as well.

Thanks that worked!!!

catboxanon added the platform:amd label on Aug 7, 2023

MHBZHY commented Aug 22, 2023

Using the newest rocm/pytorch-nightly docker image has fixed this issue. It may have been an AMD-side (ROCm) problem only.
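
For anyone else hitting this, switching to the nightly image is roughly the following; the tag and volume path are assumptions:

```bash
# Pull and start the nightly image instead of the 5.6 release image (sketch)
docker pull rocm/pytorch-nightly:latest
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    -v $HOME/dockerx:/dockerx rocm/pytorch-nightly:latest
```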
