[Bug]: 7900XTX with rocm/pytorch 5.6 docker image: Segmentation fault #11712

Closed
1 task done
MHBZHY opened this issue Jul 10, 2023 · 6 comments
Labels
bug-report: Report of a bug, yet to be confirmed
platform:amd: Issues that apply to AMD manufactured cards

Comments


MHBZHY commented Jul 10, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

Running launch.py results in a segmentation fault, with both the conda-provided torch and a manually installed torch in the venv.

Steps to reproduce the problem

  1. Activate the docker container, then run /dockerx/stable-diffusion-webui/launch.py --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test; it ends with a segmentation fault.
  2. Use export PYTORCH_ROCM_ARCH="gfx1100" to manually install torch & torchvision in a new venv (a rough sketch of this step is shown after the list).
  3. Run launch.py with the new venv; the segmentation fault happens again.
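
For reference, step 2 presumably looked roughly like the sketch below. The exact commands are an assumption (the issue only gives the PYTORCH_ROCM_ARCH value), and the /dockerx/pytorch source path is inferred from the backtrace further down:

```bash
# Sketch of the manual torch build, not the reporter's exact commands
source /dockerx/stable-diffusion-webui/venv/bin/activate
export PYTORCH_ROCM_ARCH="gfx1100"   # target the 7900 XTX (RDNA3 / gfx1100)
cd /dockerx/pytorch                  # source tree inferred from the libtorch_hip.so paths in the backtrace
python tools/amd_build/build_amd.py  # hipify the CUDA sources for ROCm
USE_ROCM=1 python setup.py install   # build and install torch into the active venv
# torchvision would be built and installed from its own source tree the same way
```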

What should have happened?

It should launch normally; it ran successfully with the rocm5.5 docker image.

Version or Commit where the problem happens

1.4.0

What Python version are you running on?

Python 3.9.x (below, not recommended)

What platforms do you use to access the UI?

Linux

What device are you running WebUI on?

AMD GPUs (RX 6000 and above)

Cross attention optimization

Automatic

What browsers do you use to access the UI?

Google Chrome

Command Line Arguments

--no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test

List of extensions

None

Console logs

(gdb) run /dockerx/stable-diffusion-webui/launch.py --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test
Starting program: /dockerx/stable-diffusion-webui/venv/bin/python /dockerx/stable-diffusion-webui/launch.py --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 533]
[Detaching after fork from child process 534]
fatal: not a git repository (or any of the parent directories): .git
[Detaching after fork from child process 535]
fatal: not a git repository (or any of the parent directories): .git
Python 3.8.16 (default, Jun 12 2023, 18:09:05) 
[GCC 11.2.0]
Version: ## 1.4.0
Commit hash: <none>
[Detaching after fork from child process 536]
[Detaching after fork from child process 538]
[Detaching after fork from child process 540]
[Detaching after fork from child process 542]
Installing requirements
[Detaching after fork from child process 544]
Launching Web UI with arguments: --no-half-vae --listen --enable-insecure-extension-access --skip-torch-cuda-test
[New Thread 0x7ffee73ff700 (LWP 550)]
[New Thread 0x7ffee6bfe700 (LWP 551)]
[New Thread 0x7ffee23fd700 (LWP 552)]
[New Thread 0x7ffedfbfc700 (LWP 553)]
[New Thread 0x7ffedd3fb700 (LWP 554)]
[New Thread 0x7ffedabfa700 (LWP 555)]
[New Thread 0x7ffed83f9700 (LWP 556)]
[New Thread 0x7ffed5bf8700 (LWP 557)]
[New Thread 0x7ffed33f7700 (LWP 558)]
[New Thread 0x7ffed0bf6700 (LWP 559)]
[New Thread 0x7ffece3f5700 (LWP 560)]
[New Thread 0x7ffecbbf4700 (LWP 561)]
[New Thread 0x7ffec93f3700 (LWP 562)]
[New Thread 0x7ffec6bf2700 (LWP 563)]
[New Thread 0x7ffec43f1700 (LWP 564)]
[New Thread 0x7ffebe9af700 (LWP 565)]
[New Thread 0x7ffdbdfff700 (LWP 566)]
[Thread 0x7ffdbdfff700 (LWP 566) exited]
warning: Loadable section ".note.gnu.property" outside of ELF segments
No module 'xformers'. Proceeding without it.
[New Thread 0x7ffda9146700 (LWP 567)]
[New Thread 0x7ffd9c1ff700 (LWP 568)]
[New Thread 0x7ffd9b9fe700 (LWP 569)]
[New Thread 0x7ffd971fd700 (LWP 570)]
[New Thread 0x7ffd949fc700 (LWP 571)]
[New Thread 0x7ffd921fb700 (LWP 572)]
[New Thread 0x7ffd8f9fa700 (LWP 573)]
[New Thread 0x7ffd8f1f9700 (LWP 574)]
[New Thread 0x7ffd8a9f8700 (LWP 575)]
[New Thread 0x7ffd881f7700 (LWP 576)]
[New Thread 0x7ffd859f6700 (LWP 577)]
[New Thread 0x7ffd831f5700 (LWP 578)]
[New Thread 0x7ffd809f4700 (LWP 579)]
[New Thread 0x7ffd7e1f3700 (LWP 580)]
[New Thread 0x7ffd7b9f2700 (LWP 581)]
[New Thread 0x7ffd7b1f1700 (LWP 582)]
[Thread 0x7ffd7b1f1700 (LWP 582) exited]
[Thread 0x7ffd7b9f2700 (LWP 581) exited]
[Thread 0x7ffd7e1f3700 (LWP 580) exited]
[Thread 0x7ffd809f4700 (LWP 579) exited]
[Thread 0x7ffd831f5700 (LWP 578) exited]
[Thread 0x7ffd859f6700 (LWP 577) exited]
[Thread 0x7ffd881f7700 (LWP 576) exited]
[Thread 0x7ffd8a9f8700 (LWP 575) exited]
[Thread 0x7ffd8f1f9700 (LWP 574) exited]
[Thread 0x7ffd8f9fa700 (LWP 573) exited]
[Thread 0x7ffd921fb700 (LWP 572) exited]
[Thread 0x7ffd949fc700 (LWP 571) exited]
[Thread 0x7ffd971fd700 (LWP 570) exited]
[Thread 0x7ffd9b9fe700 (LWP 569) exited]
[Thread 0x7ffd9c1ff700 (LWP 568) exited]
[Thread 0x7ffec43f1700 (LWP 564) exited]
[Thread 0x7ffec6bf2700 (LWP 563) exited]
[Thread 0x7ffec93f3700 (LWP 562) exited]
[Thread 0x7ffecbbf4700 (LWP 561) exited]
[Thread 0x7ffece3f5700 (LWP 560) exited]
[Thread 0x7ffed0bf6700 (LWP 559) exited]
[Thread 0x7ffed33f7700 (LWP 558) exited]
[Thread 0x7ffed5bf8700 (LWP 557) exited]
[Thread 0x7ffed83f9700 (LWP 556) exited]
[Thread 0x7ffedabfa700 (LWP 555) exited]
[Thread 0x7ffedd3fb700 (LWP 554) exited]
[Thread 0x7ffedfbfc700 (LWP 553) exited]
[Thread 0x7ffee23fd700 (LWP 552) exited]
[Thread 0x7ffee6bfe700 (LWP 551) exited]
[Thread 0x7ffee73ff700 (LWP 550) exited]
[Detaching after fork from child process 583]
[Detaching after fork from child process 584]
[New Thread 0x7ffec43f1700 (LWP 585)]
Loading weights [e714ee20aa] from /dockerx/stable-diffusion-webui/models/Stable-diffusion/abyssorangemix2_Hard.safetensors
[New Thread 0x7ffec6bf2700 (LWP 586)]
loading settings: JSONDecodeError
[New Thread 0x7ffec93f3700 (LWP 587)]
[Thread 0x7ffec93f3700 (LWP 587) exited]
Traceback (most recent call last):
  File "/dockerx/stable-diffusion-webui/modules/ui_loadsave.py", line 26, in __init__
    self.ui_settings = self.read_from_file()
  File "/dockerx/stable-diffusion-webui/modules/ui_loadsave.py", line 117, in read_from_file
    return json.load(file)
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/envs/py_3.8/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

[New Thread 0x7ffec93f3700 (LWP 588)]
preload_extensions_git_metadata for 7 extensions took 0.00s
--Type <RET> for more, q to quit, c to continue without paging--c

Thread 36 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffec6bf2700 (LWP 586)]
0x00007fffdd01c519 in ?? () from /opt/rocm/lib/libamdhip64.so.5
(gdb) bt
#0  0x00007fffdd01c519 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#1  0x00007fffdcfd091c in ?? () from /opt/rocm/lib/libamdhip64.so.5
#2  0x00007fffdd19362e in ?? () from /opt/rocm/lib/libamdhip64.so.5
#3  0x00007fffdd15e186 in ?? () from /opt/rocm/lib/libamdhip64.so.5
#4  0x00007fffdd163abc in hipLaunchKernel () from /opt/rocm/lib/libamdhip64.so.5
#5  0x00007fffdeecdae0 in void at::native::gpu_kernel_impl<at::native::CUDAFunctor_add<c10::Half> >(at::TensorIteratorBase&, at::native::CUDAFunctor_add<c10::Half> const&) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#6  0x00007fffdeea98b1 in at::native::add_kernel(at::TensorIteratorBase&, c10::Scalar const&) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#7  0x00007fffe0231083 in at::(anonymous namespace)::wrapper_CUDA_add__Tensor(at::Tensor&, at::Tensor const&, c10::Scalar const&) ()
   from /dockerx/pytorch/torch/lib/libtorch_hip.so
#8  0x00007fffebae07fb in at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) () from /dockerx/pytorch/torch/lib/libtorch_cpu.so
#9  0x00007fffe0076353 in at::native::miopen_convolution_add_bias_(char const*, at::TensorArg const&, at::TensorArg const&) ()
   from /dockerx/pytorch/torch/lib/libtorch_hip.so
#10 0x00007fffe0077715 in at::native::miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#11 0x00007fffe01eb538 in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__miopen_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) () from /dockerx/pytorch/torch/lib/libtorch_hip.so
#12 0x00007fffe01eb645 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__miopen_convolution>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) ()
--Type <RET> for more, q to quit, c to continue without paging--
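
As a side note, the JSONDecodeError in the log above is unrelated to the crash; it usually just means the saved UI settings file is empty or truncated. Assuming the default ui-config.json path, it can be checked and reset like this:

```bash
# Hypothetical cleanup for the secondary JSONDecodeError (not related to the segfault)
cd /dockerx/stable-diffusion-webui
python -m json.tool ui-config.json || rm -f ui-config.json  # if it does not parse, delete it and let the webui regenerate it
```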

Additional information

Using this docker image:
https://hub.docker.com/layers/rocm/pytorch/latest-release/images/sha256-9cd6cb2aea005f706799d26d43503e0096b01aa723b19f210dc98d2d466b683b?context=explore
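
The issue does not include the exact docker run command; a typical invocation for this image, using the usual ROCm passthrough flags (the paths and flags here are assumptions), looks roughly like:

```bash
# Typical launch of the rocm/pytorch container with GPU passthrough (sketch)
docker run -it --network=host \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx \
    rocm/pytorch:latest-release
# SYS_PTRACE is what allows running gdb inside the container, as in the log above
```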

MHBZHY added the bug-report label on Jul 10, 2023

MHBZHY commented Jul 10, 2023

Host OS: Fedora Linux, kernel 6.3.8-200.fc38.x86_64

@senchpimy

Same segfault on Arch Linux with a 7900 XT.


jonhoo commented Jul 12, 2023

In my case I found a solution via this comment — if you also have a fancy AMD CPU with a built-in iGPU (like the Ryzen 7000 series), then you need to add export ROCR_VISIBLE_DEVICES=0 to your webui-user.sh. May be a good candidate to add to the AMD on Linux wiki if it's updated (#11754).

I also used the dev branch in case that matters to others, so that I would get ed85578 (which came via #11228). If you already have a venv set up, I think you need to manually run the TORCH_COMMAND from that diff if you need that change as well.
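
For clarity, the fix described above is a one-line addition to webui-user.sh; the device index 0 is an assumption and should be checked against the rocminfo agent order on your system:

```bash
# webui-user.sh: keep only the discrete GPU visible to ROCm so the iGPU cannot be picked up
export ROCR_VISIBLE_DEVICES=0  # index 0 assumed to be the dGPU; verify with rocminfo
```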


MHBZHY commented Jul 13, 2023

> In my case I found a solution via this comment — if you also have a fancy AMD CPU with a built-in iGPU (like the Ryzen 7000 series), then you need to add export ROCR_VISIBLE_DEVICES=0 to your webui-user.sh. May be a good candidate to add to the AMD on Linux wiki if it's updated (#11754).
>
> I also used the dev branch in case that matters to others, so that I would get ed85578 (which came via #11228). If you already have a venv set up, I think you need to manually run the TORCH_COMMAND from that diff if you need that change as well.

Well, mine is a 5800X3D, which has no iGPU. I also tested the rocm5.5 docker image and it works well; only the rocm5.6 docker image has this issue.
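
For anyone unsure whether an iGPU agent is in play, something like this (assuming rocminfo is available inside the container) lists the GPU agents ROCm sees:

```bash
# List ROCm agents; on a 5800X3D + 7900 XTX system only one GPU agent should appear
rocminfo | grep -E "Marketing Name|Device Type"
```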

@YourSandwich

> In my case I found a solution via this comment — if you also have a fancy AMD CPU with a built-in iGPU (like the Ryzen 7000 series), then you need to add export ROCR_VISIBLE_DEVICES=0 to your webui-user.sh. May be a good candidate to add to the AMD on Linux wiki if it's updated (#11754).
>
> I also used the dev branch in case that matters to others, so that I would get ed85578 (which came via #11228). If you already have a venv set up, I think you need to manually run the TORCH_COMMAND from that diff if you need that change as well.

Thanks that worked!!!

catboxanon added the platform:amd label on Aug 7, 2023

MHBZHY commented Aug 22, 2023

Using the newest rocm/pytorch-nightly docker image has fixed this issue. It may have been an AMD-side (ROCm) problem only.
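
For anyone else hitting this, switching to the nightly image is roughly the following; the tag and volume path are assumptions:

```bash
# Pull and start the nightly image instead of the 5.6 release image (sketch)
docker pull rocm/pytorch-nightly:latest
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
    -v $HOME/dockerx:/dockerx rocm/pytorch-nightly:latest
```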
