
Add support gpu for metal m1 (BUG AND FEATURE) #121

Closed
alexdieudonne opened this issue Feb 24, 2023 · 16 comments

Comments

@alexdieudonne

Hello everyone, I really like your work. It's amazing to be able to use our own computers to train and run our own AI instead of ChatGPT. I have a request: can you please help me get this running on my Mac M1 using Metal? It crashes every time I try to use the GPU to run the model, and it tells me that torch doesn't work with CUDA, which is true: torch.cuda.is_available() == False.
I really want to make it work, please help me.

@oobabooga
Owner

Can you try installing PyTorch from the website? There is an option for Mac there.

https://pytorch.org/get-started/locally/
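
For reference, the Mac selection on that page (the "Default" compute platform) gives a plain pip command, roughly:

pip3 install torch torchvision torchaudio

The Apple Silicon builds ship with MPS support included; there is no CUDA option for macOS.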

@alexdieudonne
Author

Hello, thank you for your answer.
But it says that Mac doesn't support CUDA. I can still use MPS acceleration, but then I would have to replace all the CUDA usage within the project, and I'm not really used to Python AI code, so it's hard for me to figure out what to change to make it work.

[screenshot]

@oobabooga added and then removed the "help wanted" label on Feb 25, 2023
@oobabooga
Owner

I don't have a Mac computer to experiment with, but maybe someone else can help with this.

@Spencer-Dawson
Contributor

I also don't have an M1 Mac to try this on, but I think you need to use the "Default" compute platform, not CUDA or ROCm.
[screenshot: PyTorch install selector]
Going off of some random blog post, I think MPS stands for Metal Performance Shaders, which I assume is what you're asking for; a quick check is below.
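
A minimal sketch of that check, assuming a reasonably recent PyTorch:

import torch
print(torch.backends.mps.is_built())      # True if this PyTorch build was compiled with MPS support
print(torch.backends.mps.is_available())  # True if the Metal backend can actually be used on this machine

If both print True, the device should be usable as torch.device("mps").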

@Spencer-Dawson
Contributor

So basically, the install instructions would probably be:

conda create -n textgen
conda activate textgen
conda install pytorch torchvision torchaudio -c pytorch
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

If you already have the wrong version of PyTorch installed, uninstall it before running the PyTorch install command for Mac.

@GundamWing

I've actually been playing around with this quite a bit on a previous iteration of oobabooga. There are a few speed bumps with getting it running at the moment.
First, some of oobabooga's code will need to be revamped. I don't think it's too major, but there needs to be some abstraction from CUDA to MPS to talk to the Metal shader level: some code to determine whether the Metal layer is available, and then wiring it up (a rough sketch of that is at the end of this comment).
The bigger speed bump is PyTorch itself. Their MPS implementation isn't complete. Even once I got it wired up properly, there are some errors for unsupported methods and mismatched tensors. PyTorch is working on a lot of those so it may just be a matter of time.
The third speed bump is RAM. M1/M2 Macs have shared RAM between the CPU and GPU so the basic 8GB and 16GB MacBooks/Air/iMacs will never be able to really do much. Once you get up to the 32GB+ machines more will likely be possible.
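
To give a sense of what that wiring looks like, here is a minimal sketch (not the project's actual code; the model name is just an example) of picking a device and moving a Hugging Face model onto it:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick the best available backend: CUDA on NVIDIA, MPS on Apple Silicon, otherwise CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model_name = "PygmalionAI/pygmalion-1.3b"  # example model, any causal LM loads the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

inputs = tokenizer("Hello there.", return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))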

@dalnk

dalnk commented Mar 6, 2023

I have a 32GB M2 Pro and would love to get this working. So far it only runs in CPU mode and performs fine: 4chanGPT works very fast, LLaMA runs quite slow, and 13B doesn't really load (I think I would need even more RAM).

Should I try telling it to use the MPS backend and see which parts aren't implemented?

@GundamWing

> I have a 32GB M2 Pro and would love to get this working. So far it only runs in CPU mode and performs fine: 4chanGPT works very fast, LLaMA runs quite slow, and 13B doesn't really load (I think I would need even more RAM).
> Should I try telling it to use the MPS backend and see which parts aren't implemented?

At this point, I think it's best to wait. There's work being done by PyTorch, and without some of those underlying basic functions, it will be a game of Whack-a-Mole trying to get this to work. But if you want to give it a go, I'd love to know what you alter in the code to make progress.

@GundamWing

Well, I got it hooked up using the MPS backend. There are more than a few caveats, and I'm not sure why some things are happening. It is a memory pig on even the 1.3B Pyg. I don't know if that's because of how it's allocating memory between CPU and GPU, but it definitely is using the GPU to generate responses. If anyone is interested, I can post the code and my settings. I'm working on a 32GB M2 Mac mini. If you're on anything less than 32GB of RAM, this will likely just not work at all. I'd love to try it on a high-end Studio with 128GB of RAM to see how it fares.

@dalnk

dalnk commented Mar 7, 2023

Yeah, I'll consider an even beefier M3 with 128GB of RAM, but for now the whole machine learning community seems kneecapped by the PyTorch MPS implementation being half done (especially compared to CUDA).

If you drop the code you have so far here or on your GitHub page, that would be massively helpful in getting started!

@GundamWing

I've attached the files I made alterations to, based off yesterday's git clone of the oob repo. You should be able to just replace those. You'll need the PyTorch nightly build rather than stable. I'd also suggest starting with the 2.7B Pyg model and seeing how you do. It may generate gibberish or it might generate an amazing few blocks of conversation, depending on what magic happens in the GPU/RAM.

Make sure your top_k is set to 15 or less; for some reason, MPS can't handle anything over 15. Once you have the updated PyTorch and have copied the files to the right spots, run server.py with a --mps flag and it should start using both CPU and GPU. It causes a segfault on Intel Macs, but runs on M2. (A rough sketch of what the change looks like is below the attachment.)
AppleMPSExeperiment.zip
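
For anyone curious what the change amounts to before opening the zip: the rough shape of it (a reconstructed sketch, not the exact contents of the attached files) is an extra command-line flag plus a device move when the model is loaded:

import argparse
import torch
from transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument("--mps", action="store_true", help="run the model on Apple's Metal (MPS) backend")
args = parser.parse_args()

def load_model(path):
    model = AutoModelForCausalLM.from_pretrained(path)
    if args.mps and torch.backends.mps.is_available():
        # Requires a PyTorch nightly with MPS support; input tensors must be moved to the same device.
        model = model.to("mps")
    return model

Everything that builds the input tensors then has to target the same device, which is where the remaining CUDA-specific calls in the repo get in the way.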

@dalnk

dalnk commented Mar 8, 2023

I got the pygmalion-2.7B model running! Thank you, one step closer to the dream :)
[screenshot]

No surprise, it was very upset at trying to actually run the LLaMA model, but I'll slowly work through the errors it's hitting. It seems to be the wrong format of model for MPS; maybe I need to load it in 8-bit mode?

It at least loaded the model, with no additional changes.

Loading LLaMA-7B...
Loading checkpoint shards: 100%|███████████████████████████| 33/33 [00:11<00:00,  3.00it/s]
Loaded the model in 29.32 seconds.
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
  0%|                                                               | 0/26 [00:00<?, ?it/s]/Users/dalnk/miniconda3/lib/python3.10/site-packages/transformers/generation/utils.py:686: UserWarning: The operator 'aten::repeat_interleave.self_int' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:11.)
  input_ids = input_ids.repeat_interleave(expand_size, dim=0)
libc++abi: terminating with uncaught exception of type c10::TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
Exception raised from getMPSScalarType at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:115 (most recent call first):
frame #0: at::native::mps::getMPSScalarType(c10::ScalarType) + 180 (0x295bea4e8 in libtorch_cpu.dylib)
frame #1: invocation function for block in at::native::mps::createViewGraph(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, bool) + 172 (0x295c9d6d0 in libtorch_cpu.dylib)
frame #2: invocation function for block in at::native::mps::MPSGraphCache::CreateCachedGraph(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, at::native::mps::MPSCachedGraph* () block_pointer, void*) + 216 (0x295bfd610 in libtorch_cpu.dylib)
frame #3: _dispatch_client_callout + 20 (0x1a14a2504 in libdispatch.dylib)
frame #4: _dispatch_lane_barrier_sync_invoke_and_complete + 56 (0x1a14b1a9c in libdispatch.dylib)
frame #5: at::native::mps::MPSGraphCache::CreateCachedGraph(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, at::native::mps::MPSCachedGraph* () block_pointer, void*) + 168 (0x295bedfcc in libtorch_cpu.dylib)
frame #6: at::native::mps::createViewGraph(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, bool) + 560 (0x295c9cfd8 in libtorch_cpu.dylib)
frame #7: at::native::as_strided_tensorimpl_mps(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, c10::optional<long long>) + 348 (0x295c9d178 in libtorch_cpu.dylib)
frame #8: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>), &(at::(anonymous namespace)::(anonymous namespace)::wrapper__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>))>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt> > >, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) + 168 (0x2934fa9d4 in libtorch_cpu.dylib)
frame #9: at::Tensor c10::callUnboxedKernelFunction<at::Tensor, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt> >(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>&&, c10::ArrayRef<c10::SymInt>&&, c10::optional<c10::SymInt>&&) + 88 (0x2924b639c in libtorch_cpu.dylib)
frame #10: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) + 384 (0x2923d4344 in libtorch_cpu.dylib)
frame #11: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, c10::optional<long long>) const + 144 (0x2917d61e0 in libtorch_cpu.dylib)
frame #12: at::native::slice(at::Tensor const&, long long, c10::optional<long long>, c10::optional<long long>, long long) + 744 (0x291fe1f60 in libtorch_cpu.dylib)
frame #13: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_Tensor_slice(at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt))>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt> >, at::Tensor (at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) + 136 (0x292d3aba0 in libtorch_cpu.dylib)
frame #14: at::Tensor c10::callUnboxedKernelFunction<at::Tensor, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long long&&, c10::optional<c10::SymInt>&&, c10::optional<c10::SymInt>&&, c10::SymInt&&) + 116 (0x292a70d78 in libtorch_cpu.dylib)
frame #15: at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt)> const&, c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) const + 196 (0x292a71f30 in libtorch_cpu.dylib)
frame #16: at::_ops::slice_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) + 264 (0x2929ded58 in libtorch_cpu.dylib)
frame #17: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt), &(torch::ADInplaceOrView::(anonymous namespace)::slice_Tensor(c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt))>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) + 276 (0x294b892c4 in libtorch_cpu.dylib)
frame #18: at::Tensor c10::callUnboxedKernelFunction<at::Tensor, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long long&&, c10::optional<c10::SymInt>&&, c10::optional<c10::SymInt>&&, c10::SymInt&&) + 116 (0x292a70d78 in libtorch_cpu.dylib)
frame #19: at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt)> const&, c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) const + 196 (0x292a71f30 in libtorch_cpu.dylib)
frame #20: at::_ops::slice_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) + 264 (0x2929ded58 in libtorch_cpu.dylib)
frame #21: at::redispatch::slice_symint(c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) + 148 (0x2945eda60 in libtorch_cpu.dylib)
frame #22: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt), &(torch::autograd::VariableType::(anonymous namespace)::slice_Tensor(c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt))>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) + 1328 (0x2945eb520 in libtorch_cpu.dylib)
frame #23: at::Tensor c10::callUnboxedKernelFunction<at::Tensor, at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long long&&, c10::optional<c10::SymInt>&&, c10::optional<c10::SymInt>&&, c10::SymInt&&) + 116 (0x292a70d78 in libtorch_cpu.dylib)
frame #24: at::_ops::slice_Tensor::call(at::Tensor const&, long long, c10::optional<c10::SymInt>, c10::optional<c10::SymInt>, c10::SymInt) + 472 (0x2929dd698 in libtorch_cpu.dylib)
frame #25: at::Tensor::slice(long long, c10::optional<long long>, c10::optional<long long>, long long) const + 112 (0x13be831bc in libtorch_python.dylib)
frame #26: at::indexing::get_item(at::Tensor const&, c10::ArrayRef<at::indexing::TensorIndex> const&, bool) + 352 (0x13be7d42c in libtorch_python.dylib)
frame #27: torch::autograd::THPVariable_getitem(_object*, _object*) + 2080 (0x13be7beb4 in libtorch_python.dylib)
frame #28: _PyEval_EvalFrameDefault + 6364 (0x10494a564 in python3.10)
frame #29: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #30: method_vectorcall + 124 (0x10484083c in python3.10)
frame #31: _PyEval_EvalFrameDefault + 33608 (0x104950fd0 in python3.10)
frame #32: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #33: _PyObject_Call_Prepend + 312 (0x10483dc0c in python3.10)
frame #34: slot_tp_call + 232 (0x1048c68d0 in python3.10)
frame #35: call_function + 656 (0x104971e68 in python3.10)
frame #36: _PyEval_EvalFrameDefault + 8332 (0x10494ad14 in python3.10)
frame #37: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #38: method_vectorcall + 124 (0x10484083c in python3.10)
frame #39: _PyEval_EvalFrameDefault + 33608 (0x104950fd0 in python3.10)
frame #40: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #41: _PyObject_Call_Prepend + 312 (0x10483dc0c in python3.10)
frame #42: slot_tp_call + 232 (0x1048c68d0 in python3.10)
frame #43: call_function + 656 (0x104971e68 in python3.10)
frame #44: _PyEval_EvalFrameDefault + 8332 (0x10494ad14 in python3.10)
frame #45: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #46: method_vectorcall + 124 (0x10484083c in python3.10)
frame #47: _PyEval_EvalFrameDefault + 33608 (0x104950fd0 in python3.10)
frame #48: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #49: _PyObject_Call_Prepend + 312 (0x10483dc0c in python3.10)
frame #50: slot_tp_call + 232 (0x1048c68d0 in python3.10)
frame #51: call_function + 656 (0x104971e68 in python3.10)
frame #52: _PyEval_EvalFrameDefault + 8332 (0x10494ad14 in python3.10)
frame #53: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #54: method_vectorcall + 124 (0x10484083c in python3.10)
frame #55: _PyEval_EvalFrameDefault + 33608 (0x104950fd0 in python3.10)
frame #56: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #57: _PyObject_Call_Prepend + 312 (0x10483dc0c in python3.10)
frame #58: slot_tp_call + 232 (0x1048c68d0 in python3.10)
frame #59: _PyEval_EvalFrameDefault + 33808 (0x104951098 in python3.10)
frame #60: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)
frame #61: method_vectorcall + 124 (0x10484083c in python3.10)
frame #62: _PyEval_EvalFrameDefault + 33608 (0x104950fd0 in python3.10)
frame #63: _PyFunction_Vectorcall + 548 (0x10483c178 in python3.10)

zsh: abort      python server.py --chat --listen --mps

@dalnk

dalnk commented Mar 10, 2023

https://github.com/ggerganov/llama.cpp: it looks like someone figured out LLaMA 7B on Apple Silicon, in case anyone here is interested!
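
The basic flow over there is roughly as follows (the model conversion and quantization steps are in their README):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# after converting and quantizing the 7B weights as described in their README:
./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello, my name is" -n 128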

@GundamWing

Looks like they managed to hook things up properly. You'll need the macOS 13.3 beta to make it work, but I'm getting great results on my machine now. #393

@appleguy

appleguy commented Apr 12, 2023

@GundamWing Thank you for sharing your work! The default repository, even today, does not seem to support --mps as an argument to server.py — is that expected? Would it make sense to merge in the changes in your .zip via a PR to the main repository?

Also, what checkout hash did you use for the changes? Not surprisingly, dropping them in as-is to the current repo does not work. A .patch file might be a good approach, especially if you can pull / rebase and then upload the output of git diff.

I have all of this stuff running on a 3090, but am quite keen to try out the MPS version on my M1 Max 64GB to load larger models as the memory usage stuff is figured out (and/or contribute to bug reports for the platform).

@dalnk

dalnk commented Apr 12, 2023

AFAIK the zip contains parts of the MPS code working for Pygmalion.

It has not been tested with LLaMA and other models. I haven't checked since my initial reply.
