
[Question] Can LLava inference on CPU? #865

Open
wenli135 opened this issue Nov 27, 2023 · 7 comments

Comments

@wenli135

Question

I was trying to run LLaVA inference on CPU, but it complains "Torch not compiled with CUDA enabled". I noticed that cuda() is called when loading the model. If I remove all the cuda() invocations, is it possible to run inference on CPU?

Thanks.

@papasanimohansrinivas

@wenli135 You need to install the CPU build of torch and set the device map to CPU when loading the model.
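
A minimal sketch of what that could look like, assuming the LLaVA repo's load_pretrained_model helper and a CPU-only torch build; the checkpoint path below is only a placeholder, not something prescribed in this thread:

# CPU-only loading sketch (assumes the CPU build of torch is installed).
import torch
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "liuhaotian/llava-v1.5-7b"  # placeholder checkpoint
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    device_map={"": "cpu"},  # keep every module on the CPU
    device="cpu",            # avoids the .cuda() calls that raise the error above
)
model = model.to(torch.float32)  # fp32 is the safest dtype for CPU inference
model.eval()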

@morteza102030

> @wenli135 You need to install the CPU build of torch and set the device map to CPU when loading the model.

Could you give a complete example of how to run LLaVA_13b_4bit_vanilla_colab without a GPU?

@akkimind

akkimind commented Dec 1, 2023

I made some changes to the code to run inference on CPU. The model loads, but while trying to optimize it with model = ipex.optimize(model, dtype=torch.bfloat16) I get this error:
BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq, please set dtype to torch.float or set weights_prepack to False
If I set dtype to torch.float, the model doesn't support it, and if I set weights_prepack to False, the model takes forever to return a response. Is there a specific CPU I should use?
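
For reference, the two fallbacks named in that error message would look roughly like this; this is only a sketch, assuming model is the already-loaded LLaVA model and intel_extension_for_pytorch is importable:

import torch
import intel_extension_for_pytorch as ipex

# Option A: stay in fp32 so no BF16 weight prepacking is required
# (works without AVX512 BF16 support, at the cost of speed and memory).
model = ipex.optimize(model, dtype=torch.float)

# Option B: keep bf16 but disable weight prepacking
# (the comment above reports this being extremely slow on their CPU).
# model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)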

@ratan

ratan commented Jan 9, 2024

Was anyone able to run LLaVA inference on CPU without installing the Intel Extension for PyTorch? Any pointer would be really helpful.

@feng-intel
Contributor

Hi Ratan,
intel xFasterTransformer is a bare-metal Intel CPU solution for LLMs, but it has no LLaVA support yet; you can try it first. llama.cpp also supports CPU. We will enable Intel dGPU/iGPU later.

Could you tell us why you don't want to use the Intel Extension for PyTorch? Thanks.

@drzraf

drzraf commented Sep 1, 2024

I tried some of these paths:

  • llama.cpp: I tried to run convert_hf_to_gguf.py on the HF model (to process LlavaMistralForCausalLM the way it does LlamaForCausalLM), but stumbled upon other problems (Can not map tensor 'model.image_newline').

So, natively, from HF:

  1. With low_cpu_mem_usage = False:

transformers/modeling_utils.py: ValueError: Passing along a device_map requires low_cpu_mem_usage=True

  2. With low_cpu_mem_usage = True:

You can't pass load_in_4bit or load_in_8bit as a kwarg when passing quantization_config argument at the same time
(mentioned/replied to in #1638)

  3. Fixing the above, we get:

transformers/quantizers/quantizer_bnb_4bit.py: ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model

which I can't explain, since load_pretrained_model(load_4bit=True, device='cpu') produces device_map = {'': 'cpu'}, which is quite clear. Still, we can bypass this by adding llm_int8_enable_fp32_cpu_offload=True to the BitsAndBytesConfig (but does that make any sense with load_4bit? see the sketch at the end of this comment). Anyway, it loads (it took only 10 minutes).

  • Now comes intel-extension-for-pytorch, which indeed has a config for this model.

  • Whether ipex.optimize(inplace=True) is passed or not (if not, the memory footprint doubles), we get

RuntimeError: could not create a primitive descriptor for a convolution forward propagation primitive

=> blocked here.

Finally, regarding https://github.com/intel/xFasterTransformer, it's not quite clear whether it replaces or complements intel-extension-for-pytorch [CPU/XPU], or which specific hardware it targets.

If anyone could come up with answers/solutions for at least some of these, that'd be great.
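
For what it's worth, here is a sketch of the HF loading path described above with the workarounds applied: quantization passed only through BitsAndBytesConfig, low_cpu_mem_usage enabled, and fp32 CPU offload allowed. The model class and checkpoint path are placeholders, and whether 4-bit bitsandbytes inference is actually usable on CPU is exactly the open question here:

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from llava.model import LlavaMistralForCausalLM  # the model class mentioned above

model_path = "path/to/llava-mistral-checkpoint"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float32,
    llm_int8_enable_fp32_cpu_offload=True,  # bypasses the "enough GPU RAM" check
)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = LlavaMistralForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,  # instead of the load_in_4bit kwarg
    device_map={"": "cpu"},
    low_cpu_mem_usage=True,          # required whenever a device_map is passed
)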

@feng-intel
Contributor

feng-intel commented Sep 2, 2024

  1. intel-extension-for-pytorch supports LLaVA in fp32, bf16, int8, and int4 on Intel CPU, iGPU, and dGPU. You can report any issue here and someone from Intel will help you.
  2. Ollama supports Intel CPU, iGPU, and dGPU. You need to build it from source. Below are the llama3.1 steps for your reference.
$ git clone https://github.com/ollama/ollama.git
$ source intel/oneapi/setvars.sh

# Install go
$ wget https://go.dev/dl/go1.23.0.linux-amd64.tar.gz
$ mkdir ~/go_1.23.0 && tar zxf go1.23.0.linux-amd64.tar.gz -C ~/go_1.23.0
$ export PATH=$PATH:~/go_1.23.0/go/bin

$ cd ollama
$ go generate ./...
$ go build .    # ollama binary will be generated.

# Optionally stop any ollama service that is already running
$ ps -A |grep ollama
$ netstat -aon |grep 11434
$ sudo service ollama stop

# Start ollama server
$ OLLAMA_INTEL_GPU=1 ./ollama serve   # without OLLAMA_INTEL_GPU=1, it runs on CPU

# Start ollama client to test
# Option 1
$ ./ollama run llama3.1
# Option 2
$ curl --noproxy "localhost" http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt":"Why is the sky blue?"
}'
