
[Question] Can LLava inference on CPU? #865

Open
wenli135 opened this issue Nov 27, 2023 · 7 comments

Comments

@wenli135

Question

I was trying to run LLaVA inference on CPU, but it complains "Torch not compiled with CUDA enabled". I noticed that cuda() is called when loading the model. If I remove all the cuda() invocations, is it possible to run inference on CPU?

Thanks.

@papasanimohansrinivas

@wenli135 You need to install the CPU build of torch and set the device map to CPU when loading the model.
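
A minimal sketch of what that could look like, assuming the LLaVA repo's load_pretrained_model helper and a CPU-only torch build; the checkpoint path below is only a placeholder, not something prescribed in this thread:

# CPU-only loading sketch (assumes the CPU build of torch is installed).
import torch
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "liuhaotian/llava-v1.5-7b"  # placeholder checkpoint
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    device_map={"": "cpu"},  # keep every module on the CPU
    device="cpu",            # avoids the .cuda() calls that raise the error above
)
model = model.to(torch.float32)  # fp32 is the safest dtype for CPU inference
model.eval()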

@morteza102030

> @wenli135 You need to install the CPU build of torch and set the device map to CPU when loading the model.

Could you give a complete example of how to run LLaVA_13b_4bit_vanilla_colab without a GPU?

@akkimind

akkimind commented Dec 1, 2023

I made some changes to the code to run inference on CPU. The model loads, but while trying to optimize it with model = ipex.optimize(model, dtype=torch.bfloat16) I get this error:
BF16 weight prepack needs the cpu support avx512bw, avx512vl and avx512dq, please set dtype to torch.float or set weights_prepack to False
If I set dtype to torch.float, the model doesn't support it, and if I set weights_prepack to False, the model takes forever to return a response. Is there a specific CPU I should use?
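
For reference, the two fallbacks named in that error message would look roughly like this; this is only a sketch, assuming model is the already-loaded LLaVA model and intel_extension_for_pytorch is importable:

import torch
import intel_extension_for_pytorch as ipex

# Option A: stay in fp32 so no BF16 weight prepacking is required
# (works without AVX512 BF16 support, at the cost of speed and memory).
model = ipex.optimize(model, dtype=torch.float)

# Option B: keep bf16 but disable weight prepacking
# (the comment above reports this being extremely slow on their CPU).
# model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)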

@ratan

ratan commented Jan 9, 2024

Was anyone able to run LLaVA inference on CPU without installing the Intel Extension for PyTorch? Any pointer would be really helpful.

@feng-intel
Contributor

Hi Ratan,
intel xFasterTransformer is a bare-metal Intel CPU solution for LLMs, but it has no LLaVA support yet; you can try it first. llama.cpp also supports CPU. We will enable Intel dGPU/iGPU later.

Could you tell us why you don't want to use the Intel Extension for PyTorch? Thanks.

@drzraf

drzraf commented Sep 1, 2024

I tried some of these paths:

  • llama.cpp: I tried to run convert_hf_to_gguf.py on the HF model (to process LlavaMistralForCausalLM the way it does LlamaForCausalLM), but stumbled upon other problems (Can not map tensor 'model.image_newline').

So, natively, from HF:

  1. With low_cpu_mem_usage = False:

transformers/modeling_utils.py: ValueError: Passing along a device_map requires low_cpu_mem_usage=True

  2. With low_cpu_mem_usage = True:

You can't pass load_in_4bit or load_in_8bit as a kwarg when passing quantization_config argument at the same time
(mentioned/replied to in #1638)

  3. Fixing the above, we get:

transformers/quantizers/quantizer_bnb_4bit.py: ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model

which I can't explain, since load_pretrained_model(load_4bit=True, device='cpu') produces device_map = {'': 'cpu'}, which is quite clear. Still, we can bypass this by adding llm_int8_enable_fp32_cpu_offload=True to the BitsAndBytesConfig (but does that make any sense with load_4bit? see the sketch at the end of this comment). Anyway, it loads (it took only 10 minutes).

  • Now comes intel-extension-for-pytorch, which indeed has a config for this model.

  • Whether ipex.optimize(inplace=True) is passed or not (if not, the memory footprint doubles), we get

RuntimeError: could not create a primitive descriptor for a convolution forward propagation primitive

=> blocked here.

Finally, regarding https://github.com/intel/xFasterTransformer, it's not quite clear whether it replaces or complements intel-extension-for-pytorch [CPU/XPU], or which specific hardware it targets.

If anyone could come up with answers/solutions for at least some of these, that'd be great.
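
For what it's worth, here is a sketch of the HF loading path described above with the workarounds applied: quantization passed only through BitsAndBytesConfig, low_cpu_mem_usage enabled, and fp32 CPU offload allowed. The model class and checkpoint path are placeholders, and whether 4-bit bitsandbytes inference is actually usable on CPU is exactly the open question here:

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from llava.model import LlavaMistralForCausalLM  # the model class mentioned above

model_path = "path/to/llava-mistral-checkpoint"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float32,
    llm_int8_enable_fp32_cpu_offload=True,  # bypasses the "enough GPU RAM" check
)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = LlavaMistralForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,  # instead of the load_in_4bit kwarg
    device_map={"": "cpu"},
    low_cpu_mem_usage=True,          # required whenever a device_map is passed
)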

@feng-intel
Contributor

feng-intel commented Sep 2, 2024

  1. intel-extension-for-pytorch supports LLaVA in fp32, bf16, int8, and int4 on Intel CPU, iGPU, and dGPU. You can report any issue here and someone from Intel will help you.
  2. Ollama supports Intel CPU, iGPU, and dGPU. You need to build it from source. Below are the llama3.1 steps for your reference.
$ git clone https://github.com/ollama/ollama.git
$ source intel/oneapi/setvars.sh

# Install go
$ wget https://go.dev/dl/go1.23.0.linux-amd64.tar.gz
$ mkdir ~/go_1.23.0 && tar zxf go1.23.0.linux-amd64.tar.gz -C ~/go_1.23.0
$ export PATH=$PATH:~/go_1.23.0/go/bin

$ cd ollama
$ go generate ./...
$ go build .    # ollama binary will be generated.

# Optionally stop any ollama service that is already running
$ ps -A |grep ollama
$ netstat -aon |grep 11434
$ sudo service ollama stop

# Start ollama server
$ OLLAMA_INTEL_GPU=1 ./ollama serve   # without OLLAMA_INTEL_GPU=1, it runs on CPU

# Start ollama client to test
# Option 1
$ ./ollama run llama3.1
# Option 2
$ curl --noproxy "localhost" http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt":"Why is the sky blue?"
}'
