How to use a GPU to load the model into #409
Replies: 3 comments 1 reply
-
@bizrockman - GGUF models will run on CPU without any modification. If you are on a Mac with Metal (M1/M2/M3), the Metal GPU will be leveraged automatically and you should see relatively fast inference. Have you run the test successfully?
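For context, the zero-configuration path described above is just the standard load-and-complete call. A minimal sketch, reusing the model name from the question below; no GPU flags are required:

from llmware.prompts import Prompt

# No GPU settings needed: GGUF models run on CPU by default, and on
# Apple Silicon (M1/M2/M3) the Metal backend is picked up automatically.
prompter = Prompt().load_model("TheBloke/OpenHermes-2.5-Mistral-7B-GGUF")
print(prompter.completion("What is the meaning of life?"))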
-
+1, I would also like to know how to control model layer offloading when running on a Linux machine with a GPU.
-
+1. To set up GPU VRAM loading in LLMWare, first locate the GPU offloading setting, which appears to live in the Hugging Face model class. That setting (e.g., GPU_Offload) dictates how many model layers are loaded into VRAM. Apply the parameter when loading the model, adjust it based on performance needs, and test the setup by monitoring VRAM usage during inference to confirm the layers are actually being loaded. Tune as necessary for best performance.
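For illustration only: assuming LLMWare's GGUF support wraps a llama.cpp backend (which the automatic Metal behavior mentioned above suggests), the GPU_Offload value corresponds to something like llama-cpp-python's n_gpu_layers. The sketch below uses llama-cpp-python directly to show the effect; the model path and the layer count of 50 are placeholder assumptions, not documented LLMWare settings.

from llama_cpp import Llama

llm = Llama(
    model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # local GGUF file (placeholder path)
    n_gpu_layers=50,  # number of transformer layers to offload into VRAM; -1 offloads all
    n_ctx=2048,       # context window
)

# Watch VRAM fill while this runs, e.g. with `nvidia-smi` in a second terminal,
# to confirm the layers were actually offloaded.
output = llm("What is the meaning of life?", max_tokens=64)
print(output["choices"][0]["text"])

Raising the layer count pushes more of the model into VRAM (faster inference, more VRAM used); lowering it keeps more layers in system RAM.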
-
Hi,
I am trying to find the place where I can tell LLMWare to load models into GPU VRAM.
The only hint I found was in the model class for Hugging Face models, which has a fixed GPU_Offload of 50 layers, but I have no clue how this is meant to be used.
Here is my Hello World with llmware:

from llmware.prompts import Prompt

prompter = Prompt().load_model("TheBloke/OpenHermes-2.5-Mistral-7B-GGUF")
response = prompter.completion("What is the meaning of life?")
print(response)

What do I need to do to load the model into GPU VRAM, or to decide whether a model should be loaded into RAM or VRAM?
Thx!