How to use a GPU to load the model into #409
Replies: 3 comments 1 reply
-
@bizrockman - GGUF models will run on CPU without any modification. If you are on a Mac with Metal (M1/M2/M3), the Metal GPU will be leveraged automatically and you should see relatively fast inference. Have you run the test successfully?
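For context, the zero-configuration path described above is just the standard load-and-complete call. A minimal sketch, reusing the model name from the question below; no GPU flags are required:

from llmware.prompts import Prompt

# No GPU settings needed: GGUF models run on CPU by default, and on
# Apple Silicon (M1/M2/M3) the Metal backend is picked up automatically.
prompter = Prompt().load_model("TheBloke/OpenHermes-2.5-Mistral-7B-GGUF")
print(prompter.completion("What is the meaning of life?"))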
-
+1, I would also like to know how to control model layer offloading when running on a Linux machine with a GPU.
-
+1. To set up GPU VRAM loading in LLMWare, first locate the GPU offloading setting, which appears to live in the Hugging Face model class. That setting (e.g., GPU_Offload) dictates how many model layers are loaded into VRAM. Apply the parameter when loading the model, adjust it based on performance needs, and test the setup by monitoring VRAM usage during inference to confirm the layers are actually being loaded. Tune as necessary for best performance.
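For illustration only: assuming LLMWare's GGUF support wraps a llama.cpp backend (which the automatic Metal behavior mentioned above suggests), the GPU_Offload value corresponds to something like llama-cpp-python's n_gpu_layers. The sketch below uses llama-cpp-python directly to show the effect; the model path and the layer count of 50 are placeholder assumptions, not documented LLMWare settings.

from llama_cpp import Llama

llm = Llama(
    model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # local GGUF file (placeholder path)
    n_gpu_layers=50,  # number of transformer layers to offload into VRAM; -1 offloads all
    n_ctx=2048,       # context window
)

# Watch VRAM fill while this runs, e.g. with `nvidia-smi` in a second terminal,
# to confirm the layers were actually offloaded.
output = llm("What is the meaning of life?", max_tokens=64)
print(output["choices"][0]["text"])

Raising the layer count pushes more of the model into VRAM (faster inference, more VRAM used); lowering it keeps more layers in system RAM.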
-
Hi,
I am trying to find the place where I can tell LLMWare to load models into GPU VRAM.
The only hint I found was in the model class for Hugging Face models, which has a fixed GPU_Offload of 50 layers, but I have no clue how this is meant to be used.
Here is my Hello World with llmware:

from llmware.prompts import Prompt

prompter = Prompt().load_model("TheBloke/OpenHermes-2.5-Mistral-7B-GGUF")
response = prompter.completion("What is the meaning of life?")
print(response)

What do I need to do to load the model into GPU VRAM, or to decide whether a model should be loaded into RAM or VRAM?
Thx!