Run '.model' in the ToolMate AI prompt and select 'llamacppserver' as the LLM interface. This option is designed for advanced users who want more control over the LLM backend; it is particularly useful for customisations such as GPU acceleration.
In short, compile your customised copy of llama.cpp on your device and enter its server command in the ToolMate AI configurations; ToolMate AI then starts the llama.cpp server automatically each time it launches.
On macOS, Metal is enabled by default when building llama.cpp, so computation runs on the GPU without extra build flags. To compile llama.cpp from source:
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
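Before configuring ToolMate AI, you can optionally check the build by launching the server manually; this assumes you already have a GGUF model file, here ~/models/wizardlm2.gguf as used in the examples below:
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --model ~/models/wizardlm2.gguf
Recent llama.cpp builds expose a /health endpoint, so running curl http://127.0.0.1:8080/health from another terminal should confirm the server is up. Stop the server (Ctrl+C) before letting ToolMate AI manage it.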
To configure ToolMate AI:
- Run 'toolmate' in your environment.
- Enter '.model' in the ToolMate AI prompt.
- Follow the instructions to enter the command line, server IP, port and timeout settings; example values are given below.
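For example, if you use the server command line shown below, the matching settings would be (the timeout value here is only an illustration):
server ip: 127.0.0.1
port: 8080
timeout: 30
The server IP and port must agree with the --host and --port flags in the server command.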
Here is an example server command line, followed by a brief explanation of its options:
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(sysctl -n hw.physicalcpu) --ctx-size 0 --chat-template chatml --parallel 2 --model ~/models/wizardlm2.gguf
--threads $(sysctl -n hw.physicalcpu): sets the number of threads to the number of physical CPU cores
--ctx-size: size of the prompt context (default: 0, where 0 means the value is loaded from the model)
--parallel 2: sets the number of slots for processing requests to 2
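Once ToolMate AI has started the server, you can optionally confirm that it is reachable by sending a request to its OpenAI-compatible endpoint (recent llama-server builds serve /v1/chat/completions; adjust the host and port if you changed them):
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'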
For more options:
cd ~/llama.cpp
./llama-server -h
The following example uses an AMD integrated GPU (iGPU) with ROCm acceleration; with the iGPU enabled, inference is roughly 1.5x faster than running on the CPU alone. Read https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/igpu_only/igpu_only.md for details.
Tested device: Beelink GTR6 (Ryzen 9 6900HX CPU + integrated Radeon 680M GPU + 64GB RAM)
Follow https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/README.md for ROCm installation.
Environment variables:
export ROCM_HOME=/opt/rocm
export LD_LIBRARY_PATH=/opt/rocm/include:/opt/rocm/lib:$LD_LIBRARY_PATH
export PATH=$HOME/.local/bin:/opt/rocm/bin:/opt/rocm/llvm/bin:$PATH
export HSA_OVERRIDE_GFX_VERSION=10.3.0
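These variables only apply to the current shell session. To make them persistent, you could append them to your shell profile, for example:
echo 'export ROCM_HOME=/opt/rocm' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/opt/rocm/include:/opt/rocm/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
echo 'export PATH=$HOME/.local/bin:/opt/rocm/bin:/opt/rocm/llvm/bin:$PATH' >> ~/.bashrc
echo 'export HSA_OVERRIDE_GFX_VERSION=10.3.0' >> ~/.bashrc
source ~/.bashrc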
Compile llama.cpp from source:
cd ~
cd llama.cpp
make GGML_HIPBLAS=1 GGML_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')
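If the build or GPU offload fails, it is worth checking that ROCm detects the integrated GPU with the tools installed alongside it, e.g.:
rocminfo | grep gfx
rocm-smi
The HSA_OVERRIDE_GFX_VERSION setting above tells ROCm to treat the iGPU as gfx1030, matching the AMDGPU_TARGETS value used in the build.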
Enter the full command line in the ToolMate AI configurations, as described in the previous example:
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 999 --model ~/models/wizardlm2.gguf
Please note that we used --gpu-layers 999 in the command above. You may want to change this value to suit your case.
--gpu-layers: number of layers to store in VRAM
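With --gpu-layers 999 the server attempts to offload every layer to the GPU. If you run short of VRAM, a smaller value offloads only part of the model, for example (the value 20 is only an illustration):
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 20 --model ~/models/wizardlm2.gguf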
The following example was tested on Ubuntu with dual AMD RX 7900 XTX GPUs. Full setup notes are documented at https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/blob/main/README.md
Compile llama.cpp from source:
cd ~
cd llama.cpp
make GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1100 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')
Enter the full command line in the ToolMate AI configurations, as described in the previous examples:
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 999 --model ~/models/wizardlm2.gguf
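To confirm that both GPUs are in use once the server has loaded the model, you can monitor VRAM usage with ROCm's monitoring tool, for example:
watch -n 1 rocm-smi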
For NVIDIA GPU users, compile llama.cpp from source with CUDA support:
cd ~
cd llama.cpp
make GGML_CUDA=1 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')
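The CUDA build requires the NVIDIA CUDA Toolkit; if make fails, first confirm the CUDA compiler is on your PATH:
nvcc --version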
Enter the full command line in the ToolMate AI configurations, as described in the previous examples:
~/llama.cpp/llama-server --host 127.0.0.1 --port 8080 --threads $(lscpu | grep '^Core(s)' | awk '{print $NF}') --ctx-size 0 --chat-template chatml --parallel 2 --gpu-layers 999 --model ~/models/wizardlm2.gguf
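As with the AMD setups, you can verify that layers have been offloaded by checking GPU memory usage once the server has loaded the model, for example:
nvidia-smi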
For more build options, see https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build