Starcoder mmap (and gpu) example #338
Conversation
if (i > embd_inp.size()) {
    t_predict_us += ggml_time_us() - t_start_us;
}
I noticed inconsistent ms/token stats when measuring the effect of putting layers on the GPU, so I added this to avoid counting prompt processing in the prediction time.
// Shouldn't the input prompt be subtracted?
printf("%s: predict time = %8.2f ms / %.2f ms per token\n", __func__, t_predict_us/1000.0f, t_predict_us/1000.0f/(n_past - embd_inp.size()));
And subtracted embd_inp here.
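Putting the two comments together, a minimal self-contained sketch of the resulting timing scheme (an illustration only, not the example's actual code: it uses std::chrono in place of ggml_time_us, a dummy eval step, and simplified loop indexing):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Dummy stand-in for evaluating one token with the model.
static void eval_token() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main() {
    const std::vector<int> embd_inp = {1, 2, 3, 4}; // "prompt" tokens
    const int n_predict = 8;                        // tokens to generate

    int64_t t_predict_us = 0;
    size_t  n_past       = 0;

    for (size_t i = 0; i < embd_inp.size() + n_predict; ++i) {
        const auto t_start = std::chrono::steady_clock::now();
        eval_token();
        n_past++;

        // Only accumulate time spent past the prompt, i.e. on predicted tokens.
        if (i >= embd_inp.size()) {
            t_predict_us += std::chrono::duration_cast<std::chrono::microseconds>(
                                std::chrono::steady_clock::now() - t_start).count();
        }
    }

    // Divide by the number of generated tokens, not by n_past.
    printf("predict time = %8.2f ms / %.2f ms per token\n",
           t_predict_us/1000.0f, t_predict_us/1000.0f/(n_past - embd_inp.size()));
    return 0;
}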
Cool! Long term the goal is to move support for all models into llama.cpp. But in any case this is useful - I'll think about whether we want to merge it.
Awesome! Is there a discussion somewhere about what shape adding new models to llama.cpp is going to take? I thought about making this PR against that repo but wasn't sure where to even start: a model_name.cpp in the root directory with adaptations to examples/main.cpp? A new directory in examples/ with its own main.cpp? Or just model-specific functions in llama.cpp?
When I run the shell script, the GPU utilization seems not to change (it stays at 0%) during inference. Is this correct?
Run ./starcoder-mmap if you have built this branch.
Thank you for the reply. After making, with my CUDA Driver Version, I ran it, but the inference didn't work correctly:
When I try your prompt and parameters using either starcoder or starcoder-mmap the output is <|endoftext|>, so it appears unrelated to these changes. You can try a positive temperature if you would like more output from that particular prompt.
It works well when the temperature is 0.2. Thank you for the awesome job. The acceleration effect is good (ngl 40: 300 ms/token -> 180 ms/token). Are there any plans to support multiple GPUs in the future?
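For anyone wondering why a positive temperature changes the behaviour here, a small self-contained sketch of temperature-scaled sampling (not the example's actual sampler, just the core idea): the logits are divided by the temperature before the softmax, so a near-zero temperature collapses to argmax and can latch onto <|endoftext|> right away, while 0.2 still lets other tokens through.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Sample a token id from raw logits at a given temperature.
// As temp -> 0 this approaches argmax (greedy); larger temp spreads probability out.
static int sample_with_temperature(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit)/temp); // temperature scaling
        sum += probs[i];
    }
    for (auto & p : probs) {
        p /= sum;
    }

    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}

int main() {
    std::mt19937 rng(42);
    const std::vector<float> logits = {2.0f, 1.8f, 0.5f}; // pretend token 0 is <|endoftext|>

    for (float temp : {0.01f, 0.2f, 1.0f}) {
        printf("temp %.2f ->", temp);
        for (int i = 0; i < 10; ++i) {
            printf(" %d", sample_with_temperature(logits, temp, rng));
        }
        printf("\n");
    }
    return 0;
}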
This branch demonstrates sample GPU inference of Starcoder. I just synced it with the latest CUDA code from llama.cpp, but the output is only correct after disabling the quantized mul_mat_vec kernels like this:

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index dc4b773..a0b4988 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -2459,7 +2459,7 @@ inline void ggml_cuda_op_mul_mat_vec(
         src0->type == GGML_TYPE_Q5_1 ||
         src0->type == GGML_TYPE_Q8_0;
-    const bool use_mul_mat_vec_q = g_compute_capabilities[id] >= 610 && mul_mat_vec_q_implemented;
+    const bool use_mul_mat_vec_q = false;
 #endif
     if (use_mul_mat_vec_q) {

So it might indicate some issue in those kernels and it's probably worth looking into it. Easiest steps to repro:
You need to offload just 1 layer to trigger the issue, but you can also offload more.
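For readers following along, here is a self-contained sketch of the dispatch that flag controls (placeholder function and type names only, not the real symbols in ggml-cuda.cu): when use_mul_mat_vec_q is true the fused quantized mat-vec kernels run; the workaround above hard-codes it to false, so the dequantize-on-the-fly path is always taken instead.

#include <cstdio>

// Placeholder types and kernels for illustration; not the actual ggml-cuda code.
enum ggml_type { GGML_TYPE_Q4_0, GGML_TYPE_Q4_1, GGML_TYPE_Q5_0, GGML_TYPE_Q5_1, GGML_TYPE_Q8_0, GGML_TYPE_F16 };

static void mul_mat_vec_q_path()      { printf("fused quantized mat-vec kernel\n"); }
static void dequantize_mul_mat_path() { printf("dequantize + mat-vec fallback\n"); }

static void mul_mat_vec_dispatch(ggml_type src0_type, int compute_capability) {
    const bool mul_mat_vec_q_implemented =
        src0_type == GGML_TYPE_Q4_0 || src0_type == GGML_TYPE_Q4_1 ||
        src0_type == GGML_TYPE_Q5_0 || src0_type == GGML_TYPE_Q5_1 ||
        src0_type == GGML_TYPE_Q8_0;

    // Upstream condition; the workaround in the diff hard-codes this to false,
    // so the fallback path below is always taken.
    const bool use_mul_mat_vec_q = compute_capability >= 610 && mul_mat_vec_q_implemented;

    if (use_mul_mat_vec_q) {
        mul_mat_vec_q_path();
    } else {
        dequantize_mul_mat_path();
    }
}

int main() {
    mul_mat_vec_dispatch(GGML_TYPE_Q8_0, 610); // would pick the quantized kernel
    mul_mat_vec_dispatch(GGML_TYPE_F16,  610); // always falls back
    return 0;
}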
Not sure if this is really worthy of adding to the repo, but I have got mmap loading of Starcoder-based models working; this allows their use on systems with 16 GB of RAM where it wasn't possible before.
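For readers unfamiliar with the technique, a minimal Linux-only sketch of the idea behind mmap loading (an illustration only; the PR itself follows the llama.cpp implementation linked below): the weights file is mapped into the address space and pages are faulted in on demand, so the model does not need to be copied into an allocated buffer up front, and clean pages can be dropped under memory pressure instead of swapped.

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "model.bin"; // hypothetical weights file

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; the kernel loads pages lazily as they are touched.
    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Hint that the initial pass over the weights is mostly sequential.
    madvise(addr, st.st_size, MADV_SEQUENTIAL);

    // Tensor data pointers can now point directly into the mapping
    // instead of into a separately allocated buffer.
    printf("mapped %lld bytes at %p\n", (long long) st.st_size, addr);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}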
I have also added a few lines to run the layers that fit on the GPU with CUDA or CLBlast (the CLBlast version is taken from koboldcpp). On my limited system this improves token latency from 380 ms/token to 330 ms/token (only 8 GB, so 20 layers offloaded).
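The partial offload itself boils down to deciding which layers' weights live on the GPU; everything else is evaluated on the CPU as in a normal run. A toy sketch of that placement decision (made-up names, and whether the first or last layers are offloaded is an implementation detail of the example):

#include <cstdio>
#include <vector>

// Where a layer's weights live in this toy model.
enum class backend { CPU, GPU };

int main() {
    const int n_layer      = 40; // illustrative layer count
    const int n_gpu_layers = 20; // e.g. what fits in 8 GB of VRAM

    std::vector<backend> placement(n_layer, backend::CPU);

    // Offload a contiguous block of n_gpu_layers layers; the rest stay on the CPU.
    for (int il = n_layer - n_gpu_layers; il < n_layer; ++il) {
        placement[il] = backend::GPU;
    }

    for (int il = 0; il < n_layer; ++il) {
        printf("layer %2d -> %s\n", il, placement[il] == backend::GPU ? "GPU" : "CPU");
    }
    return 0;
}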
I copied the mmap code mostly from ggerganov/llama.cpp#613; I have only tested on Linux, but it seems to work as expected.
Apologies if the changes are hard to review. I figured making a new file was cleaner than adding changes to common.cpp/h that would only apply to this example, just to pass through options for turning mmap on and off. It is easiest to read by diffing starcoder-mmap.cpp against main.cpp.