
Starcoder mmap (and gpu) example #338

Merged
4 commits merged into ggerganov:master on Jul 14, 2023

Conversation

@johnson442
Contributor
commented Jul 3, 2023

Not sure if this is really worthy of adding to the repo, but I have got mmap loading of starcoder-based models working, which allows their use on systems with 16 GB of RAM where it wasn't possible before.

I have also added a few lines to run the layers that fit on the GPU with CUDA or CLBlast (the CLBlast version is taken from koboldcpp). On my limited system this improves token latency from 380 ms/token to 330 ms/token (only 8 GB, so 20 layers offloaded).
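
For context on what the offload does: the usual pattern in these examples is to mark the weight tensors of the first -ngl layers as GPU-resident right after loading, so the CUDA/CLBlast backend runs those matmuls on the device. The sketch below is illustrative only; the API names (ggml_cuda_transform_tensor, GGML_BACKEND_GPU) and the starcoder layer field names are assumptions based on ggml at the time, not code from this PR.

```cpp
#ifdef GGML_USE_CUBLAS
// Illustrative sketch only: offload the large matmul weights of the first
// n_gpu_layers layers. API and field names are assumptions, not this PR's code.
static void offload_layers(starcoder_model & model, int n_gpu_layers) {
    const int n_layers = (int) model.layers.size();
    const int n = n_gpu_layers < n_layers ? n_gpu_layers : n_layers;
    for (int il = 0; il < n; ++il) {
        auto & layer = model.layers[il];
        // Only the big weight matrices are worth moving; biases/norms stay on the CPU.
        struct ggml_tensor * weights[] = {
            layer.c_attn_attn_w, layer.c_attn_proj_w,
            layer.c_mlp_fc_w,    layer.c_mlp_proj_w,
        };
        for (struct ggml_tensor * t : weights) {
            t->backend = GGML_BACKEND_GPU;          // route this tensor's matmuls to the GPU
            ggml_cuda_transform_tensor(t->data, t); // copy the (quantized) weights to VRAM
        }
    }
}
#endif
```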

I copied the mmap stuff mostly from ggerganov/llama.cpp#613. I have only tested on Linux, but it seems to work as expected.
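
For anyone reading along, the core of that mmap path, very roughly sketched below (this shows the general POSIX technique, not the actual code from llama.cpp#613 or starcoder-mmap.cpp), is to map the model file read-only and point tensor data into the mapping instead of copying it into allocated buffers:

```cpp
// Rough sketch of mmap-based model loading (POSIX); names and structure are
// illustrative and do not mirror starcoder-mmap.cpp.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

struct mapped_file {
    void * addr = nullptr;
    size_t size = 0;
};

static bool map_model_file(const char * fname, mapped_file & out) {
    const int fd = open(fname, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    out.size = (size_t) st.st_size;

    // Pages are faulted in on demand, so a ~15B model can be used without ever
    // holding a private copy of the whole file in RAM at once.
    out.addr = mmap(nullptr, out.size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after the fd is closed
    if (out.addr == MAP_FAILED) { out.addr = nullptr; return false; }

    madvise(out.addr, out.size, MADV_SEQUENTIAL); // hint: mostly sequential reads during load
    return true;
}

// During loading, instead of fin.read((char *) tensor->data, nbytes), each tensor
// can then point straight into the mapping at its offset in the file:
//   tensor->data = (uint8_t *) mf.addr + tensor_offset_in_file;
```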

Apologies if the changes are hard to review. I figured making a new file was cleaner than adding changes to common.cpp/h that would only apply to this example, just to pass through options for turning mmap on and off. It is easiest to review by diffing starcoder-mmap.cpp against main.cpp.

Comment on lines +1043 to +1045
if (i > embd_inp.size()) {
t_predict_us += ggml_time_us() - t_start_us;
}
Contributor Author (@johnson442):

I noticed inconsistent ms/token stats when measuring the effect of putting layers on the GPU, so I added this to avoid counting input processing in the prediction time.

Comment on lines +1110 to +1111
//Shouldnt the input prompt be subracted?
printf("%s: predict time = %8.2f ms / %.2f ms per token\n", __func__, t_predict_us/1000.0f, t_predict_us/1000.0f/(n_past - embd_inp.size()));
Contributor Author (@johnson442):

And subtracted embd_inp here.
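
Putting the two hunks together (a paraphrase using the example's existing variable names i, embd_inp, n_past and t_predict_us; the surrounding code is elided), the reported figure becomes an average over generated tokens only:

```cpp
// Paraphrased sketch of the timing change, not a verbatim quote of the diff.
const int64_t t_start_us = ggml_time_us();
// ... evaluate the current batch with starcoder_eval(...) ...
if (i > embd_inp.size()) {
    // prompt/input processing is excluded from the prediction time
    t_predict_us += ggml_time_us() - t_start_us;
}

// ... and at the end, divide by the number of generated tokens rather than n_past:
printf("predict time = %8.2f ms / %.2f ms per token\n",
       t_predict_us/1000.0f, t_predict_us/1000.0f/(n_past - embd_inp.size()));
```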

@ggerganov
Owner

Cool!

Long term, the goal is to move support for all models into llama.cpp, where we already have the mmap and GPU machinery, and therefore to keep the ggml examples simple and minimalistic.

But in any case this is useful - I'll think about whether we want to merge it.

@johnson442
Contributor Author

johnson442 commented Jul 5, 2023

Awesome!

Is there a discussion somewhere about what shape adding new models to llama.cpp is going to take?

I thought about making this PR against that repo but wasn't sure where to even start: a model_name.cpp in the root directory with adaptations to examples/main.cpp? A new directory in examples/ with its own main.cpp? Or just model-specific functions in llama.cpp?

@llystar

llystar commented Jul 5, 2023


When I run the following command:

./bin/starcoder -ngl 20 -t 24 -b 64 -m /data/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin -p "You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in \"\`\`\`\". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n\n\nYour answer:\n\`\`\`" --top_k 0 --top_p 0.95 --temp 0

the GPU utilization does not seem to change (it stays at 0%) during inference.

Is this correct?

@johnson442
Contributor Author

./bin/starcoder -ngl 20 -t 24 -b 64 -m /data/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin

Run ./starcoder-mmap if you have built this branch.

@llystar

llystar commented Jul 5, 2023


Thank you for the reply.

After building with

cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc .. && make -j4 starcoder starcoder-quantize starcoder-mmap

(CUDA driver version: 12.1, GPU: Tesla T4)

and running

./bin/starcoder-mmap -ngl 20 -t 24 -b 64 -m /model/WizardCoder-15B-1.0-GGML/WizardCoder-15B-1.0.ggmlv3.q4_0.bin -p "You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in \"\`\`\`\". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n\n\nYour answer:\n\`\`\`" --top_k 0 --top_p 0.95 --temp 0

the inference did not work correctly:

Calling starcoder_eval
You are a Python development engineer. Please complete the corresponding function according to the function comment below. And add your answer written in "```". \n\nfrom typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    """ Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    """\n\n\nYour answer:\n```<|endoftext|>

main: mem per token =   462284 bytes
main:     load time =  5219.62 ms
main:   sample time =     0.36 ms
main:  predict time =     0.00 ms / -nan ms per token
main:    total time = 18967.41 ms

@johnson442
Contributor Author

When I try your prompt and parameters using either starcoder or starcoder-mmap, the output is <|endoftext|>, so this appears unrelated to these changes.

You can try a positive temperature if you would like more output from that particular prompt.

@llystar

llystar commented Jul 5, 2023


It works well when the temperature is 0.2.

Thank you for the awesome job. The acceleration is good (-ngl 40: 300 ms/token -> 180 ms/token).

Are there any plans to support multiple GPUs in the future?

@ggerganov
Owner

ggerganov commented Jul 14, 2023

@JohannesGaessler

This branch demonstrates sample GPU inference of Starcoder. I just synced it with the latest CUDA code from master and the inference breaks. I tried to trace it and found that if I disable the mul_mat_vec_q kernels, it works:

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index dc4b773..a0b4988 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -2459,7 +2459,7 @@ inline void ggml_cuda_op_mul_mat_vec(
         src0->type == GGML_TYPE_Q5_1 ||
         src0->type == GGML_TYPE_Q8_0;
 
-    const bool use_mul_mat_vec_q = g_compute_capabilities[id] >= 610 && mul_mat_vec_q_implemented;
+    const bool use_mul_mat_vec_q = false;
 #endif
 
     if (use_mul_mat_vec_q) {

So it might indicate some issue in those kernels and it's probably worth looking into it.

Easiest steps to repro:

  • Create p-prompt.txt with the following contents:

    ### Human: Write a function to check a C string for valid UTF-8 encoding without using external libs in C++.
    ### Assistant: Sure, here's the function:
    ```cpp

  • Build with cmake -DGGML_CUBLAS=ON
  • Run:

    ./bin/starcoder-mmap -t 8 -m models/starcoder/starcoderplus-guanaco-gpt4.ggmlv1.q4_0.bin -n 4096 --top_p 0.3 --temp 1 --top_k 9999 -f p-prompt.txt -s 123 -ngl 1

You need to offload just 1 layer to trigger the issue, but you can also offload more.
The above steps currently generate gibberish; applying the single-line patch above fixes it.

@ggerganov ggerganov merged commit d2b178e into ggerganov:master Jul 14, 2023