Starcoder mmap (and gpu) example #338
Conversation
if (i > embd_inp.size()) {
    t_predict_us += ggml_time_us() - t_start_us;
}
I noticed inconsistent ms/token stats when measuring the effect of putting layers on the GPU, so I added this to avoid counting prompt processing in the prediction time.
// Shouldn't the input prompt be subtracted?
printf("%s: predict time = %8.2f ms / %.2f ms per token\n", __func__, t_predict_us/1000.0f, t_predict_us/1000.0f/(n_past - embd_inp.size()));
And subtracted embd_inp here.
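Putting the two comments together, a minimal self-contained sketch of the resulting timing scheme (an illustration only, not the example's actual code: it uses std::chrono in place of ggml_time_us, a dummy eval step, and simplified loop indexing):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Dummy stand-in for evaluating one token with the model.
static void eval_token() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main() {
    const std::vector<int> embd_inp = {1, 2, 3, 4}; // "prompt" tokens
    const int n_predict = 8;                        // tokens to generate

    int64_t t_predict_us = 0;
    size_t  n_past       = 0;

    for (size_t i = 0; i < embd_inp.size() + n_predict; ++i) {
        const auto t_start = std::chrono::steady_clock::now();
        eval_token();
        n_past++;

        // Only accumulate time spent past the prompt, i.e. on predicted tokens.
        if (i >= embd_inp.size()) {
            t_predict_us += std::chrono::duration_cast<std::chrono::microseconds>(
                                std::chrono::steady_clock::now() - t_start).count();
        }
    }

    // Divide by the number of generated tokens, not by n_past.
    printf("predict time = %8.2f ms / %.2f ms per token\n",
           t_predict_us/1000.0f, t_predict_us/1000.0f/(n_past - embd_inp.size()));
    return 0;
}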
Cool! Long term the goal is to move support for all models into llama.cpp. But in any case this is useful - I'll think about whether we want to merge it.
Awesome! Is there a discussion somewhere about what shape adding new models to llama.cpp is going to take? I thought about making this PR against that repo but wasn't sure where to even start: a model_name.cpp in the root directory with adaptations to examples/main.cpp? A new directory in examples/ with its own main.cpp? Or just model-specific functions in llama.cpp?
When I run the shell script, the GPU utilization seems not to change (it stays at 0%) during inference. Is this correct?
Run ./starcoder-mmap if you have built this branch.
Thank you for the reply. After making, with my CUDA Driver Version, I ran it, but the inference didn't work correctly:
When I try your prompt and parameters using either starcoder or starcoder-mmap the output is <|endoftext|>, so it appears unrelated to these changes. You can try a positive temperature if you would like more output from that particular prompt.
It works well when the temperature is 0.2. Thank you for the awesome job. The acceleration effect is good (ngl 40: 300 ms/token -> 180 ms/token). Are there any plans to support multiple GPUs in the future?
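For anyone wondering why a positive temperature changes the behaviour here, a small self-contained sketch of temperature-scaled sampling (not the example's actual sampler, just the core idea): the logits are divided by the temperature before the softmax, so a near-zero temperature collapses to argmax and can latch onto <|endoftext|> right away, while 0.2 still lets other tokens through.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Sample a token id from raw logits at a given temperature.
// As temp -> 0 this approaches argmax (greedy); larger temp spreads probability out.
static int sample_with_temperature(const std::vector<float> & logits, float temp, std::mt19937 & rng) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit)/temp); // temperature scaling
        sum += probs[i];
    }
    for (auto & p : probs) {
        p /= sum;
    }

    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}

int main() {
    std::mt19937 rng(42);
    const std::vector<float> logits = {2.0f, 1.8f, 0.5f}; // pretend token 0 is <|endoftext|>

    for (float temp : {0.01f, 0.2f, 1.0f}) {
        printf("temp %.2f ->", temp);
        for (int i = 0; i < 10; ++i) {
            printf(" %d", sample_with_temperature(logits, temp, rng));
        }
        printf("\n");
    }
    return 0;
}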
This branch demonstrates sample GPU inference of Starcoder. I just synced it with the latest CUDA code from llama.cpp, but the output is only correct after disabling the quantized mul_mat_vec kernels like this:

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index dc4b773..a0b4988 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -2459,7 +2459,7 @@ inline void ggml_cuda_op_mul_mat_vec(
         src0->type == GGML_TYPE_Q5_1 ||
         src0->type == GGML_TYPE_Q8_0;
-    const bool use_mul_mat_vec_q = g_compute_capabilities[id] >= 610 && mul_mat_vec_q_implemented;
+    const bool use_mul_mat_vec_q = false;
 #endif
     if (use_mul_mat_vec_q) {

So it might indicate some issue in those kernels and it's probably worth looking into it. Easiest steps to repro:
You need to offload just 1 layer to trigger the issue, but you can also offload more.
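For readers following along, here is a self-contained sketch of the dispatch that flag controls (placeholder function and type names only, not the real symbols in ggml-cuda.cu): when use_mul_mat_vec_q is true the fused quantized mat-vec kernels run; the workaround above hard-codes it to false, so the dequantize-on-the-fly path is always taken instead.

#include <cstdio>

// Placeholder types and kernels for illustration; not the actual ggml-cuda code.
enum ggml_type { GGML_TYPE_Q4_0, GGML_TYPE_Q4_1, GGML_TYPE_Q5_0, GGML_TYPE_Q5_1, GGML_TYPE_Q8_0, GGML_TYPE_F16 };

static void mul_mat_vec_q_path()      { printf("fused quantized mat-vec kernel\n"); }
static void dequantize_mul_mat_path() { printf("dequantize + mat-vec fallback\n"); }

static void mul_mat_vec_dispatch(ggml_type src0_type, int compute_capability) {
    const bool mul_mat_vec_q_implemented =
        src0_type == GGML_TYPE_Q4_0 || src0_type == GGML_TYPE_Q4_1 ||
        src0_type == GGML_TYPE_Q5_0 || src0_type == GGML_TYPE_Q5_1 ||
        src0_type == GGML_TYPE_Q8_0;

    // Upstream condition; the workaround in the diff hard-codes this to false,
    // so the fallback path below is always taken.
    const bool use_mul_mat_vec_q = compute_capability >= 610 && mul_mat_vec_q_implemented;

    if (use_mul_mat_vec_q) {
        mul_mat_vec_q_path();
    } else {
        dequantize_mul_mat_path();
    }
}

int main() {
    mul_mat_vec_dispatch(GGML_TYPE_Q8_0, 610); // would pick the quantized kernel
    mul_mat_vec_dispatch(GGML_TYPE_F16,  610); // always falls back
    return 0;
}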
Not sure if this is really worthy of adding to the repo, but I have got mmap loading of Starcoder-based models working; this allows their use on systems with 16 GB of RAM where it wasn't possible before.
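For readers unfamiliar with the technique, a minimal Linux-only sketch of the idea behind mmap loading (an illustration only; the PR itself follows the llama.cpp implementation linked below): the weights file is mapped into the address space and pages are faulted in on demand, so the model does not need to be copied into an allocated buffer up front, and clean pages can be dropped under memory pressure instead of swapped.

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "model.bin"; // hypothetical weights file

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; the kernel loads pages lazily as they are touched.
    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Hint that the initial pass over the weights is mostly sequential.
    madvise(addr, st.st_size, MADV_SEQUENTIAL);

    // Tensor data pointers can now point directly into the mapping
    // instead of into a separately allocated buffer.
    printf("mapped %lld bytes at %p\n", (long long) st.st_size, addr);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}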
I have also added a few lines to run the layers that fit on the GPU with CUDA or CLBlast (the CLBlast version is taken from koboldcpp). On my limited system this improves token latency from 380 ms/token to 330 ms/token (only 8 GB, so 20 layers offloaded).
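The partial offload itself boils down to deciding which layers' weights live on the GPU; everything else is evaluated on the CPU as in a normal run. A toy sketch of that placement decision (made-up names, and whether the first or last layers are offloaded is an implementation detail of the example):

#include <cstdio>
#include <vector>

// Where a layer's weights live in this toy model.
enum class backend { CPU, GPU };

int main() {
    const int n_layer      = 40; // illustrative layer count
    const int n_gpu_layers = 20; // e.g. what fits in 8 GB of VRAM

    std::vector<backend> placement(n_layer, backend::CPU);

    // Offload a contiguous block of n_gpu_layers layers; the rest stay on the CPU.
    for (int il = n_layer - n_gpu_layers; il < n_layer; ++il) {
        placement[il] = backend::GPU;
    }

    for (int il = 0; il < n_layer; ++il) {
        printf("layer %2d -> %s\n", il, placement[il] == backend::GPU ? "GPU" : "CPU");
    }
    return 0;
}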
I copied the mmap code mostly from ggerganov/llama.cpp#613; I have only tested on Linux, but it seems to work as expected.
Apologies if the changes are hard to review. I figured making a new file was cleaner than adding changes to common.cpp/h that would only apply to this example, just to pass through options for turning mmap on and off. It is easiest to read by diffing starcoder-mmap.cpp against main.cpp.