FYI: Quantizations of LLaMA-2-7B-32K and Llama-2-7B-32K-Instruct (both trained w/ context lengths of 32K!) available in GGUF format #2961
-
Thank you for the work! I tried the instruct Q8 version for a summarization task, but I didn't get good responses. I tried enclosing the prompt in the template provided in the original Hugging Face model card, but the model just returned the original text without a summary. Have you tried them for summarization? The model is supposedly fine-tuned for summarization.
-
Unfortunately, I have to confirm that behaviour: according to the logs written by the current revision of llama.cpp (I just synced with the latest commit), the prompt is just tokenized but no inference is run. I'll still have to investigate why...
-
Ok, here are my findings:
with
Don't ask me why, but with all these LLaMA variants and derivatives I often (always?) get better responses when using the "### Instruction: ... ### Response:" prompt pattern - perhaps this is a legacy of earlier trainings or older training data sets? The
stops the model from spitting out endless (almost) empty lines. Important: the "empty" lines in fact contain two blanks each, as this is what the model actually outputs. Summarization works now, but you will have to experiment with the size of the text to summarize and your context length: long contexts need huge amounts of RAM and also take their time during prompt processing. What I haven't checked so far is whether the Together Computer models perform better at summarization than the original LLaMA-2 models: without enough RAM, you may not benefit from the 32K context size of the fine-tuned models.
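For illustration, here is a minimal sketch of that prompt pattern combined with a stop sequence, using the llama-cpp-python bindings rather than the exact command used above - the model path, context size and stop strings are assumptions, not my actual settings:

```python
from llama_cpp import Llama

# Assumed local path to the quantized model - adjust to your own file.
# Pick a context size your RAM can handle (up to 32768 for these models).
llm = Llama(model_path="llama-2-7b-32k-instruct.Q8_0.gguf", n_ctx=8192)

text_to_summarize = "..."  # the document you want summarized

# The "### Instruction: ... ### Response:" prompt pattern mentioned above.
prompt = (
    "### Instruction:\n"
    "Summarize the following text.\n\n"
    f"{text_to_summarize}\n\n"
    "### Response:\n"
)

# The second stop string is only a guess at cutting off the runs of
# "empty" lines (each containing two blanks) described above.
output = llm(prompt, max_tokens=512, stop=["### Instruction:", "  \n  \n"])
print(output["choices"][0]["text"])
```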
-
Thanks a lot for your reply. But when the input is larger (about 7000 tokens in my case), the model just continues the text without producing a summary. Note that I couldn't enable 'tfs' because there is no option for that in the Python version. I used a context size of 8192.
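Since the input size seems to be the crux here, a quick way to check how close the prompt gets to the window - a sketch assuming the llama-cpp-python bindings; file names and the model path are made up:

```python
from llama_cpp import Llama

# Assumed model path; n_ctx matches the 8192-token context mentioned above.
llm = Llama(model_path="llama-2-7b-32k-instruct.Q8_0.gguf", n_ctx=8192)

with open("article.txt") as f:  # hypothetical input file
    prompt = (
        "### Instruction:\nSummarize the following text.\n\n"
        f"{f.read()}\n\n### Response:\n"
    )

# tokenize() expects bytes in llama-cpp-python.
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
print(f"{n_prompt} prompt tokens in an 8192-token window")
# With ~7000 input tokens, the template plus a few hundred response tokens
# leave little headroom - the model may simply run out of room to answer.
```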
-
Yeah, it is a 128 GB RAM machine.
> are you sure that you have enough memory for large context sizes and long texts?
-
Did a small (but multiple-hours-long) benchmark (ignore the funky model name; it's converted from TheBloke's GGML, and yes, I patched the rope_scale).
As expected, performance degrades considerably as the KV cache grows.
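For a rough sense of how the KV cache grows with context length, here is a back-of-the-envelope estimate - a sketch assuming LLaMA-2-7B's usual dimensions and an fp16 cache; llama.cpp's actual allocation may differ somewhat:

```python
# Back-of-the-envelope KV-cache size for LLaMA-2-7B (assumed dimensions).
n_layer = 32          # transformer layers
n_embd = 4096         # embedding width (32 heads * 128 head dim)
bytes_per_value = 2   # fp16

def kv_cache_bytes(n_ctx: int) -> int:
    # Two tensors (K and V) per layer, each holding n_ctx * n_embd values.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value

for n_ctx in (4096, 8192, 32768):
    print(f"{n_ctx:>6} tokens: ~{kv_cache_bytes(n_ctx) / 2**30:.1f} GiB")
# Roughly 2 GiB at 4K, 4 GiB at 8K and 16 GiB at 32K - on top of the
# model weights themselves, which explains the slowdown at large contexts.
```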
-
Cool - thanks for your effort (and for sharing its results)!
-
Just to let you know:
I've quantized Together Computer, Inc.'s LLaMA-2-7B-32K and Llama-2-7B-32K-Instruct models and uploaded them in GGUF format - ready to be used with llama.cpp
Both have been trained with a context length of 32K - and, provided that you have enough RAM, you can benefit from such large contexts right away!
You will find the quantizations for LLaMA-2-7B-32K_GGUF and LLaMA-2-7B-32K-Instruct_GGUF on Hugging Face.
In my personal opinion, these models give much better responses than the new CodeLLaMA models (which rather disappointed me).
Enjoy!