FYI: Quantizations of LLaMA-2-7B-32K and Llama-2-7B-32K-Instruct (both trained w/ context lengths of 32K!) available in GGUF format #2961
-
Thank you for the work! I tried the instruct Q8 version for a summarization task, but I didn't get good responses. I tried enclosing the prompt in the template provided in the original Hugging Face model card, but the model just returned the original text without a summary. Have you tried them for summarization? The model is supposedly fine-tuned for summarization.
-
Unfortunately, I have to confirm that behaviour: according to the logs written by the current revision of llama.cpp (I just synced with the latest commit), the prompt is just tokenized but no inference is run. I'll still have to investigate why...
-
Ok, here are my findings:
with
Don't ask me why, but with all these LLaMA variants and derivatives I often (always?) get better responses when using the "### Instruction: ... ### Response:" prompt pattern - perhaps this is a legacy of earlier trainings or older training data sets? The
stops the model from spitting out endless (almost) empty lines. Important: the "empty" lines in fact contain two blanks each, as this is what the model actually outputs. Summarization works now, but you will have to experiment with the size of the text to summarize and your context length: long contexts need huge amounts of RAM and also take their time during prompt processing. What I haven't checked so far is whether the Together Computer models perform better at summarization than the original LLaMA-2 models: without enough RAM, you may not benefit from the 32K context size of the fine-tuned models.
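For illustration, here is a minimal sketch of that prompt pattern combined with a stop sequence, using the llama-cpp-python bindings rather than the exact command used above - the model path, context size and stop strings are assumptions, not my actual settings:

```python
from llama_cpp import Llama

# Assumed local path to the quantized model - adjust to your own file.
# Pick a context size your RAM can handle (up to 32768 for these models).
llm = Llama(model_path="llama-2-7b-32k-instruct.Q8_0.gguf", n_ctx=8192)

text_to_summarize = "..."  # the document you want summarized

# The "### Instruction: ... ### Response:" prompt pattern mentioned above.
prompt = (
    "### Instruction:\n"
    "Summarize the following text.\n\n"
    f"{text_to_summarize}\n\n"
    "### Response:\n"
)

# The second stop string is only a guess at cutting off the runs of
# "empty" lines (each containing two blanks) described above.
output = llm(prompt, max_tokens=512, stop=["### Instruction:", "  \n  \n"])
print(output["choices"][0]["text"])
```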
-
Thanks a lot for your reply. But when the input is larger (about 7000 tokens in my case), the model just continues the text without producing a summary. Note that I couldn't enable 'tfs' because there is no option for that in the Python version. I used a context size of 8192.
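Since the input size seems to be the crux here, a quick way to check how close the prompt gets to the window - a sketch assuming the llama-cpp-python bindings; file names and the model path are made up:

```python
from llama_cpp import Llama

# Assumed model path; n_ctx matches the 8192-token context mentioned above.
llm = Llama(model_path="llama-2-7b-32k-instruct.Q8_0.gguf", n_ctx=8192)

with open("article.txt") as f:  # hypothetical input file
    prompt = (
        "### Instruction:\nSummarize the following text.\n\n"
        f"{f.read()}\n\n### Response:\n"
    )

# tokenize() expects bytes in llama-cpp-python.
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
print(f"{n_prompt} prompt tokens in an 8192-token window")
# With ~7000 input tokens, the template plus a few hundred response tokens
# leave little headroom - the model may simply run out of room to answer.
```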
-
Yeah, it is a 128 GB RAM machine.
> are you sure that you have enough memory for large context sizes and long texts?
-
Did a small (but multiple-hours-long) benchmark (ignore the funky model name; it's converted from TheBloke's GGML, and yes, I patched the rope_scale).
As expected, performance degrades considerably as the KV cache grows.
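For a rough sense of how the KV cache grows with context length, here is a back-of-the-envelope estimate - a sketch assuming LLaMA-2-7B's usual dimensions and an fp16 cache; llama.cpp's actual allocation may differ somewhat:

```python
# Back-of-the-envelope KV-cache size for LLaMA-2-7B (assumed dimensions).
n_layer = 32          # transformer layers
n_embd = 4096         # embedding width (32 heads * 128 head dim)
bytes_per_value = 2   # fp16

def kv_cache_bytes(n_ctx: int) -> int:
    # Two tensors (K and V) per layer, each holding n_ctx * n_embd values.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value

for n_ctx in (4096, 8192, 32768):
    print(f"{n_ctx:>6} tokens: ~{kv_cache_bytes(n_ctx) / 2**30:.1f} GiB")
# Roughly 2 GiB at 4K, 4 GiB at 8K and 16 GiB at 32K - on top of the
# model weights themselves, which explains the slowdown at large contexts.
```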
-
Cool - thanks for your effort (and for sharing its results)!
-
Just to let you know:
I've quantized Together Computer, Inc.'s LLaMA-2-7B-32K and Llama-2-7B-32K-Instruct models and uploaded them in GGUF format - ready to be used with llama.cpp
Both have been trained with a context length of 32K - and, provided that you have enough RAM, you can benefit from such large contexts right away!
You will find the quantizations for LLaMA-2-7B-32K_GGUF and LLaMA-2-7B-32K-Instruct_GGUF on Hugging Face.
In my personal opinion, these models give much better responses than the new CodeLLaMA models (which rather disappointed me).
Enjoy!