Anyone keeping tabs on Vicuna, a new LLaMA-based model? #643
-
Dep is pissed that they stole his name.
-
There are ggml weights on 🤗 uploaded just yesterday. Haven't had the chance to try them yet.
On Tue, 4 Apr 2023 at 02:23, edmundronald wrote:
> So what's the news on this? Are the quantized weights available?
--
Regards,
Jesse Jojo Johnson
http://www.jessejojojohnson.com/
-
Tried the one on huggingface hub,
-
First tries with Vicuna 13B 4-bit here. A zero-shot example is below. The answer is not that bad and written in the style of its big bro. Looks quite interesting.
-
Here is a prompt for Vicuna for llama.cpp. I run it like so:

Btw, any idea how I can redirect Vicuna output to some text-to-speech program?
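A minimal sketch of one way to wire that up, assuming llama.cpp's ./main binary and the espeak CLI are installed; the model path, prompt, and token count below are placeholders, not anything from this thread:

```python
import subprocess

# Hypothetical model path and prompt; adjust to your own setup.
llama = subprocess.Popen(
    ["./main", "-m", "./models/ggml-vicuna-13b-q4_0.bin",
     "-p", "### Human: Tell me a short story.\n### Assistant:", "-n", "256"],
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,  # keep llama.cpp's loading/perf logs out of the way
    text=True,
)

# Read the generated text line by line and speak each non-empty line with espeak.
for line in llama.stdout:
    text = line.strip()
    if text:
        subprocess.run(["espeak", text])
```

Any other TTS command that accepts text as an argument (e.g. `say` on macOS) could be swapped in for espeak.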
-
Yes, I keep tabs on all LLaMA descendants, under "models": https://github.com/underlines/awesome-marketing-datascience/blob/master/awesome-ai.md
-
Sorry if this is obvious, but is there a way currently to run the quantized Vicuna model in Python interactively on CPU (any bindings)? Or a stable way to call the executable from Python interactively?
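One option is the llama-cpp-python bindings, which load ggml models on CPU. A minimal interactive sketch, assuming that package is installed and that the model path (a placeholder here) points at a quantized Vicuna file; the "### Human / ### Assistant" prompt format is also an assumption:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load the quantized model on CPU; path and thread count are placeholders.
llm = Llama(model_path="./models/ggml-vicuna-13b-q4_0.bin", n_ctx=2048, n_threads=8)

while True:
    question = input("You: ").strip()
    if not question:
        break
    out = llm(
        f"### Human: {question}\n### Assistant:",
        max_tokens=256,
        stop=["### Human:"],  # stop before the model starts the next turn
    )
    print("Vicuna:", out["choices"][0]["text"].strip())
```

Calling the executable from Python via subprocess also works, but the bindings keep the model loaded between prompts, which is much faster for interactive use.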
-
I found this vicuna-7b-4bit on HF: https://huggingface.co/chharlesonfire/ggml-vicuna-7b-4bit and here is my result:
-
I just use an additional
-
Anyone getting very slow performance with llama.cpp on an M1 Pro with 16 GB RAM? Running
-
Seems to work with StableVicuna as well: https://stability.ai/blog/stablevicuna-open-source-rlhf-chatbot

Here are the steps if someone finds it useful:

```
git clone https://huggingface.co/CarperAI/stable-vicuna-13b-delta
cd stable-vicuna-13b-delta
pip install torch tqdm transformers sentencepiece
python3 apply_delta.py --base-model-path <path-to-llama-weights-in-transformers-format> --target-model-path stable-vicuna-13b --delta-path .
cd llama.cpp
python convert-pth-to-ggml.py ./models/stable-vicuna-13b 1
./quantize ./models/stable-vicuna-13b/ggml-model-f16.bin ./my-models/stable-vicuna-13b/ggml-model-q4_0.bin q4_0
```

To convert LLaMA weights to the transformers format, I used this guide:

```
git clone git@github.com:huggingface/transformers.git
cd transformers
pip install accelerate protobuf==3.20 sentencepiece tokenizers
python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir <path_to_llama_weights> --model_size 13B --output_dir <path-to-llama-weights-in-transformers-format>
```

(this requires ~30 GB of RAM)
-
How are folks running these models w/ reasonable latency? I've tested
-
Slightly off topic, but on the subject of mlock: is there any credence to the idea that pushing the limits of your RAM and resorting to swap space on a modern SSD could burn it out relatively quickly? I was considering getting an external Thunderbolt SSD just for swap, because with 32 GB RAM these 30B-parameter models really do seem to swap a lot, even though I have mlock on and they say they fit within my available RAM (I guess the swapping is happening in other apps such as Chrome, if I don't kill it).
-
> is there any credence to the idea that pushing the limits of your RAM and resorting to swap space on a modern SSD could burn it out relatively quickly?

This isn't true. However, this is not a solution either. Swapping even on a fast modern NVMe will increase token generation time by orders of magnitude.

BTW, you should be able to fully fit a 30B model in 32 GB of RAM after swapping out some of the unnecessary programs.

Regards,
Serhii.
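For a rough sense of why a 30B model can fit in 32 GB, here is a back-of-the-envelope sketch. It assumes the current q4_0 layout (18 bytes per block of 32 weights) and roughly 32.5 billion parameters for LLaMA-30B; real files come out somewhat larger since some tensors are kept at higher precision, and the context buffer adds more on top:

```python
# Back-of-the-envelope memory estimate for a q4_0-quantized model.
# q4_0 stores each block of 32 weights as a 2-byte fp16 scale plus 16 bytes of 4-bit values.
BYTES_PER_BLOCK = 18
WEIGHTS_PER_BLOCK = 32

def q4_0_size_gib(n_params: float) -> float:
    """Approximate size in GiB for n_params weights quantized to q4_0."""
    return n_params / WEIGHTS_PER_BLOCK * BYTES_PER_BLOCK / 2**30

print(f"LLaMA-30B (~32.5e9 params): {q4_0_size_gib(32.5e9):.1f} GiB")  # ~17 GiB
print(f"LLaMA-13B (~13.0e9 params): {q4_0_size_gib(13.0e9):.1f} GiB")  # ~7 GiB
```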
-
If we merge a ton of LoRAs, will the 13B beat the 65B version like LLaMA beat GPT-3?
-
I've been using Vicuna for question answering. I'm using the py-bindings (

My prompt template is:

I'm initializing the model:

I notice it will answer questions in a rhetorical style with

Have others seen this and/or should I be using an alternative prompt?
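One common mitigation is to add explicit stop sequences so generation halts before the model starts a new rhetorical turn. An illustrative sketch using the llama-cpp-python bindings; the model path, template wording, and stop strings are assumptions, not the poster's actual setup:

```python
from llama_cpp import Llama

# Placeholder model path; point it at your own quantized Vicuna file.
llm = Llama(model_path="./models/ggml-vicuna-13b-q4_0.bin", n_ctx=2048)

# A Vicuna-style chat template (assumed wording).
TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed answers to the user's questions.\n"
    "USER: {question}\nASSISTANT:"
)

out = llm(
    TEMPLATE.format(question="What year did Apollo 11 land on the Moon?"),
    max_tokens=128,
    stop=["USER:", "\nQ:"],  # cut the answer off before a new turn begins
)
print(out["choices"][0]["text"].strip())
```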
-
Just notice,
-
It is going to be necessary to provide an automatic version conversion script. People who quantise a model aren't going to redo them, and users don't have the ability to figure out compatibility issues.
On Wed, Jul 26, 2023 at 9:58 AM MaratZakirov wrote:
> I guess the ggml version might be the issue. Thank you for the link, will try it.
-
The software should output a diagnostic explaining that it is dealing with
a prior version.
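A rough sketch of what such a diagnostic could look like: peek at the 4-byte magic at the start of the model file and report which ggml file generation it appears to be. The magic constants below are the ones used in llama.cpp's loaders at the time; the descriptions are approximate:

```python
import struct
import sys

# Known ggml file magics (little-endian uint32 at file offset 0).
KNOWN_MAGICS = {
    0x67676D6C: "ggml - original unversioned format (needs re-conversion)",
    0x67676D66: "ggmf - early versioned format, superseded by ggjt",
    0x67676A74: "ggjt - mmap-able format used by newer llama.cpp builds",
}

with open(sys.argv[1], "rb") as f:
    (magic,) = struct.unpack("<I", f.read(4))

print(KNOWN_MAGICS.get(magic, f"unknown magic 0x{magic:08x} - probably not a ggml model file"))
```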
-
Link to blog post, demo and GH: https://vicuna.lmsys.org/, https://chat.lmsys.org/, https://github.com/lm-sys/FastChat
This looks like the most capable LLaMA right now. They're yet to release the weights :)