spike: small model presets experiments #976

Open
jalling97 opened this issue Sep 3, 2024 · 16 comments

jalling97 (Contributor) commented Sep 3, 2024

Model presets

LeapfrogAI currently has two primary models used on the backend, but more should be added and tested. By deploying a selection of small models and evaluating their efficacy from a human perspective, we can make better decisions about which models to use and evaluate against.

Goal

To determine a list of models and model configs that work well in LFAI from a human-in-the-loop perspective (no automated evals).

Methodology

  • Determine a short list (up to 5) of models to test in LFAI (sourcing them from Hugging Face is the simplest way to do this)
  • Run LFAI in a local dev context, replacing the model backend with one of these choices
  • Change the config parameters to gauge performance
  • Record each config alongside its results (a minimal sketch of this loop is shown below)
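
To make the last three bullets concrete, here is a minimal standalone sketch using llama-cpp-python directly. The model paths, prompt, and config values are illustrative placeholders, and LFAI's real backend is configured through its own deployment manifests rather than a script like this.

```python
# Minimal standalone sketch (not LFAI's actual backend wiring): load a candidate
# GGUF model with llama-cpp-python, vary a couple of generation parameters, and
# record the exact config used for each run so the experiment can be replicated.
import json

from llama_cpp import Llama

# Hypothetical local paths to downloaded GGUF files.
candidate_models = {
    "hermes-2-pro-mistral-7b-q4_0": "models/Hermes-2-Pro-Mistral-7B-Q4_0.gguf",
    "phi-3-mini-128k-instruct-q4_0": "models/Phi-3-mini-128k-instruct-Q4_0.gguf",
}

configs = [
    {"n_ctx": 4096, "temperature": 0.2, "max_tokens": 512},
    {"n_ctx": 4096, "temperature": 0.8, "max_tokens": 512},
]

prompt = "Summarize the trade-offs of running quantized models on a single GPU."

runs = []
for name, path in candidate_models.items():
    for cfg in configs:
        llm = Llama(model_path=path, n_ctx=cfg["n_ctx"], n_gpu_layers=-1, verbose=False)
        out = llm.create_chat_completion(
            messages=[{"role": "user", "content": prompt}],
            temperature=cfg["temperature"],
            max_tokens=cfg["max_tokens"],
        )
        runs.append({
            "model": name,
            "config": cfg,
            "response": out["choices"][0]["message"]["content"],
        })

# Persist model, config, and output together for later human scoring.
with open("runs.json", "w") as f:
    json.dump(runs, f, indent=2)
```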

Limitations

  • Model licenses must be very permissive (e.g., MIT, Apache-2.0)
  • Models must be compatible with vLLM or llama-cpp-python (the two frameworks currently supported by LFAI)
  • Limited VRAM requirements (12-16 GB including model weights and context)

Delivery

  • A list of models, different config options, and a respective set of subjective scores gauging their performance within LFAI (a sketch of one possible record format is shown below)
  • A report of the methodology used to evaluate these models (so the experiment can be replicated)
  • Potentially a repository that contains code used to run these evaluations (and what was evaluated on)
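
For the scores themselves, one possible (hypothetical) record format is a simple CSV with one row per model/config/prompt combination:

```python
# Hypothetical record format for the subjective-scores deliverable: one row per
# (model, quantization, config, prompt) with a 1-5 human rating and free-form notes.
import csv

fieldnames = ["model", "quantization", "config", "prompt_id", "score_1_to_5", "notes"]
rows = [
    {
        "model": "Hermes-2-Pro-Mistral-7B",
        "quantization": "Q4_0",
        "config": "n_ctx=4096, temperature=0.2",
        "prompt_id": "summarization-01",
        "score_1_to_5": 4,
        "notes": "Concise, no fabricated details.",
    },
]

with open("subjective_scores.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```
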
jalling97 added the spike label Sep 5, 2024

jxtngx commented Sep 6, 2024

I can take this.

Here's an example from AWS on collecting HIL feedback for LM evals:

https://github.com/aws-samples/human-in-the-loop-llm-eval-blog

Are the questions in that example appropriate for the intent of this experiment?


jxtngx commented Sep 6, 2024

I realize that not all of the models may be supported by llama-cpp-python or vLLM, and that I may need to add support for a custom model, especially for CPU-only use on macOS.

Here are two GGUF conversion examples for reference:

  1. https://github.com/ggerganov/llama.cpp/discussions/2948
  2. https://brev.dev/blog/convert-to-llamacpp
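
A rough sketch of one conversion path is below. The script name and flags vary across llama.cpp versions (older checkouts use convert-hf-to-gguf.py or convert.py), and the model choice is only an example, so treat this as an outline rather than exact commands.

```python
# Rough outline of converting an HF model to GGUF with llama.cpp's conversion
# script. Assumes a local llama.cpp checkout; script name/flags differ by version.
import subprocess

from huggingface_hub import snapshot_download

# Example model only; any HF repo with a supported architecture works the same way.
model_dir = snapshot_download("microsoft/Phi-3-mini-128k-instruct")

subprocess.run(
    [
        "python",
        "llama.cpp/convert_hf_to_gguf.py",
        model_dir,
        "--outfile", "Phi-3-mini-128k-instruct-Q8_0.gguf",
        "--outtype", "q8_0",
    ],
    check=True,
)
```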


jxtngx commented Sep 6, 2024

As for the models that have proprietary licenses – if anything, it will be beneficial to run these subjective evals and provide feedback for the community.

From a user perspective – I'd probably opt for these models on my own, forgoing the default selections – though I understand the need for the LFAI team to ship with a default model that is very permissive in terms of licensing.

jalling97 (Contributor, Author) commented:

For reference, I've updated the issue description a bit to help clarify a few things.

Regarding the AWS HIL example: that framework makes sense for what we're asking for, so if you'd like to use it as a basis, go for it!

I added these to the description, but we have a few limitations I didn't outline originally:

  • Model licenses must be very permissive (e.g., MIT, Apache-2.0)
  • Models must be compatible with vLLM or llama-cpp-python (the two frameworks currently supported by LFAI)
  • Limited VRAM requirements (12-16 GB including model weights and context)

If you're on macOS, we won't ask you to work outside of the deployment context available to you, so anything that can be run on llama-cpp-python is great. For simplicity, let's stick to model licenses that are at least as permissive as Apache-2.0.

The VRAM requirements are ideally under 12 GB, but anything that fits under 16 GB is worth checking for our purposes (i.e., single-GPU laptop deployment scenarios). That likely means quantizations, so if you can find quantizations of the models you want to test, great! We're also open to managing our own quantized models, so feel free to experiment with your own quantizations if you want to, but it's certainly not required.
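
As a rough back-of-the-envelope check (not an exact measurement), quantized weight size plus KV cache gives a first approximation of VRAM use; the numbers below assume a Llama-3.1-8B-like architecture with an fp16 KV cache.

```python
# Back-of-the-envelope VRAM estimate: quantized weights ~= params * bits / 8,
# KV cache ~= 2 (K and V) * layers * context * kv_heads * head_dim * bytes/elem.
# Architecture numbers assume a Llama-3.1-8B-like model; runtime overhead is extra.
params = 8.0e9           # ~8B parameters
bits_per_weight = 4.5    # Q4_0 averages a bit over 4 bits per weight
weights_gb = params * bits_per_weight / 8 / 1e9

n_layers, n_kv_heads, head_dim = 32, 8, 128
n_ctx = 8192
bytes_per_elem = 2       # fp16 KV cache
kv_cache_gb = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_cache_gb:.1f} GB "
      f"~= {weights_gb + kv_cache_gb:.1f} GB before runtime overhead")
```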

It would also be very helpful to compare any models you test against the current defaults from the docs (as you already listed); that would make a fantastic point of comparison.

jalling97 added this to the EVERGREEN milestone Sep 6, 2024

jxtngx commented Sep 6, 2024

Important

RWKV is a recurrent model architecture (paper)

RWKV is a different architecture from transformer-based models. The model arch is not available in llama-cpp, but it can be made available for use with llama-cpp-python by using the gguf-my-repo HF Space.

With regard to the HF Space, a user must understand the quantization key provided below:

https://huggingface.co/docs/hub/en/gguf#quantization-types

Note

Quantized, llama-cpp-python-compatible models will be made available in this HF collection.


jxtngx commented Sep 23, 2024

I've successfully installed and deployed LFAI on my personal machine and can proceed by interacting with (1) the UI and/or (2) the API.

  1. Using the UI may be more efficient for collecting (fewer) results faster, since I won't have to learn the API or write any code. Additionally, using the UI would let me provide UX feedback on both the UI itself and on deploying LFAI on macOS.

  2. Using the API would be more effective for collecting a larger set of results into a tabular dataset that can be shared with several people, who could then provide subjective feedback. Given the objective is to select a model, not to provide UX feedback on the UI, it is probably best to learn the API so that I can provide the output to the team for subjective scoring (see the sketch after this list). Additionally, doing so would let me pivot to providing feedback from a DevX perspective.
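
A minimal sketch of the API approach, assuming LFAI's OpenAI-compatible endpoints; the base URL, API key, model name, and prompts below are placeholders for whatever the local deployment actually exposes:

```python
# Sketch: send a fixed prompt set to an OpenAI-compatible LFAI endpoint and
# collect the responses into a CSV that reviewers can score subjectively.
# Base URL, API key, and model name are placeholders, not real values.
import csv

from openai import OpenAI

client = OpenAI(
    base_url="https://leapfrogai-api.example.com/openai/v1",  # placeholder
    api_key="local-api-key",                                  # placeholder
)

model_name = "llama-cpp-python"  # placeholder backend/model identifier

prompts = {
    "summarization-01": "Summarize the key points of the following release notes: ...",
    "reasoning-01": "If a deployment has 12 GB of VRAM and the weights use 5 GB, ...",
}

with open("responses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["model", "prompt_id", "prompt", "response"])
    writer.writeheader()
    for prompt_id, prompt in prompts.items():
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        writer.writerow({
            "model": model_name,
            "prompt_id": prompt_id,
            "prompt": prompt,
            "response": resp.choices[0].message.content,
        })
```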

Which would you prefer for me to do @jalling97?


jxtngx commented Sep 23, 2024

I intend to collect results for the following models in the first iteration of this experiment:

| Model Family | Model Size | Quantization | HF Link | License |
|---|---|---|---|---|
| Llama 3.1 | 8B | 8 bit | jxtngx/Meta-Llama-3.1-8B-Instruct-Q8_0-GGUF | Llama 3.1 Community |
| Llama 3.1 | 8B | 8 bit | jxtngx/Meta-Llama-3.1-8B-Q8_0-GGUF | Llama 3.1 Community |
| Llama 3.1 | 8B | 8 bit | jxtngx/Hermes-3-Llama-3.1-8B-Q8_0-GGUF | Llama 3.1 Community |
| Mistral | 7B | 8 bit | jxtngx/Hermes-2-Pro-Mistral-7B-Q8_0-GGUF | Apache 2.0 |
| Phi 3 Mini | 3.8B | 8 bit | jxtngx/Phi-3-mini-128k-instruct-Q8_0-GGUF | MIT |

jalling97 (Contributor, Author) commented:

@jxtngx let's go with Option 2. There's lots of value in working with the API directly as, like you mention, it'll allow you to iterate faster. The LeapfrogAI team would also greatly benefit from your feedback using the API.

As for the models, those look like great choices! I'm curious to see the impacts of instruct vs. base vs. Hermes 3. I would focus more on the 4-bit quantizations, as that tends to be a slightly better fit for single-GPU laptop deployment scenarios.


jxtngx commented Sep 23, 2024

Sounds good.

I'll create the 4-bit versions and then share a new table with links.


jxtngx commented Sep 23, 2024

4-bit models:

| Model Family | Model Size | Quantization | Model | License |
|---|---|---|---|---|
| Llama 3.1 | 8B | 4 bit | jxtngx/Meta-Llama-3.1-8B-Q4_0-GGUF | Llama 3.1 Community |
| Llama 3.1 | 8B | 4 bit | jxtngx/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF | Llama 3.1 Community |
| Llama 3.1 | 8B | 4 bit | jxtngx/Hermes-3-Llama-3.1-8B-Q4_0-GGUF | Llama 3.1 Community |
| Mistral | 7B | 4 bit | jxtngx/Hermes-2-Pro-Mistral-7B-Q4_0-GGUF | Apache 2.0 |
| Phi 3 Mini | 3.8B | 4 bit | jxtngx/Phi-3-mini-128k-instruct-Q4_0-GGUF | MIT |


jxtngx commented Sep 25, 2024

Please note that, for each of the 5 models, there are a few flavors of 4-bit quantized versions in the HF collection:

  • Q4_0
  • Q4_K_M
  • Q4_K_S

The tl;dr on the types is that the K-type quants are favored over the 0-type quants, as the latter is considered a legacy quantization method.

Please see below for more on the quantization types found in GGUF:

  1. https://huggingface.co/docs/hub/en/gguf#quantization-types
  2. https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/


jxtngx commented Sep 25, 2024

Meta released Llama 3.2 on 25 Sep '24, and the new family of models includes 1B and 3B versions, which ought to be evaluated against the Phi 3 Mini and Small versions.

release notes

https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices

1B variants

base: https://huggingface.co/meta-llama/Llama-3.2-1B
instruct: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

3B variants

base: https://huggingface.co/meta-llama/Llama-3.2-3B
instruct: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

4-bit quantized models

1B Instruct: jxtngx/Llama-3.2-1B-Instruct-Q4_K_M-GGUF
3B Instruct: jxtngx/Llama-3.2-3B-Instruct-Q4_K_M-GGUF

cc @jalling97

jalling97 (Contributor, Author) commented:

@jxtngx good callout!

Including these new models in the comparison would be great. To keep us from over-exploring in too many directions, feel free to take Llama 3.1 8B base out of the comparison list, as the instruct finetune is usually what we'd lean toward anyway.


jxtngx commented Sep 30, 2024

Just wondering – how were the current default models selected?

jalling97 (Contributor, Author) commented:

The current default models were selected in the Fall of 2023 based on finding a balance between model performance and GPU requirements. The defaults needed to be small enough to run on GPU-enabled edge deployments while maximizing performance on the standard evaluations at the time.

cc @justinthelaw @gphorvath if either of you want to add more context
