
Model Wishlist #156

Open
3 of 11 tasks
EricLBuehler opened this issue Apr 16, 2024 · 80 comments
Labels
models Additions to model or architectures

Comments

@EricLBuehler
Owner

EricLBuehler commented Apr 16, 2024

Please let us know what model architectures you would like to be added!

Up-to-date todo list below. Please feel free to contribute any model; a PR without device mapping, ISQ, etc. will still be merged!

Language models

  • snowflake-arctic-instruct: Snowflake/snowflake-arctic-instruct
  • WizardLM-2: alpindale/WizardLM-2-8x22B
  • Command R: CohereForAI/c4ai-command-r-v01
  • Command R+: CohereForAI/c4ai-command-r-plus

Multimodal models

Embedding models

  • T5: google-t5/t5-base
  • nomic-text-embed: nomic-ai/nomic-embed-text-v1
@EricLBuehler EricLBuehler added the models Additions to model or architectures label Apr 16, 2024
@EricLBuehler EricLBuehler mentioned this issue Apr 16, 2024
@EricLBuehler EricLBuehler pinned this issue Apr 16, 2024
@NiuBlibing

qwen1.5-72B-Chat

@NiuBlibing

llama3

@EricLBuehler
Owner Author

@NiuBlibing, we have llama3 support ready: the README has a few examples. I will add Qwen support shortly.

@EricLBuehler
Owner Author

@NiuBlibing, I just added Qwen2 support. Quantized Qwen2 support will be added in the next few days.

@cargecla1

@francis2tm

Hello!
Any plans for adding multimodal (e.g. llava) and embedding models?

@EricLBuehler
Owner Author

Can you add https://huggingface.co/Snowflake/snowflake-arctic-instruct?

@cargecla1, yes! It will be a great use case for ISQ.

@EricLBuehler
Owner Author

Hello!
Any plans for adding multimodal (e.g. llava) and embedding models?

@francis2tm, yes. I plan on supporting Llava and embedding models this week.

@EricLBuehler
Owner Author

@NiuBlibing, you can run Qwen now with ISQ, which will quantize it.
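
For anyone following along, ISQ (in-situ quantization) is enabled with the `--isq` flag. A sketch along the lines of the commands used later in this thread (the model ID and quantization level here are illustrative, so adjust to your setup):

```
# Load the plain safetensors weights and quantize them to Q4K at load time
./mistralrs-server -i --isq Q4K plain -m Qwen/Qwen1.5-72B-Chat -a qwen2
```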

@kir-gadjello

Would be nice to support at least one strong vision-language model: https://huggingface.co/openbmb/MiniCPM-V-2 https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, with an option to compute the visual frontend model on CPU. You might find it easier to ship the visual transformer part via ONNX.

@chelbos

chelbos commented Apr 29, 2024

Would love to see some DeepSeek-VL; this model is better than Llava and supports multiple images per prompt:
https://huggingface.co/collections/deepseek-ai/deepseek-vl-65f295948133d9cf92b706d3

@chelbos

chelbos commented Apr 29, 2024

Also, outside the LLM world, would love to see support for https://github.com/cvg/LightGlue :) but not sure if that's possible ...

@jett06

jett06 commented Apr 29, 2024

Could you add support for GGUF-quantized Phi-3-Mini to the wishlist? Currently, this fails (built from master):

Running `./mistralrs-server gguf -m PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed -t microsoft/Phi-3-mini-128k-instruct -f /home/jett/Downloads/llms/Phi-3-mini-128k-instruct-q3_K_S.gguf`
2024-04-29T03:08:35.180939Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: false
2024-04-29T03:08:35.180975Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-04-29T03:08:35.180982Z  INFO mistralrs_server: Loading model `microsoft/Phi-3-mini-128k-instruct` on Cpu...
2024-04-29T03:08:35.180989Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-04-29T03:08:35.181017Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-04-29T03:08:35.181048Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
2024-04-29T03:08:35.181122Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-04-29T03:08:35.181133Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
Error: Unknown GGUF architecture `phi3`

@rodion-m

It'll be great to see WizardLM-2 and suzume. And thanks for a great tool!

@W4G1

W4G1 commented Apr 29, 2024

Command-R and Command-R+ from Cohere would be amazing 🙏

@yongkangzhao

T5
LLAVA

@EricLBuehler
Owner Author

@kir-gadjello

Would be nice to support at least one strong vision-language model: https://huggingface.co/openbmb/MiniCPM-V-2 https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, with an option to compute the visual frontend model on CPU. You might find it easier to ship the visual transformer part via ONNX.

Supporting a vision+language or multimodal model is very high priority right now.


@chelbos

Would love to see some DeepSeek-VL; this model is better than Llava and supports multiple images per prompt:
https://huggingface.co/collections/deepseek-ai/deepseek-vl-65f295948133d9cf92b706d3

I'll add this one too.

Also, outside the LLM world, would love to see support for https://github.com/cvg/LightGlue :) but not sure if that's possible ...

I will look into it!


@jett06

Could you add support for GGUF-quantized Phi-3-Mini to the wishlist?

Yes, absolutely, I think it should be easy. In the meantime, you can use ISQ to get the same speed.


@rodion-m

It'll be great to see WizardLM-2 and suzume. And thanks for a great tool!

Thanks! I think suzume is just finetuned Llama so that can be used already. I'll add WizardLM.


@W4G1

Command-R and Command-R+ from Cohere would be amazing 🙏

Yes, I'll add those.


@yongkangzhao

T5 and LLaVA

Yes, I'll add those. T5 will be a nice smaller model.

@jett06

jett06 commented Apr 29, 2024

@EricLBuehler Thanks for your reply, for adding my suggestion to the model wishlist, and for developing such an awesome project! It's very appreciated :)

@ldt

ldt commented Apr 30, 2024

Congrats for your great work!
+1 for vision models; Idefics2-8b or better would be awesome

@maximus2600

it would be nice to add some embedding models like nomic-text-embed.

@progressionnetwork

Hello, first of all, I want to express my appreciation for the excellent work your team has accomplished on the mistral.rs engine. It's a great project.

I am currently developing a personal AI assistant using Rust, and I believe integrating additional features into your engine could significantly enhance its utility and appeal. Specifically, adding support for Whisper and incorporating Text-to-Speech (TTS) functionalities, such as StyleTTS or similar technologies, would be incredibly beneficial. This would enable the engine to handle LLM inference, speech-to-text, and text-to-speech in a unified system with very low latency (near real-time).

Implementing these features could transform the engine into a more versatile tool for developers like myself, who are keen on building more integrated and efficient AI applications.

@EricLBuehler
Owner Author

@jett06, I just added quantized GGUF Phi-3 support in #276! That is without LongRope support currently, but you can use a plain model with ISQ.

@jett06

jett06 commented May 9, 2024

@EricLBuehler Woah, thank you so much! This will be lovely for us folks with less powerful computers or size constraints, you're awesome :)

@EricLBuehler
Owner Author

@jett06, my pleasure! I just fixed a small bug (in case you saw the strange behavior), so it should be all ready to go now!

@NeroHin

NeroHin commented May 10, 2024

IBM's Granite series Code Models.

Granite Code Models

@LLukas22
Contributor

@NeroHin

IBM's Granite series Code Models.

Granite Code Models

The 3b and 8b variants should already be supported as they are just based on the llama architecture.

The 20b and 34b variants are based on the GPTBigCode architecture which currently isn't implemented in mistral.rs.

@chenwanqq
Contributor

Hello! Any plans for adding multimodal (e.g. llava) and embedding models?

I'm working on it now: chenwanqq/candle-llava
It's not easy, dude: tons of image preprocessing and tensor concatenation.

@EricLBuehler
Owner Author

Ok, great.

@chenwanqq
Contributor

Ok, great.

You can check my #422. I hope you don't mind me modifying the API of Nonzero 🙉

@EricLBuehler
Owner Author

Not a problem 😄

@wseaton
Contributor

wseaton commented Jun 11, 2024

@NeroHin

IBM's Granite series Code Models.
Granite Code Models

The 3b and 8b variants should already be supported as they are just based on the llama architecture.

The 20b and 34b variants are based on the GPTBigCode architecture which currently isn't implemented in mistral.rs.

The 3b and 8b variants do not work out of the box: they rely on tied word embeddings (which I was able to get working in mistral.rs), but the BPE tokenizer breaks because there are some tokens in the vocab list that are > 255 characters.

+1 to getting support for GPTBigCode and other starcoder model variants.

@chenwanqq
Contributor

@EricLBuehler I'm still working on LLaVA. Meanwhile, given your experience with Rust and Candle, have you ever encountered any problems with memory usage? I have some confusion about it. huggingface/candle#2273 (comment)

@EricLBuehler
Owner Author

@chenwanqq, that is great, let me know if I can help!

I replied to the discussion in huggingface/candle#2273. However, I discovered that the shadowing does mean that the big tensor will not get dropped! See this playground and my comment for more details.

I'll add a clippy lint here to avoid this on our end.

@bachp

bachp commented Jun 26, 2024

@EricLBuehler What is missing for GGUF quantized Qwen2?

@EricLBuehler
Owner Author

Hi @bachp, that should be relatively easy to add, it would take inspiration from the other GGUF models such as quantized_phi3.rs. Do you think you would be able to add this?

@EricLBuehler
Owner Author

We will be adding the Gemma 2 models shortly, see #486!

@EricLBuehler
Owner Author

@francis2tm @chelbos @yongkangzhao we just merged LLaVA and LLaVA Next support. Kudos to @chenwanqq for their great work!

For vision models we now have:

  • Idefics 2
  • Phi 3 vision
  • LLaVA and LLaVA Next

@csicar
Contributor

csicar commented Jul 5, 2024

I may be able to provide an implementation of Whisper ASR, if there is interest in that.

@sammcj

sammcj commented Jul 7, 2024

It doesn't look like it's been mentioned yet, but DeepSeek Coder V2 (Lite) support would be amazing given it's probably the best coding model out there.

@EricLBuehler
Owner Author

@csicar that would be amazing!

@EricLBuehler
Owner Author

It doesn't look like it's been mentioned yet, but DeepSeek Coder V2 (Lite) support would be amazing given it's probably the best coding model out there.

@sammcj that would be great, I can add that.

@bachp

bachp commented Jul 15, 2024

Hi @bachp, that should be relatively easy to add, it would take inspiration from the other GGUF models such as quantized_phi3.rs. Do you think you would be able to add this?

Not sure I'm up to the task yet. However, I noticed that candle added support for quantized Qwen2; can we re-use this?

@EricLBuehler
Owner Author

@bachp yes. If you want to add that, feel free. I can take a look in a few days.

@joshpopelka20
Contributor

I'd like to try mistralai/Mistral-Nemo-Instruct-2407 (https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407). It sounds like it has a similar architecture to Mistral 7B, so I'm hoping it won't be too much work.

@EricLBuehler
Owner Author

@joshpopelka20 I just merged this in #595!

cargo run --release --features ... -- -i --isq Q4K plain -m mistralai/Mistral-Nemo-Instruct-2407 -a mistral
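
Once loaded, the server can also be queried over its OpenAI-compatible HTTP API. A sketch assuming the server is running locally (the port and model name here are illustrative):

```
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-nemo",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```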

@joshpopelka20
Contributor

joshpopelka20 commented Jul 22, 2024

@EricLBuehler thanks for adding that feature. I haven't been able to get it to run, as I'm having an issue with the paged attention code. I'll open an issue to track it and give more details.

@fredconex

Codestral Mamba

@oldgithubman

Athene, if it isn't already covered by the Llama architecture

@Remember20240719

Remember20240719 commented Jul 28, 2024

Hello, thank you for open-sourcing this project!

I would be interested in running Mistral Large Instruct 2407 GGUF.

Trying to run inference on the Q5_K_S quant with mistral.rs commit 38fb942, I get:

MISTRALRS_DEBUG=1 ./target/release/./mistralrs-server --port 1234 --throughput gguf --quantized-model-id $D/models/ --quantized-filename Mistral-Large-Instruct-2407-Q5_K_S-00001-of-00003.gguf 
2024-07-28T17:47:45.176615Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-07-28T17:47:45.176929Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2024-07-28T17:47:45.177583Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-07-28T17:47:45.180482Z  INFO mistralrs_core::pipeline::paths: Loading `Mistral-Large-Instruct-2407-Q5_K_S-00001-of-00003.gguf` locally at `$D/models/Mistral-Large-Instruct-2407-Q5_K_S-00001-of-00003.gguf`
2024-07-28T17:47:45.181595Z  INFO mistralrs_core::pipeline::gguf: Loading model `$D/models/` on cpu.
2024-07-28T17:47:45.266615Z  INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.basename: Mistral
general.file_type: 16
general.finetune: Instruct
general.languages: en, fr, de, es, it, pt, zh, ja, ru, ko
general.license: other
general.license.link: https://mistral.ai/licenses/MRL-0.1.md
general.license.name: mrl
general.name: Mistral Large Instruct 2407
general.quantization_version: 2
general.size_label: Large
general.type: model
general.version: 2407
llama.attention.head_count: 96
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 88
llama.context_length: 131072
llama.embedding_length: 12288
llama.feed_forward_length: 28672
llama.rope.dimension_count: 128
llama.rope.freq_base: 1000000
llama.vocab_size: 32768
quantize.imatrix.chunks_count: 148
quantize.imatrix.dataset: /training_dir/calibration_datav3.txt
quantize.imatrix.entries_count: 616
quantize.imatrix.file: /models_out/Mistral-Large-Instruct-2407-GGUF/Mistral-Large-Instruct-2407.imatrix
split.count: 3
split.no: 0
split.tensors.count: 795
2024-07-28T17:47:45.267503Z  INFO mistralrs_core::pipeline::gguf: Debug is enabled, wrote the names and information about each tensor to `mistralrs_gguf_tensors.txt`.
2024-07-28T17:47:45.316860Z  INFO mistralrs_core::gguf::gguf_tokenizer: GGUF tokenizer model is `llama`, kind: `Unigram`, num tokens: 32768, num added tokens: 0, num merges: 0, num scores: 32768
2024-07-28T17:47:45.316880Z  INFO mistralrs_core::gguf::gguf_tokenizer: Tokenizer: Tokenizer(TokenizerImpl { normalizer: Some(Sequence(Sequence { normalizers: [Prepend(Prepend { prepend: "▁" }), Replace(Replace { pattern: String(" "), content: "▁", regex: SysRegex { regex: Regex { raw: 0x571b958c9200 } } })] })), pre_tokenizer: None, model: Unigram(Unigram { vocab: 32768, unk_id: Some(0), byte_fallback: true }), post_processor: None, decoder: Some(Sequence(Sequence { decoders: [Replace(Replace { pattern: String("▁"), content: " ", regex: SysRegex { regex: Regex { raw: 0x571b958c9000 } } }), ByteFallback(ByteFallback { type_: MustBe!("ByteFallback") }), Fuse(Fuse { type_: MustBe!("Fuse") }), Strip(Strip { content: ' ', start: 1, stop: 0 })] })), added_vocabulary: AddedVocabulary { added_tokens_map: {"</s>": 2, "<unk>": 0, "<s>": 1}, added_tokens_map_r: {2: AddedToken { content: "</s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, 0: AddedToken { content: "<unk>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, 1: AddedToken { content: "<s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }}, added_tokens: [], special_tokens: [AddedToken { content: "<s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, AddedToken { content: "</s>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }, AddedToken { content: "<unk>", single_word: false, lstrip: false, rstrip: false, normalized: false, special: true }], special_tokens_set: {"<unk>", "<s>", "</s>"}, split_trie: (AhoCorasick(dfa::DFA(
D 000000: \x00-\x0E => 0
F 000016:
* 000032: \x00-\x0E => 0
 matches: 1
* 000048: \x00-\x0E => 0
 matches: 2
* 000064: \x00-\x0E => 0
 matches: 0
 >000080: \x00-\x02 => 80, \x03 => 208, \x04-\x0E => 80
  000096: \x00-\x02 => 0, \x03 => 208, \x04-\x0E => 0
  000112: \x00-\x02 => 80, \x03 => 208, \x04-\n => 80, \x0B => 128, \x0C-\x0E => 80
  000128: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 32, \x06-\x0E => 80
  000144: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 64, \x06-\x0E => 80
  000160: \x00-\x02 => 80, \x03 => 208, \x04-\x08 => 80, \t => 176, \n-\x0E => 80
  000176: \x00-\x02 => 80, \x03 => 208, \x04-\x06 => 80, \x07 => 192, \x08-\x0E => 80
  000192: \x00-\x02 => 80, \x03 => 208, \x04 => 80, \x05 => 48, \x06-\x0E => 80
  000208: \x00 => 80, \x01 => 112, \x02 => 80, \x03 => 208, \x04-\n => 80, \x0B => 144, \x0C => 80, \r => 160, \x0E => 80
match kind: LeftmostLongest
prefilter: true
state length: 14
pattern length: 3
shortest pattern length: 3
longest pattern length: 5
alphabet length: 15
stride: 16
byte classes: ByteClasses(0 => [0-46], 1 => [47], 2 => [48-59], 3 => [60], 4 => [61], 5 => [62], 6 => [63-106], 7 => [107], 8 => [108-109], 9 => [110], 10 => [111-114], 11 => [115], 12 => [116], 13 => [117], 14 => [118-255])
memory usage: 992
)
), [1, 2, 0]), split_normalized_trie: (AhoCorasick(dfa::DFA(
D 000000: \x00 => 0
F 000001:
 >000002: \x00 => 2
  000003: \x00 => 0
match kind: LeftmostLongest
prefilter: false
state length: 4
pattern length: 0
shortest pattern length: 18446744073709551615
longest pattern length: 0
alphabet length: 1
stride: 1
byte classes: ByteClasses(0 => [0-255])
memory usage: 16
)
), []), encode_special_tokens: false }, truncation: None, padding: None })
2024-07-28T17:47:45.318706Z  INFO mistralrs_core::gguf::chat_template: Discovered and using GGUF chat template: `{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content'] %}\n    {%- set loop_messages = messages[1:] %}\n{%- else %}\n    {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n        {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n    {%- endif %}\n    {%- if message['role'] == 'user' %}\n        {%- if loop.last and system_message is defined %}\n            {{- '[INST] ' + system_message + '\n\n' + message['content'] + '[/INST]' }}\n        {%- else %}\n            {{- '[INST] ' + message['content'] + '[/INST]' }}\n        {%- endif %}\n    {%- elif message['role'] == 'assistant' %}\n        {{- ' ' + message['content'] + eos_token}}\n    {%- else %}\n        {{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}\n    {%- endif %}\n{%- endfor %}\n`
Error: cannot find tensor info for output_norm.weight

Attached: mistralrs_gguf_tensors.txt

@dancixx

dancixx commented Aug 18, 2024

Hi guys, thanks for the awesome work. Is there any plan to support Idefics3 and InternVl2?

@bhupesh-sf

Hey, thanks for this awesome work; it allows people with fewer resources to run LLMs and VLMs on their machines.

Are we planning to support TTS, STT, and image generation models as well? There is a lot of buzz around Flux.1 these days. There are also some good open-source models out there for voice cloning, etc.

But once again I must appreciate projects like these to help out the community. 🥇

@EricLBuehler
Owner Author

@bhupesh-sf, yes, I'm planning to expand into the multimodal space with a broad variety of models. As you suggested, TTS, STT, and image generation are all on the table as well as embedding models.

@dancixx, yes, I plan to add Idefics 3 at least!

@jasinco

jasinco commented Aug 22, 2024

So, does it support DeepSeek Coder yet?

@pigfoot

pigfoot commented Sep 3, 2024

I'd appreciate it if Qwen2-VL could be considered: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct

@ariqpradipa

ariqpradipa commented Sep 17, 2024

I suggest considering the addition of https://huggingface.co/openbmb/MiniCPM3-4B as well.

@youcefs21

Pixtral! https://mistral.ai/news/pixtral-12b/

@oldgithubman

DeepSeek-V2.5
