
First take on a complete set of commands and responses #1

Merged · 9 commits into main from protocol · Mar 5, 2024

Conversation

@jmbejar (Collaborator) commented Feb 22, 2024

This is my proposal for the interaction between the frontend and backend parts of the system.

For most of the fields, I'm leaving inline comments as a reference.
However, to clarify some parts of the protocol, I'm attaching screenshots from LM Studio here.

Model information

(screenshot: LM Studio model information panel)

In this PR, the field information in LoadedModelInfo and DownloadedFile is meant to store this kind of information.

Context Overflow Policy

(screenshot: LM Studio "Context Overflow Policy" options)

You will see a ContextOverflowPolicy enum for this choice. I'm not really sure what LM Studio does with it, but there is a chance that the affected part is the request we make to the chat model. If that's the case, the frontend would take care of it and, as a consequence, we wouldn't need this enum in the protocol.
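
For reference, a minimal sketch of how this enum could look; the variant names below are just my guess at mapping the LM Studio options, nothing final:

```rust
// Sketch only; variant names are my guess at mapping the LM Studio options.
#[derive(Clone, Debug)]
pub enum ContextOverflowPolicy {
    StopAtLimit,          // "Stop at limit"
    TruncateMiddle,       // "Truncate middle"
    TruncatePastMessages, // "Maintain a rolling window and truncate past messages"
}
```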

@juntao (Collaborator) commented Feb 22, 2024

@DarumaDocker Please have a look and confirm that your backend database can deliver those data fields. Thanks.

@DarumaDocker (Collaborator) commented Feb 23, 2024

> @DarumaDocker Please have a look and confirm that your backend database can deliver those data fields. Thanks.

Not all of the fields in the LM Studio screenshot can be fetched from the HuggingFace API.
So maybe I should add a field named 'model_information' for someone to maintain manually. The 'prompt_template_type' and 'reverse_prompt' fields should also be moved into this new field.
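
A rough sketch of what I have in mind (field and struct names are tentative, just to illustrate the idea):

```rust
// Tentative sketch; field and struct names are not final.
pub struct DownloadedFile {
    // ...fields we can fetch from the HuggingFace API...

    // Manually maintained data that HuggingFace does not provide.
    pub model_information: ModelInformation,
}

pub struct ModelInformation {
    pub prompt_template_type: String,
    pub reverse_prompt: String,
    // ...plus the other fields shown in the LM Studio screenshot...
}
```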

@juntao (Collaborator) commented Feb 23, 2024

Yes, I think we need to maintain a model database with additional data on our side.

Do you want those fields to be explicitly defined in the channel enums?

Comment on lines 18 to 30
pub struct ModelExecutionConfig {
pub cpu_threads: u32,
pub gpu_layers: GPULayers,
pub use_mlock: bool,
pub n_batch: u32,
pub n_ctx: u32,
pub rope_freq_scale: f32,
pub rope_freq_base: f32,

// TBD Not really sure if this is something the backend manages or if it is a matter of
// the client (if it is done by tweaking the JSON payload for the chat completion)
pub context_overflow_policy: ContextOverflowPolicy
}
Collaborator

I would like you to take a look at ModelExecutionConfig, referencing LoadModel, where reverse_prompt and prompt_template are important parameters for some models.

Collaborator Author

I got it. I think it is a valid alternative, as long as it makes sense on your end. I will change it 👍
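
Roughly what I have in mind; the exact shape is to be confirmed when I push the change:

```rust
// Sketch of the direction only; exact shape to be confirmed in the actual commit.
pub struct ModelExecutionConfig {
    pub cpu_threads: u32,
    pub gpu_layers: GPULayers,
    pub use_mlock: bool,
    pub n_batch: u32,
    pub n_ctx: u32,
    pub rope_freq_scale: f32,
    pub rope_freq_base: f32,

    // Moved here so they travel with the load/execution settings,
    // since some models need them when the model is loaded.
    pub prompt_template: Option<String>,
    pub reverse_prompt: Option<String>,
}
```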

Collaborator Author

BTW, looking again at LM Studio, I found that cpu_threads is something that can change with each chat interaction, so if you're OK I'm going to remove it from this list and include it as part of the JSON payload for each chat message (where other related values such as temp and n_predict go).

(screenshot: LM Studio per-message inference parameters, including CPU Threads)

Collaborator

https://github.com/WasmEdge/WasmEdge/blob/7c13bee99c796317fc68d5bf46d737f9b51d06fe/plugins/wasi_nn/ggml.cpp#L52-L57
We have these parameters, but we don't see min-p and top-k. Are these required?

Collaborator

Let's define all the parameters in the protocol. If some are not yet supported on LlamaEdge, we can ignore them in the backend for now. Thanks.

Collaborator Author

These changes are now implemented

FileDownloadProgress(FileID, f32), // Progress value from 0.0 to 1.0
FileDownloaded(File), // The downloaded_path field is filled

LoadModelProgress(FileID, f32), // Progress value from 0.0 to 1.0
Collaborator

Do we have to provide progress updates during the model loading process, or is it sufficient to send a message when loading is complete? Currently, we are unable to report progress while the model is loading.

Collaborator Author

As I understand it, the idea is to try to implement everything we see in LM Studio, and this is something they implemented.

I will make a note to discuss it in our next weekly call, but yes, let's keep this requirement for the future.

Comment on lines 34 to 35
pub port: u16,
pub cors: bool,
Collaborator

Our server is not an HTTP server. Does it need to provide HTTP services?

Collaborator Author

One relevant feature in LM Studio is the ability to spawn a local web server to interact with the model via CURL or scripts.

@juntao What do you think? Maybe it is a topic to cover in the next weekly call, since this is a feature we were not talking about in those calls.

@L-jasmine (Collaborator)

Is it acceptable to include the sender for the response as part of the request, similar to PR2?

@DarumaDocker (Collaborator)

@jmbejar I have a question. From fake_data.rs, I see you need these fields of the model:

Model {
    id: "teknium/OpenHermes-2.5-Mistral-7B".to_string(),
    name: "OpenHermes 2.5 Mistral 7B".to_string(),
    summary: "OpenHermes 2.5 Mistral 7B is an advanced iteration of the OpenHermes 2 language model, enhanced by training on a significant proportion of code datasets. This additional training improved performance across several benchmarks, notably TruthfulQA, AGIEval, and the GPT4All suite, while slightly decreasing the BigBench score. Notably, the model's ability to handle code-related tasks, measured by the humaneval score...".to_string(),
    size: "7B params".to_string(),
    requires: "8GB+ RAM".to_string(),
    released_at: NaiveDate::from_ymd_opt(2023, 10, 29).unwrap(),
    architecture: "Mistral".to_string(),
    files: open_hermes_files,
    author: Author {
        name: "Teknium".to_string(),
        url: "https://github.com/teknium1".to_string(),
        description: "Creator of numerous chart topping fine-tunes and a Co-founder of NousResearch.".to_string(),
    },
    like_count: 10,
    download_count: 5003,
}

But I'm not able to fetch all these fields from HuggingFace's API. For the files field in particular, I can only get the file name, not the size, quantization, etc.

So we will provide an SDK for you to update this information manually.
Does it make sense to you?

@jmbejar (Collaborator, Author) commented Feb 27, 2024

> @jmbejar I have a question. From fake_data.rs, I see you need these fields of the model: […] So we will provide an SDK for you to update this information manually. Does it make sense to you?

Yeah, @juntao raised the topic in today's call. I would say the answer is "yes" (although I don't feel I'm the decision-maker here!) and we have to find a way. But it is understood that this is more complicated than anticipated.

@jmbejar (Collaborator, Author) commented Feb 28, 2024

> Is it acceptable to include the sender for the response as part of the request, similar to PR2?

I was thinking about this idea, and yes, it should be a better approach for both ends. Let me rearrange things based on your other reference PR.

@jmbejar (Collaborator, Author) commented Feb 28, 2024

> Is it acceptable to include the sender for the response as part of the request, similar to PR2?

Idea implemented in this PR now 👍

@jmbejar (Collaborator, Author) commented Feb 28, 2024

> 1. I replaced std::sync::mpsc with crossbeam, as I believe it offers better performance.
> 2. I added some new commands: Chat and LoadModel.
> 3. I moved the Sender for sending responses to the Command, which may make it clearer which command the response is coming from.

@L-jasmine I ported part of the suggested changes demonstrated in #2. Thanks for the ideas and examples!

What I did:

  • Implemented (3) as you suggested; it is a good idea for sure. In some cases I had to add specific response enums because the same channel will be used in a long-lived fashion (e.g. the Chat command receives an initial response when the model is loaded, but then we expect periodic updates about resource usage (RAM, CPU)).

  • Partially implemented what you did in (2), but I'm still sticking to the decision of having chat messages in the OpenAI API format. I mean, the request is basically a string with a JSON representation of the "inference request", and we will receive a JSON-encoded response (according to https://platform.openai.com/docs/api-reference/chat/object and https://platform.openai.com/docs/api-reference/chat/streaming). It is more complex than only sending token by token, but I feel it is more of a standard, as @juntao suggested in our Discord channel (see the sketch after this list).

  • I don't feel adding crossbeam would be a gain for now, since we're not using any of its extra features. In fact, newer Rust versions have a similar implementation (see Merge crossbeam-channel into std::sync::mpsc rust-lang/rust#93563 and Implement changes from std crossbeam-rs/crossbeam#935). In any case, if you feel strongly about this, we can add it without further discussion and measure performance and build size later.
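
To make the second bullet concrete, here is the kind of JSON payload I mean, built with serde_json purely for illustration. The model id and parameter values are made up, and the exact set of inference parameters is still under discussion:

```rust
use serde_json::json;

// Illustration only; the model id and parameter values here are made up,
// and the exact set of inference parameters is still under discussion.
fn example_chat_request() -> String {
    json!({
        "model": "open-hermes-2.5-mistral-7b",
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": "Hello!" }
        ],
        "temperature": 0.7,
        "stream": true
    })
    .to_string()
}
```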

@jmbejar (Collaborator, Author) commented Feb 28, 2024

@L-jasmine @juntao In addition, I added Result wrappers to responses so the backend can report when something goes wrong. In some cases it is just a way to have an ACK, for example for the EjectModel command.

Please let me know if you have comments on this addition.
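
For illustration, the frontend side of that ACK would look roughly like this. It assumes the Command::EjectModel(Sender<Result<()>>) variant defined in this PR; the function name and error handling are made up for the example:

```rust
use std::sync::mpsc::{channel, Sender};

// Illustration only; Command::EjectModel(Sender<Result<()>>) is the variant
// defined in this PR, everything else is made up for the example.
fn eject_model(command_sender: &Sender<Command>) {
    let (tx, rx) = channel();
    command_sender
        .send(Command::EjectModel(tx))
        .expect("backend thread is gone");

    match rx.recv() {
        Ok(Ok(())) => println!("model ejected"),                       // the ACK case
        Ok(Err(_err)) => eprintln!("backend reported an eject error"), // inspect _err as needed
        Err(_) => eprintln!("backend dropped the response channel"),
    }
}
```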

Comment on lines 87 to 92
speed: f32,
gpu_layers: u32,
cpu_threads: u32,
mlock: bool,
token_count: u32,
token_limit: u32,
Collaborator

Why is it necessary to include this metadata in ChatCompletion?

Collaborator Author (@jmbejar), Feb 28, 2024

This is a feature in LMStudio:

(screenshot: LM Studio stats shown under a chat response)

As I understand it, the numbers you see below are stats about the last chat interaction with the model.

Not saying we have to implement everything at once! However, we are trying to capture here everything we would need to mimic LM Studio's features. For instance, I would say it is OK if the initial versions of the backend implementation do not care about those fields (leaving them at zero or a default value).

Collaborator

Ok, I see. My understanding is that gpu_layers, cpu_threads, and mlock come from the frontend. The frontend is aware of them and their changes.

Collaborator Author

> Ok, I see. My understanding is that gpu_layers, cpu_threads, and mlock come from the frontend. The frontend is aware of them and their changes.

Good point.. so I can simply use the values we sent in the first place. Makes sense 👍

pub struct ChatCompletionData {
// The response from the model in JSON format
// https://platform.openai.com/docs/api-reference/chat/object
response: String,
Collaborator

Could we use a struct that implements serde to replace this string?
This would allow us to have a clear understanding of the specific contents during development.

Collaborator Author

Done!
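
For readers following along, the replacement looks roughly like this: a serde-derived struct mirroring the OpenAI chat object. This is a sketch only; the authoritative definitions are in the PR diff.

```rust
use serde::{Deserialize, Serialize};

// Sketch mirroring https://platform.openai.com/docs/api-reference/chat/object;
// the authoritative definitions are in the PR diff.
#[derive(Serialize, Deserialize, Debug)]
pub struct ChatCompletionData {
    pub id: String,
    pub choices: Vec<ChoiceData>,
    pub created: u32,
    pub model: String, // or ModelID, as defined elsewhere in the protocol
    pub usage: UsageData,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct ChoiceData {
    pub finish_reason: StopReason, // the StopReason enum defined below in this PR
    pub index: u32,
    pub message: MessageData,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct MessageData {
    pub content: String,
    pub role: String,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct UsageData {
    pub completion_tokens: u32,
    pub prompt_tokens: u32,
    pub total_tokens: u32,
}
```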

EjectModel(Sender<Result<()>>),

// The argument is the chat message in JSON format, following https://platform.openai.com/docs/api-reference/chat/create
Chat(String, Sender<Result<ChatResponse, ChatError>>),
Collaborator

Same as ChatCompletionData::response. Could we use a struct that implements serde to replace this string?

Collaborator Author

That's a fair idea. Will change it 👍

Collaborator Author

FYI, I'm holding off on this change because I raised a requirements question on Discord (visible to @juntao and others).

In LM Studio I see different inference parameters than the ones defined in https://platform.openai.com/docs/api-reference/chat/create, so it's not clear to me which set we want to use.

In any case, let's make this explicit here using Rust structs.

Collaborator Author

I decided to move forward with this idea, although I will confirm in tomorrow's call that our UI will use them.
Now OpenAI fields are part of structs/enums in our protocol.

Comment on lines 7 to 27
pub struct ChatRequestData {
// Not really necessary, but it is part of the OpenAI API. We are going to send the id
// of the model currently loaded.
pub model: ModelID,

pub frequency_penalty: Option<f32>,
pub logprobs: Option<bool>,
pub top_logprobs: Option<u32>,
pub max_tokens: Option<u32>,
pub presence_penalty: Option<f32>,
pub seed: Option<u32>,
pub stop: Option<Vec<String>>,
pub stream: Option<bool>,
pub temperature: Option<f32>,
pub top_p: Option<f32>,

// Adding the following fields since they are part of the OpenAI API,
// but they are not likely to be used in the first version of the client
pub n: Option<u32>,
pub logit_bias: Option<HashMap<String, f32>>,
}
Collaborator

This request is incorrect; it's missing the user input. It seems to be a response from OpenAI, not a request.

Collaborator Author

Correct, it's missing the messages field, the first one defined here https://platform.openai.com/docs/api-reference/chat/object. It's the most important part of the request for sure!

Adding it now.

Note that the rest of the fields are actually part of the request (not the response), according to this OpenAI definition.
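
Concretely, the fix is adding the messages array to the request struct, something like the sketch below; the final shape is in the diff.

```rust
use serde::{Deserialize, Serialize};

// Sketch of the fix; the final shape lives in the PR diff.
pub struct ChatRequestData {
    pub messages: Vec<Message>, // the system/user/assistant turns -- the missing piece
    pub model: ModelID,
    // ...the remaining OpenAI request fields listed above stay as they are...
}

pub struct Message {
    pub content: String,
    pub role: Role,
}

#[derive(Serialize, Deserialize, Debug)]
pub enum Role {
    #[serde(rename = "system")]
    System,
    #[serde(rename = "user")]
    User,
    #[serde(rename = "assistant")]
    Assistant,
}
```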

Collaborator Author

Done! Thanks for catching this issue.

Comment on lines +58 to +65
pub enum StopReason {
#[serde(rename = "stop")]
Stop,
#[serde(rename = "length")]
Length,
#[serde(rename = "content_filter")]
ContentFilter
}
Collaborator

Could we merge ChatError into StopReason, as they essentially represent the same thing?

Collaborator Author

Yes! ChatError is removed now. The stop reason is embedded inside the response struct (based on the OpenAI format).

@DarumaDocker (Collaborator)

@jmbejar I notice that the phrase Compatibility Guess appears in two places in the Figma UI.

  • An option in the Filter By dropdown. Two questions:
    1. What information is needed to filter the models? Right now the requires field only contains info about RAM. Is CPU or GPU info also needed?
    2. Does this filtering happen on the front end or on the back end? I recommend making it happen on the front end, because when the Show All option is selected you still need to check the compatibility on the front end.
  • A column of the files list. Similar to the first question above: what extra information should be stored for each file to check its compatibility?

@jmbejar (Collaborator, Author) commented Mar 5, 2024

> @jmbejar I notice that the phrase Compatibility Guess appears in two places in the Figma UI. […]

Good question. I'm still unsure whether this is something we want to do in the backend or the frontend, but I'd say there is not much difference, since this seems to be a "guess" based on the model file size and the local machine's available memory (or GPU memory). But I don't know what LM Studio does, after all.

Here are some pics of what I see in the app:

(screenshots: LM Studio compatibility guess indicators)

Again, I guess the compatibility check is all based on file size, but I could be wrong.

If you are OK, let's decide this is a frontend concern for now. I will try to implement it based on what we have now and see if it's good enough. We can always iterate the communication protocol if the backend needs to provide something else.
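
If we go that route, the frontend check could be as simple as the heuristic below. To be clear, this is pure guesswork about what LM Studio does: the 80% headroom and the "file size vs. available RAM" rule are assumptions, not a confirmed algorithm.

```rust
// Hypothetical frontend-side heuristic; the headroom factor and the
// "file size vs. available RAM" rule are assumptions, not LM Studio's algorithm.
pub enum CompatibilityGuess {
    PossiblySupported,
    NotSupported,
}

pub fn guess_compatibility(file_size_bytes: u64, available_ram_bytes: u64) -> CompatibilityGuess {
    // Leave some headroom for the context buffer and the rest of the app.
    let headroom = (available_ram_bytes as f64) * 0.8;
    if (file_size_bytes as f64) <= headroom {
        CompatibilityGuess::PossiblySupported
    } else {
        CompatibilityGuess::NotSupported
    }
}
```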

@jmbejar (Collaborator, Author) commented Mar 5, 2024

> @jmbejar I notice that the phrase Compatibility Guess appears in two places in the Figma UI. […]

I'm OK with implementing this specific filtering in the frontend 👍

@jmbejar jmbejar merged commit c4a081c into main Mar 5, 2024
@jmbejar jmbejar deleted the protocol branch March 5, 2024 18:55