First take on a complete set of commands and responses #1
Conversation
@DarumaDocker Please have a look and confirm that your backend database can deliver those data fields. Thanks.
In the screenshot from LMStudio, not all of the fields can be fetched from the HuggingFace API.
Yes, I think we need to maintain a model database with additional data on our side. Do you want those fields to be explicitly defined in the channel enums?
moxin-protocol/src/protocol.rs
Outdated
pub struct ModelExecutionConfig {
    pub cpu_threads: u32,
    pub gpu_layers: GPULayers,
    pub use_mlock: bool,
    pub n_batch: u32,
    pub n_ctx: u32,
    pub rope_freq_scale: f32,
    pub rope_freq_base: f32,

    // TBD: Not really sure if this is something the backend manages or if it is a matter of
    // the client (if it is done by tweaking the JSON payload for the chat completion)
    pub context_overflow_policy: ContextOverflowPolicy,
}
I would like you to take a look at ModelExecutionConfig, referencing LoadModel, where reverse_prompt and prompt_template are important parameters for some models.
I got it. I think it is a valid alternative, as long as it makes sense on your end. I will change it 👍
https://github.com/WasmEdge/WasmEdge/blob/7c13bee99c796317fc68d5bf46d737f9b51d06fe/plugins/wasi_nn/ggml.cpp#L52-L57
We have these parameters, but we didn't see min-p and top-k. Are these required?
Let's define all the parameters in the protocol. If some are not yet supported on LlamaEdge, we can ignore them in the backend for now. Thanks.
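To illustrate the "define everything, ignore what is unsupported" approach, a sketch where every sampling knob is optional so the backend can skip the ones LlamaEdge does not expose yet (the struct and field names are assumptions, not the PR's final list):

// Sketch only: all sampling parameters are optional, so the backend can
// ignore the ones it does not support yet (e.g. min_p and top_k for now).
pub struct SamplingParams {
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,
    pub top_k: Option<u32>,
    pub min_p: Option<f32>,
    pub repeat_penalty: Option<f32>,
}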
These changes are now implemented
moxin-protocol/src/protocol.rs
Outdated
FileDownloadProgress(FileID, f32), // Progress value from 0.0 to 1.0
FileDownloaded(File), // The downloaded_path field is filled

LoadModelProgress(FileID, f32), // Progress value from 0.0 to 1.0
Do we currently have to provide progress during the model loading process, or is it sufficient to have a message when loading is complete? Because currently, we are unable to provide progress during the model loading process.
The idea, as I understand it, is to try to implement everything we see in LM Studio, and this is something they implemented.
I will make a note to discuss it in our next weekly call, but for now let's keep this requirement for the future.
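As a possible stopgap while the backend cannot report intermediate progress, it could emit a single final progress value right before the completion message so the frontend keeps one code path; a sketch with hypothetical enum and variant names (only LoadModelProgress appears in the protocol excerpt above):

use std::sync::mpsc::Sender;

// Hypothetical names: whatever enum carries LoadModelProgress in the protocol.
fn notify_model_loaded(tx: &Sender<Response>, file_id: FileID) {
    // No real progress is available yet, so jump straight to 100%...
    let _ = tx.send(Response::LoadModelProgress(file_id.clone(), 1.0));
    // ...and then report completion.
    let _ = tx.send(Response::ModelLoaded(file_id));
}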
moxin-protocol/src/protocol.rs
Outdated
pub port: u16,
pub cors: bool,
Our server is not an HTTP server. Does it need to provide HTTP services?
One relevant feature in LM Studio is the ability to spawn a local web server to interact with the model via curl or scripts.
@juntao What do you think? Maybe it is a topic to cover in the next weekly call, since this is a feature we were not talking about in those calls.
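If the feature stays, the protocol piece could be as small as a config plus start/stop commands; a hypothetical sketch (whether the backend or the frontend actually spawns the HTTP server is exactly the open question here):

// Hypothetical sketch of a local-server section of the protocol.
pub struct LocalServerConfig {
    pub port: u16,
    pub cors: bool,
}

pub enum LocalServerCommand {
    Start(LocalServerConfig),
    Stop,
}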
Is it acceptable to include the sender for the response as part of the request, similar to PR2?
@jmbejar I have a question. From the
But I'm not able to fetch all these fields from HuggingFace's API. Especially for the
So we will provide an SDK for you to update this information manually.
Yeah, @juntao has raised the topic in today's call. I would say the answer is "yes" (although I don't feel I'm the decision-maker here!) and we have to find a way. But it is understood that this is more complicated than anticipated.
I was thinking about this idea, and yes, it should be a better approach for both ends. Let me rearrange things based on your other reference PR.
Idea implemented in this PR now 👍
@L-jasmine I ported part of the suggested changes demonstrated in #2. Thanks for the ideas and examples! What I did:
@L-jasmine @juntao In addition, I added Result wrappers to responses so the backend can inform if something is wrong. In some cases it is just a way to have an ACK, for example for the command
Please let me know if you have comments on this addition.
moxin-protocol/src/protocol.rs
Outdated
speed: f32,
gpu_layers: u32,
cpu_threads: u32,
mlock: bool,
token_count: u32,
token_limit: u32,
Why is it necessary to include this metadata in ChatCompletion?
This is a feature in LMStudio:
The numbers you see below are stats about the last chat interaction with the model, I understand.
Not saying we have to implement everything at once! However, we are trying to capture here everything we would need to mimic LMStudio features. For instance, I would say that it is OK if the initial versions of the backend implementation do not care about those fields (leaving them at zero or default values).
Ok, I see. My understanding is that gpu_layers, cpu_threads, and mlock come from the frontend. The frontend is aware of them and their changes.
Good point.. so I can simply use the values we sent in the first place. Makes sense 👍
moxin-protocol/src/protocol.rs
Outdated
pub struct ChatCompletionData {
    // The response from the model in JSON format
    // https://platform.openai.com/docs/api-reference/chat/object
    response: String,
Could we use a struct that implements serde to replace this string?
This would allow us to have a clear understanding of the specific contents during development.
Done!
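For reference, a typed version could look roughly like this; the field names follow the OpenAI chat completion object and are an assumption here, not necessarily the exact structs that landed in the PR:

use serde::{Deserialize, Serialize};

// Illustrative only: a typed stand-in for the raw JSON string, modeled on
// https://platform.openai.com/docs/api-reference/chat/object
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct ChatCompletionData {
    pub id: String,
    pub created: u32,
    pub model: String,
    pub choices: Vec<ChoiceData>,
    pub usage: UsageData,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct ChoiceData {
    pub index: u32,
    pub message: MessageData,
    pub finish_reason: StopReason,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct MessageData {
    pub role: String,
    pub content: String,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct UsageData {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
}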
moxin-protocol/src/protocol.rs
Outdated
EjectModel(Sender<Result<()>>),

// The argument is the chat message in JSON format, following https://platform.openai.com/docs/api-reference/chat/create
Chat(String, Sender<Result<ChatResponse, ChatError>>),
Same as ChatCompletionData::response. Could we use a struct that implements serde to replace this string?
That's a fair idea. Will change it 👍
FYI I'm on hold with this change because I raised a requirements question on Discord (visible to @juntao and others).
In LMStudio I have different inference parameters than the ones defined in https://platform.openai.com/docs/api-reference/chat/create, so it's not clear to me which set we want to use.
In any case, let's change this to be explicit using Rust structs.
I decided to move forward with this idea, although I will confirm in tomorrow's call that our UI will use them.
Now OpenAI fields are part of structs/enums in our protocol.
moxin-protocol/src/open_ai.rs
Outdated
pub struct ChatRequestData {
    // Not really necessary, but it is part of the OpenAI API. We are going to send the id
    // of the model currently loaded.
    pub model: ModelID,

    pub frequency_penalty: Option<f32>,
    pub logprobs: Option<bool>,
    pub top_logprobs: Option<u32>,
    pub max_tokens: Option<u32>,
    pub presence_penalty: Option<f32>,
    pub seed: Option<u32>,
    pub stop: Option<Vec<String>>,
    pub stream: Option<bool>,
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,

    // Adding the following fields since they are part of the OpenAI API,
    // but they are not likely to be used in the first version of the client
    pub n: Option<u32>,
    pub logit_bias: Option<HashMap<String, f32>>,
}
This request is incorrect; it's missing the user input. It seems to be a response from OpenAI, not a request.
Correct, it's missing the messages field, the first one defined here: https://platform.openai.com/docs/api-reference/chat/object. It's the most important part of the request for sure!
Adding it now.
Note that the rest of the fields are actually part of the request (not the response), according to this OpenAI definition.
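A hedged sketch of the missing piece, following the OpenAI request format (the Role enum and the field names are assumptions, not necessarily what the PR ended up with):

use serde::{Deserialize, Serialize};

// Illustrative only: the system/user/assistant conversation that was missing
// from ChatRequestData, per https://platform.openai.com/docs/api-reference/chat/create
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Message {
    pub role: Role,
    pub content: String,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum Role {
    System,
    User,
    Assistant,
}

// ChatRequestData would then gain:
//     pub messages: Vec<Message>,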
Done! Thanks for catching this issue.
pub enum StopReason {
    #[serde(rename = "stop")]
    Stop,
    #[serde(rename = "length")]
    Length,
    #[serde(rename = "content_filter")]
    ContentFilter,
}
Could we merge ChatError into StopReason, as they essentially represent the same thing?
Yes! ChatError is removed now. The stop reason is embedded inside the response struct (based on the OpenAI format).
@jmbejar I notice that the word
Good question. I'm still unsure if this is something we want to do in the backend or the frontend, but I'd say there is not much difference, since this seems to be a "guess" based on the model file size and the local machine's available memory (or GPU memory). But I don't know what LMStudio does after all. Here are some pics of what I see in the app:
Again, I guess the compatibility is all based on file size, but I could be wrong. If you are OK with it, let's decide this is a frontend concern for now. I will try to implement it based on what we have now and see if it's good enough. We can always iterate on the communication protocol if the backend needs to provide something else.
I'm OK with implementing this specific filtering in the frontend 👍
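If the frontend owns this, a first pass could simply compare the model file size with the machine's available memory; a sketch under that assumption (the headroom factor and where the memory figure comes from are placeholders):

// Frontend-side sketch: guess compatibility from file size vs. available memory.
// The 1.2 headroom factor is an arbitrary placeholder.
fn guess_compatibility(file_size_bytes: u64, available_memory_bytes: u64) -> bool {
    (file_size_bytes as f64) * 1.2 <= available_memory_bytes as f64
}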
This is my proposal for the interaction between the frontend and backend parts of the system.
For most of the fields, I'm leaving inline comments so there is some reference.
However, in order to clarify some parts of the protocol, I'm attaching here screenshots from LMStudio.
Model information
In this PR, the field information in LoadedModelInfo and DownloadedFile is meant to store this kind of information.

Context Overflow Policy
You will see a ContextOverflowPolicy enum for this choice. Not really sure what LMStudio does with it, but there is a chance that the affected part is the request we have to send to the chat model. If that's the case, the frontend would take care of it and, as a consequence, we won't need to have this enum in the protocol.
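For illustration, the enum is meant to mirror the LM Studio setting; the variant names below are a guess, not necessarily the ones defined in this PR:

// Illustrative variant names only, mirroring the LM Studio "context overflow" choices.
pub enum ContextOverflowPolicy {
    StopAtLimit,
    TruncateMiddle,
    RollingWindow,
}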