
First take on a complete set of commands and responses #1

Merged · 9 commits into main from protocol · Mar 5, 2024

Conversation

@jmbejar (Collaborator) commented Feb 22, 2024

This is my proposal for the interaction between the frontend and backend parts of the system.

For most of the fields, I'm leaving inline comments as a reference.
However, to clarify some parts of the protocol, I'm attaching screenshots from LM Studio here.

Model information

(screenshot: LM Studio model information panel)

In this PR, the field information in LoadedModelInfo and DownloadedFile is meant to store this kind of information.

Context Overflow Policy

(screenshot: LM Studio "Context Overflow Policy" options)

You will see a ContextOverflowPolicy enum for this choice. I'm not really sure what LM Studio does with it, but there is a chance that the affected part is the request we make to the chat model. If that's the case, the frontend would take care of it and, as a consequence, we wouldn't need this enum in the protocol.
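
For reference, a minimal sketch of how this enum could look; the variant names below are just my guess at mapping the LM Studio options, nothing final:

```rust
// Sketch only; variant names are my guess at mapping the LM Studio options.
#[derive(Clone, Debug)]
pub enum ContextOverflowPolicy {
    StopAtLimit,          // "Stop at limit"
    TruncateMiddle,       // "Truncate middle"
    TruncatePastMessages, // "Maintain a rolling window and truncate past messages"
}
```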

@juntao (Collaborator) commented Feb 22, 2024

@DarumaDocker Please have a look and confirm that your backend database can deliver those data fields. Thanks.

@DarumaDocker (Collaborator) commented Feb 23, 2024

> @DarumaDocker Please have a look and confirm that your backend database can deliver those data fields. Thanks.

Not all of the fields in the LM Studio screenshot can be fetched from the HuggingFace API.
So maybe I should add a field named 'model_information' for someone to maintain manually. The 'prompt_template_type' and 'reverse_prompt' fields should also be moved into this new field.
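
A rough sketch of what I have in mind (field and struct names are tentative, just to illustrate the idea):

```rust
// Tentative sketch; field and struct names are not final.
pub struct DownloadedFile {
    // ...fields we can fetch from the HuggingFace API...

    // Manually maintained data that HuggingFace does not provide.
    pub model_information: ModelInformation,
}

pub struct ModelInformation {
    pub prompt_template_type: String,
    pub reverse_prompt: String,
    // ...plus the other fields shown in the LM Studio screenshot...
}
```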

@juntao (Collaborator) commented Feb 23, 2024

Yes, I think we need to maintain a model database with additional data on our side.

Do you want those fields to be explicitly defined in the channel enums?

Comment on lines 18 to 30
pub struct ModelExecutionConfig {
pub cpu_threads: u32,
pub gpu_layers: GPULayers,
pub use_mlock: bool,
pub n_batch: u32,
pub n_ctx: u32,
pub rope_freq_scale: f32,
pub rope_freq_base: f32,

// TBD Not really sure if this is something the backend manages or if it is a matter of
// the client (if it is done by tweaking the JSON payload for the chat completion)
pub context_overflow_policy: ContextOverflowPolicy
}
Collaborator

I would like you to take a look at ModelExecutionConfig, referencing LoadModel, where reverse_prompt and prompt_template are important parameters for some models.

Collaborator Author

I got it. I think it is a valid alternative, as long as it makes sense on your end. I will change it 👍
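
Roughly what I have in mind; the exact shape is to be confirmed when I push the change:

```rust
// Sketch of the direction only; exact shape to be confirmed in the actual commit.
pub struct ModelExecutionConfig {
    pub cpu_threads: u32,
    pub gpu_layers: GPULayers,
    pub use_mlock: bool,
    pub n_batch: u32,
    pub n_ctx: u32,
    pub rope_freq_scale: f32,
    pub rope_freq_base: f32,

    // Moved here so they travel with the load/execution settings,
    // since some models need them when the model is loaded.
    pub prompt_template: Option<String>,
    pub reverse_prompt: Option<String>,
}
```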

Collaborator Author

BTW, looking again at LM Studio, I found that cpu_threads is something that can change with each chat interaction, so if you're OK I'm going to remove it from this list and include it as part of the JSON payload for each chat message (where other related values such as temp and n_predict go).

(screenshot: LM Studio per-message inference parameters, including CPU Threads)

Collaborator

https://github.com/WasmEdge/WasmEdge/blob/7c13bee99c796317fc68d5bf46d737f9b51d06fe/plugins/wasi_nn/ggml.cpp#L52-L57
We have these parameters, but we don't see min-p and top-k. Are these required?

Collaborator

Let's define all the parameters in the protocol. If some are not yet supported on LlamaEdge, we can ignore them in the backend for now. Thanks.

Collaborator Author

These changes are now implemented

FileDownloadProgress(FileID, f32), // Progress value from 0.0 to 1.0
FileDownloaded(File), // The downloaded_path field is filled

LoadModelProgress(FileID, f32), // Progress value from 0.0 to 1.0
Collaborator

Do we have to provide progress updates during the model loading process, or is it sufficient to send a message when loading is complete? Currently, we are unable to report progress while the model is loading.

Collaborator Author

As I understand it, the idea is to try to implement everything we see in LM Studio, and this is something they implemented.

I will make a note to discuss it in our next weekly call, but yes, let's keep this requirement for the future.

Comment on lines 34 to 35
pub port: u16,
pub cors: bool,
Collaborator

Our server is not an HTTP server. Does it need to provide HTTP services?

Collaborator Author

One relevant feature in LM Studio is the ability to spawn a local web server to interact with the model via CURL or scripts.

@juntao What do you think? Maybe it is a topic to cover in the next weekly call, since this is a feature we were not talking about in those calls.

@L-jasmine (Collaborator)

Is it acceptable to include the sender for the response as part of the request, similar to PR2?

@DarumaDocker (Collaborator)

@jmbejar I have a question. From fake_data.rs, I see you need these fields of the model:

Model {
    id: "teknium/OpenHermes-2.5-Mistral-7B".to_string(),
    name: "OpenHermes 2.5 Mistral 7B".to_string(),
    summary: "OpenHermes 2.5 Mistral 7B is an advanced iteration of the OpenHermes 2 language model, enhanced by training on a significant proportion of code datasets. This additional training improved performance across several benchmarks, notably TruthfulQA, AGIEval, and the GPT4All suite, while slightly decreasing the BigBench score. Notably, the model's ability to handle code-related tasks, measured by the humaneval score...".to_string(),
    size: "7B params".to_string(),
    requires: "8GB+ RAM".to_string(),
    released_at: NaiveDate::from_ymd_opt(2023, 10, 29).unwrap(),
    architecture: "Mistral".to_string(),
    files: open_hermes_files,
    author: Author {
        name: "Teknium".to_string(),
        url: "https://github.com/teknium1".to_string(),
        description: "Creator of numerous chart topping fine-tunes and a Co-founder of NousResearch.".to_string(),
    },
    like_count: 10,
    download_count: 5003,
}

But I'm not able to fetch all these fields from HuggingFace's API. For the files field in particular, I can only get the file name, not the size, quantization, etc.

So we will provide an SDK for you to update this information manually.
Does it make sense to you?

@jmbejar (Collaborator, Author) commented Feb 27, 2024

> @jmbejar I have a question. From fake_data.rs, I see you need these fields of the model: […] So we will provide an SDK for you to update this information manually. Does it make sense to you?

Yeah, @juntao raised the topic in today's call. I would say the answer is "yes" (although I don't feel I'm the decision-maker here!) and we have to find a way. But it is understood that this is more complicated than anticipated.

@jmbejar (Collaborator, Author) commented Feb 28, 2024

> Is it acceptable to include the sender for the response as part of the request, similar to PR2?

I was thinking about this idea, and yes, it should be a better approach for both ends. Let me rearrange things based on your other reference PR.

@jmbejar (Collaborator, Author) commented Feb 28, 2024

> Is it acceptable to include the sender for the response as part of the request, similar to PR2?

Idea implemented in this PR now 👍

@jmbejar (Collaborator, Author) commented Feb 28, 2024

> 1. I replaced std::sync::mpsc with crossbeam, as I believe it offers better performance.
> 2. I added some new commands: Chat and LoadModel.
> 3. I moved the Sender for sending responses to the Command, which may make it clearer which command the response is coming from.

@L-jasmine I ported part of the suggested changes demonstrated in #2. Thanks for the ideas and examples!

What I did:

  • Implemented (3) as you suggested; it is a good idea for sure. In some cases I had to add specific response enums because the same channel will be used in a long-lived fashion (e.g. the Chat command receives an initial response when the model is loaded, but then we expect periodic updates about resource usage (RAM, CPU)).

  • Partially implemented what you did in (2), but I'm still sticking to the decision of having chat messages in the OpenAI API format. I mean, the request is basically a string with a JSON representation of the "inference request", and we will receive a JSON-encoded response (according to https://platform.openai.com/docs/api-reference/chat/object and https://platform.openai.com/docs/api-reference/chat/streaming). It is more complex than only sending token by token, but I feel it is more of a standard, as @juntao suggested in our Discord channel (see the sketch after this list).

  • I don't feel adding crossbeam would be a gain for now, since we're not using any of its extra features. In fact, newer Rust versions have a similar implementation (see Merge crossbeam-channel into std::sync::mpsc rust-lang/rust#93563 and Implement changes from std crossbeam-rs/crossbeam#935). In any case, if you feel strongly about this, we can add it without further discussion and measure performance and build size later.
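
To make the second bullet concrete, here is the kind of JSON payload I mean, built with serde_json purely for illustration. The model id and parameter values are made up, and the exact set of inference parameters is still under discussion:

```rust
use serde_json::json;

// Illustration only; the model id and parameter values here are made up,
// and the exact set of inference parameters is still under discussion.
fn example_chat_request() -> String {
    json!({
        "model": "open-hermes-2.5-mistral-7b",
        "messages": [
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": "Hello!" }
        ],
        "temperature": 0.7,
        "stream": true
    })
    .to_string()
}
```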

@jmbejar (Collaborator, Author) commented Feb 28, 2024

@L-jasmine @juntao In addition, I added Result wrappers to responses so the backend can report when something goes wrong. In some cases it is just a way to have an ACK, for example for the EjectModel command.

Please let me know if you have comments on this addition.
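
For illustration, the frontend side of that ACK would look roughly like this. It assumes the Command::EjectModel(Sender<Result<()>>) variant defined in this PR; the function name and error handling are made up for the example:

```rust
use std::sync::mpsc::{channel, Sender};

// Illustration only; Command::EjectModel(Sender<Result<()>>) is the variant
// defined in this PR, everything else is made up for the example.
fn eject_model(command_sender: &Sender<Command>) {
    let (tx, rx) = channel();
    command_sender
        .send(Command::EjectModel(tx))
        .expect("backend thread is gone");

    match rx.recv() {
        Ok(Ok(())) => println!("model ejected"),                       // the ACK case
        Ok(Err(_err)) => eprintln!("backend reported an eject error"), // inspect _err as needed
        Err(_) => eprintln!("backend dropped the response channel"),
    }
}
```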

Comment on lines 87 to 92
speed: f32,
gpu_layers: u32,
cpu_threads: u32,
mlock: bool,
token_count: u32,
token_limit: u32,
Collaborator

Why is it necessary to include this metadata in ChatCompletion?

Collaborator Author (@jmbejar), Feb 28, 2024

This is a feature in LMStudio:

(screenshot: LM Studio stats shown under a chat response)

As I understand it, the numbers you see below are stats about the last chat interaction with the model.

Not saying we have to implement everything at once! However, we are trying to capture here everything we would need to mimic LM Studio's features. For instance, I would say it is OK if the initial versions of the backend implementation do not care about those fields (leaving them at zero or a default value).

Collaborator

Ok, I see. My understanding is that gpu_layers, cpu_threads, and mlock come from the frontend. The frontend is aware of them and their changes.

Collaborator Author

> Ok, I see. My understanding is that gpu_layers, cpu_threads, and mlock come from the frontend. The frontend is aware of them and their changes.

Good point.. so I can simply use the values we sent in the first place. Makes sense 👍

pub struct ChatCompletionData {
// The response from the model in JSON format
// https://platform.openai.com/docs/api-reference/chat/object
response: String,
Collaborator

Could we use a struct that implements serde to replace this string?
This would allow us to have a clear understanding of the specific contents during development.

Collaborator Author

Done!
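
For readers following along, the replacement looks roughly like this: a serde-derived struct mirroring the OpenAI chat object. This is a sketch only; the authoritative definitions are in the PR diff.

```rust
use serde::{Deserialize, Serialize};

// Sketch mirroring https://platform.openai.com/docs/api-reference/chat/object;
// the authoritative definitions are in the PR diff.
#[derive(Serialize, Deserialize, Debug)]
pub struct ChatCompletionData {
    pub id: String,
    pub choices: Vec<ChoiceData>,
    pub created: u32,
    pub model: String, // or ModelID, as defined elsewhere in the protocol
    pub usage: UsageData,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct ChoiceData {
    pub finish_reason: StopReason, // the StopReason enum defined below in this PR
    pub index: u32,
    pub message: MessageData,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct MessageData {
    pub content: String,
    pub role: String,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct UsageData {
    pub completion_tokens: u32,
    pub prompt_tokens: u32,
    pub total_tokens: u32,
}
```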

EjectModel(Sender<Result<()>>),

// The argument is the chat message in JSON format, following https://platform.openai.com/docs/api-reference/chat/create
Chat(String, Sender<Result<ChatResponse, ChatError>>),
Collaborator

Same as ChatCompletionData::response. Could we use a struct that implements serde to replace this string?

Collaborator Author

That's a fair idea. Will change it 👍

Collaborator Author

FYI, I'm holding off on this change because I raised a requirements question on Discord (visible to @juntao and others).

In LM Studio I see different inference parameters than the ones defined in https://platform.openai.com/docs/api-reference/chat/create, so it's not clear to me which set we want to use.

In any case, let's make this explicit here using Rust structs.

Collaborator Author

I decided to move forward with this idea, although I will confirm in tomorrow's call that our UI will use them.
Now OpenAI fields are part of structs/enums in our protocol.

Comment on lines 7 to 27
pub struct ChatRequestData {
// Not really necessary, but it is part of the OpenAI API. We are going to send the id
// of the model currently loaded.
pub model: ModelID,

pub frequency_penalty: Option<f32>,
pub logprobs: Option<bool>,
pub top_logprobs: Option<u32>,
pub max_tokens: Option<u32>,
pub presence_penalty: Option<f32>,
pub seed: Option<u32>,
pub stop: Option<Vec<String>>,
pub stream: Option<bool>,
pub temperature: Option<f32>,
pub top_p: Option<f32>,

// Adding the following fields since they are part of the OpenAI API,
// but they are not likely to be used in the first version of the client
pub n: Option<u32>,
pub logit_bias: Option<HashMap<String, f32>>,
}
Collaborator

This request is incorrect; it's missing the user input. It seems to be a response from OpenAI, not a request.

Collaborator Author

Correct, it's missing the messages field, the first one defined here https://platform.openai.com/docs/api-reference/chat/object. It's the most important part of the request for sure!

Adding it now.

Note that the rest of the fields are actually part of the request (not the response), according to this OpenAI definition.
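
Concretely, the fix is adding the messages array to the request struct, something like the sketch below; the final shape is in the diff.

```rust
use serde::{Deserialize, Serialize};

// Sketch of the fix; the final shape lives in the PR diff.
pub struct ChatRequestData {
    pub messages: Vec<Message>, // the system/user/assistant turns -- the missing piece
    pub model: ModelID,
    // ...the remaining OpenAI request fields listed above stay as they are...
}

pub struct Message {
    pub content: String,
    pub role: Role,
}

#[derive(Serialize, Deserialize, Debug)]
pub enum Role {
    #[serde(rename = "system")]
    System,
    #[serde(rename = "user")]
    User,
    #[serde(rename = "assistant")]
    Assistant,
}
```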

Collaborator Author

Done! Thanks for catching this issue.

Comment on lines +58 to +65
pub enum StopReason {
#[serde(rename = "stop")]
Stop,
#[serde(rename = "length")]
Length,
#[serde(rename = "content_filter")]
ContentFilter
}
Collaborator

Could we merge ChatError into StopReason, as they essentially represent the same thing?

Collaborator Author

Yes! ChatError is removed now. The stop reason is embedded inside the response struct (based on the OpenAI format).

@DarumaDocker (Collaborator)

@jmbejar I notice that the phrase Compatibility Guess appears in two places in the Figma UI.

  • An option in the Filter By dropdown. Two questions:
    1. What information is needed to filter the models? Right now the requires field only contains info about RAM. Is CPU or GPU info also needed?
    2. Does this filtering happen on the front end or on the back end? I recommend making it happen on the front end, because when the Show All option is selected you still need to check the compatibility on the front end.
  • A column of the files list. Similar to the first question above: what extra information should be stored for each file to check its compatibility?

@jmbejar (Collaborator, Author) commented Mar 5, 2024

> @jmbejar I notice that the phrase Compatibility Guess appears in two places in the Figma UI. […]

Good question. I'm still unsure whether this is something we want to do in the backend or the frontend, but I'd say there is not much difference, since this seems to be a "guess" based on the model file size and the local machine's available memory (or GPU memory). But I don't know what LM Studio does, after all.

Here are some pics of what I see in the app:

(screenshots: LM Studio compatibility guess indicators)

Again, I guess the compatibility check is all based on file size, but I could be wrong.

If you are OK, let's decide this is a frontend concern for now. I will try to implement it based on what we have now and see if it's good enough. We can always iterate the communication protocol if the backend needs to provide something else.
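
If we go that route, the frontend check could be as simple as the heuristic below. To be clear, this is pure guesswork about what LM Studio does: the 80% headroom and the "file size vs. available RAM" rule are assumptions, not a confirmed algorithm.

```rust
// Hypothetical frontend-side heuristic; the headroom factor and the
// "file size vs. available RAM" rule are assumptions, not LM Studio's algorithm.
pub enum CompatibilityGuess {
    PossiblySupported,
    NotSupported,
}

pub fn guess_compatibility(file_size_bytes: u64, available_ram_bytes: u64) -> CompatibilityGuess {
    // Leave some headroom for the context buffer and the rest of the app.
    let headroom = (available_ram_bytes as f64) * 0.8;
    if (file_size_bytes as f64) <= headroom {
        CompatibilityGuess::PossiblySupported
    } else {
        CompatibilityGuess::NotSupported
    }
}
```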

@jmbejar (Collaborator, Author) commented Mar 5, 2024

> @jmbejar I notice that the phrase Compatibility Guess appears in two places in the Figma UI. […]

I'm OK with implementing this specific filtering in the frontend 👍

@jmbejar jmbejar merged commit c4a081c into main Mar 5, 2024
@jmbejar jmbejar deleted the protocol branch March 5, 2024 18:55