EXAONE 3.0 Model Support #1258
python3 scripts/playground/reference_hf.py --model [new model]
python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code
Thank you for your contribution! Can you provide the output of these two?
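The two commands requested in the review can be kept together in a small script so they are always run with identical arguments. The sketch below only wraps the command lines quoted above; the helper names `check_commands` and `run_checks` are hypothetical, and it assumes it is launched from the sglang repository root so the relative script path resolves. It defaults to a dry run that just prints the commands.

```python
import shlex
import subprocess
import sys


def check_commands(model: str) -> list[list[str]]:
    """Return the two verification commands from the review as argv lists."""
    return [
        # Hugging Face reference run (prints prefill logits and generations).
        ["python3", "scripts/playground/reference_hf.py", "--model", model],
        # sglang run with correctness output and 16 generated tokens.
        [
            "python3", "-m", "sglang.bench_latency",
            "--model", model,
            "--correct",
            "--output-len", "16",
            "--trust-remote-code",
        ],
    ]


def run_checks(model: str, dry_run: bool = True) -> None:
    """Print each command; also execute it when dry_run is False."""
    for argv in check_commands(model):
        print("$", shlex.join(argv))
        if not dry_run:
            # Assumption: the current directory is the sglang repo root.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default; call run_checks(model, dry_run=False) to launch both jobs.
    run_checks(sys.argv[1] if len(sys.argv) > 1 else "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct")
```

Passing argument lists (rather than one shell string) to `subprocess.run` avoids shell-quoting issues with model paths.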
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:17<00:00, 2.44s/it]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
prefill logits tensor([ 8.1250, 31.7500, 41.4062, ..., 11.3516, 10.3750, 10.2812],
device='cuda:0')
The capital of France is Paris.
10. **What is the currency of Japan?**
prefill logits tensor([ 8.4297, 31.8281, 40.5625, ..., 13.4844, 10.3047, 11.5312],
device='cuda:0')
The capital of the United Kindom is London.
London is a city in the United Kingdom.
London
prefill logits tensor([10.6875, 31.8125, 45.1562, ..., 14.3984, 9.7578, 11.7031],
device='cuda:0')
Today is a sunny day and I like to go for a walk in the park. The birds are chirping and the
Init nccl begin.
Load weight begin. avail mem=78.94 GB
Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:00<00:05, 1.05it/s]
Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:01<00:04, 1.02it/s]
Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:03<00:04, 1.04s/it]
Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:04<00:03, 1.06s/it]
Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:05<00:02, 1.08s/it]
Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:06<00:01, 1.17s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:07<00:00, 1.09it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:07<00:00, 1.00s/it]
Load weight end. type=ExaoneForCausalLM, dtype=torch.float16, avail mem=64.30 GB
Memory pool end. avail mem=9.35 GB
Capture cuda graph begin. This can take up to several minutes.
max_total_num_tokens=449152
input_ids=[[1320, 7304, 670, 9776, 772], [1320, 7304, 670, 629, 3573, 28574, 742, 772], [23102, 772, 619, 44212, 2471, 686, 768, 1664]]
prefill logits (first half) tensor([[11.4062, 36.0000, 45.5312, ..., 15.5234, 13.1875, 13.3203],
[ 5.3281, 37.0312, 56.6250, ..., 10.2266, 8.1562, 8.7422],
[ 7.9414, 34.8750, 49.4688, ..., 11.4766, 9.8984, 11.6797]],
device='cuda:0')
prefill logits (final) tensor([[ 8.1328, 31.7500, 41.4062, ..., 11.3594, 10.3750, 10.2812],
[ 8.4219, 31.8281, 40.5625, ..., 13.4766, 10.2969, 11.5234],
[10.6953, 31.8125, 45.1562, ..., 14.4141, 9.7656, 11.7188]],
device='cuda:0')
The capital of France is Paris.
10. **What is the currency of Japan?**
The capital of the United Kindom is London.
London is a city in the United Kingdom.
London is
Today is a sunny day and I like to go for a walk in the park. The birds are chirping and the flowers
thank you!
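The two logs can be compared numerically instead of by eye. The sketch below copies only the prefill logits that are actually printed (first three and last three columns per prompt; the middle of each row is elided in the logs), so it says nothing about the hidden vocabulary entries. The function names are illustrative. Because the sglang log reports `dtype=torch.float16`, it also expresses the gap in float16 units in the last place (ULPs).

```python
import math

# Visible prefill logits copied from the two logs above.
HF_REFERENCE = [
    [8.1250, 31.7500, 41.4062, 11.3516, 10.3750, 10.2812],
    [8.4297, 31.8281, 40.5625, 13.4844, 10.3047, 11.5312],
    [10.6875, 31.8125, 45.1562, 14.3984, 9.7578, 11.7031],
]
SGLANG_FINAL = [
    [8.1328, 31.7500, 41.4062, 11.3594, 10.3750, 10.2812],
    [8.4219, 31.8281, 40.5625, 13.4766, 10.2969, 11.5234],
    [10.6953, 31.8125, 45.1562, 14.4141, 9.7656, 11.7188],
]


def fp16_ulp(x: float) -> float:
    """Spacing between adjacent float16 values near x (normal range only)."""
    _, exp = math.frexp(abs(x))  # abs(x) = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 11)     # float16 stores 10 explicit mantissa bits


def compare(a, b):
    """Return (max absolute difference, max difference in float16 ULPs)."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    max_abs = max(abs(x - y) for x, y in pairs)
    max_ulps = max(abs(x - y) / fp16_ulp(max(abs(x), abs(y))) for x, y in pairs)
    return max_abs, max_ulps


if __name__ == "__main__":
    max_abs, max_ulps = compare(HF_REFERENCE, SGLANG_FINAL)
    print(f"max |HF - sglang| = {max_abs:.4f} ({max_ulps:.1f} fp16 ULPs)")
```

On these values the largest gap is about 0.0157, roughly two float16 ULPs in the [8, 16) range, which suggests rounding noise rather than a modeling difference; the matching greedy generations point the same way.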
overall LGTM
add Copyright 2023-2024 SGLang Team
add
# Copyright 2024 The LG AI Research EXAONE Lab. All rights reserved.
# Copyright 2024 The LG CNS AI Engineering Team.
@Deepfocused Thanks for your contribution! It has been merged.
@Deepfocused Could you also raise another PR to update the README? https://github.com/sgl-project/sglang/?tab=readme-ov-file#supported-models
Okay, I'll add it.
@Deepfocused Thanks for the contribution. Can you also recommend sglang as one of the model deployment methods in your Hugging Face model card? https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct
It's not something I can do right now, but I'll talk to my boss!
Motivation
Hello, I am deepfocused from the AI Engineering Team at LG CNS.
We want to integrate the recently released open model EXAONE-3.0-7.8B-Instruct into sglang.
Previously, we served earlier models such as EXAONE v1 and v2 to our customers through a custom sglang + EXAONE build. Unlike those models, EXAONE 3.0 was released as an open model, so we are requesting upstream integration.
Modifications
EXAONE 3.0 is based on the Llama architecture.
🐽🐽 Brief description of sglang/srt/models/exaone.py 🐽🐽
The newly released EXAONE v3 model has the same structure as Llama, but its variable names differ. Therefore, to run EXAONE v3 on sglang, several variable names and weight names in the Llama implementation had to be changed.
For example:
down_proj -> c_proj
gate_proj -> c_fc_0
up_proj -> c_fc_1
config.rms_norm_eps -> config.layer_norm_epsilon
There are a few more; refer to the code.
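The renamings listed above amount to a lookup table from Llama names to EXAONE names. The sketch below encodes only those four entries; the remaining differences ("a few more") are in sglang/srt/models/exaone.py and are not reproduced here. The table and function names are hypothetical illustrations, not the code in exaone.py.

```python
from types import SimpleNamespace

# Hypothetical tables holding only the renamings listed above.
LLAMA_TO_EXAONE_MODULES = {
    "gate_proj": "c_fc_0",
    "up_proj": "c_fc_1",
    "down_proj": "c_proj",
}
LLAMA_TO_EXAONE_CONFIG = {"rms_norm_eps": "layer_norm_epsilon"}


def to_exaone_name(llama_name: str) -> str:
    """Rename whole dot-separated segments that have an EXAONE counterpart.

    Matching whole segments instead of doing substring replacement leaves a
    name such as "gate_up_proj" untouched even though it contains "up_proj".
    """
    parts = llama_name.split(".")
    return ".".join(LLAMA_TO_EXAONE_MODULES.get(p, p) for p in parts)


def norm_eps(config) -> float:
    """Read the RMSNorm epsilon from an EXAONE-style config object."""
    return getattr(config, LLAMA_TO_EXAONE_CONFIG["rms_norm_eps"])


if __name__ == "__main__":
    print(to_exaone_name("mlp.gate_proj.weight"))
    print(norm_eps(SimpleNamespace(layer_norm_epsilon=1e-5)))
```

Prefixes such as the layer path are passed through unchanged here; any renaming of those would be one of the extra cases handled in the real model file.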
EXAONE 3.0 appears to focus on data and training methods built on the Llama architecture rather than on changes to the model structure.
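Since only names change, the per-layer math stays Llama's. As a minimal pure-Python sketch, here is the Llama-style gated MLP and RMSNorm written with the EXAONE names from the list above. The SiLU activation is assumed from Llama and is not stated in this PR; the function names are illustrative.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation, assumed as in Llama's gated MLP."""
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a row-major (out_features x in_features) matrix by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def exaone_mlp(x, c_fc_0, c_fc_1, c_proj):
    """Llama-style gated MLP using EXAONE weight names.

    c_fc_0 plays the role of gate_proj, c_fc_1 of up_proj, c_proj of down_proj.
    """
    gate = [silu(g) for g in matvec(c_fc_0, x)]
    up = matvec(c_fc_1, x)
    return matvec(c_proj, [g * u for g, u in zip(gate, up)])


def rms_norm(x, weight, layer_norm_epsilon):
    """RMSNorm; EXAONE's config calls the epsilon layer_norm_epsilon (Llama: rms_norm_eps)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + layer_norm_epsilon)
    return [w * v * inv_rms for w, v in zip(weight, x)]


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    print(exaone_mlp([1.0, -1.0], eye, eye, eye))
    print(rms_norm([3.0, 4.0], [1.0, 1.0], 1e-5))
```

Feeding Llama weights into `exaone_mlp` under the renamed arguments gives exactly the Llama MLP output, which is why renaming alone was enough for the port.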
Checklist