diff --git a/README.md b/README.md
index 91ac1d4134..61c0eba45b 100644
--- a/README.md
+++ b/README.md
@@ -115,6 +115,7 @@ For detailed inference benchmarks in more devices and more settings, please refe
Llama2 (7B - 70B)
Llama3 (8B, 70B)
Llama3.1 (8B, 70B)
+ Llama3.2 (1B, 3B)
InternLM (7B - 20B)
InternLM2 (7B - 20B)
InternLM2.5 (7B)
diff --git a/README_ja.md b/README_ja.md
index ea4480e282..999ebc9f0b 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -114,6 +114,7 @@ LMDeploy TurboMindエンジンは卓越した推論能力を持ち、さまざ
Llama2 (7B - 70B)
Llama3 (8B, 70B)
Llama3.1 (8B, 70B)
+ Llama3.2 (1B, 3B)
InternLM (7B - 20B)
InternLM2 (7B - 20B)
InternLM2.5 (7B)
diff --git a/README_zh-CN.md b/README_zh-CN.md
index cdddb64a22..f002899c60 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -116,6 +116,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型
Llama2 (7B - 70B)
Llama3 (8B, 70B)
Llama3.1 (8B, 70B)
+ Llama3.2 (1B, 3B)
InternLM (7B - 20B)
InternLM2 (7B - 20B)
InternLM2.5 (7B)
diff --git a/docs/en/quantization/w4a16.md b/docs/en/quantization/w4a16.md
index 3a04cd7b05..32dfe18d80 100644
--- a/docs/en/quantization/w4a16.md
+++ b/docs/en/quantization/w4a16.md
@@ -69,7 +69,7 @@ lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server_name {ip_addr} --serve
## Evaluation
-Please refer to [OpenCompass](https://opencompass.readthedocs.io/en/latest/index.html) about model evaluation with LMDeploy.
+Please refer to [OpenCompass](https://opencompass.readthedocs.io/en/latest/index.html) for model evaluation with LMDeploy. A step-by-step guide is available [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lmdeploy.html).
## Inference
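
For context on what the linked evaluation guide sets up: OpenCompass drives LMDeploy's TurboMind engine through a model wrapper declared in a Python config. Below is a minimal sketch, assuming the `TurboMindModel` wrapper and its `engine_config`/`gen_config` fields as described in the guide; the dataset import path and exact field names vary across OpenCompass releases, so treat them as assumptions and verify against the installed version.

```python
# Sketch of an OpenCompass config evaluating the 4-bit model produced above.
# TurboMindModel, engine_config and gen_config are assumptions based on the
# linked guide; check the installed OpenCompass for the current API.
from mmengine.config import read_base
from opencompass.models import TurboMindModel

with read_base():
    # assumed dataset config path; adjust to your OpenCompass checkout
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets

datasets = gsm8k_datasets
models = [
    dict(
        type=TurboMindModel,
        abbr='internlm2_5-7b-chat-4bit',
        path='./internlm2_5-7b-chat-4bit',  # quantized weights from the steps above
        engine_config=dict(model_format='awq', session_len=2048, tp=1),
        gen_config=dict(top_k=1, temperature=0.01, max_new_tokens=512),
        max_seq_len=2048,
        max_out_len=512,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )
]
```

The config would then be launched with OpenCompass's runner (e.g. `python run.py <config>.py`), as the linked guide describes.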
diff --git a/docs/en/supported_models/supported_models.md b/docs/en/supported_models/supported_models.md
index c992f730c8..260120efe0 100644
--- a/docs/en/supported_models/supported_models.md
+++ b/docs/en/supported_models/supported_models.md
@@ -10,6 +10,7 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
+| Llama3.2 | 3B | LLM | Yes | Yes | Yes | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -20,7 +21,6 @@ The following tables detail the models supported by LMDeploy's TurboMind engine
| Qwen2 | 1.5B - 72B | LLM | Yes | Yes | Yes | Yes |
| Mistral | 7B | LLM | Yes | Yes | Yes | - |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
-| Qwen2-VL | 2B, 7B, 72B | MLLM | Yes | Yes | Yes | - |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -49,6 +49,7 @@ The TurboMind engine doesn't support window attention. Therefore, for models tha
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | No | - |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | No | - |
| Llama3.2-VL | 8B, 90B | MLLM | Yes | Yes | Yes | No | - |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | - |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
diff --git a/docs/zh_cn/quantization/w4a16.md b/docs/zh_cn/quantization/w4a16.md
index b61b894781..d50e464af3 100644
--- a/docs/zh_cn/quantization/w4a16.md
+++ b/docs/zh_cn/quantization/w4a16.md
@@ -72,7 +72,7 @@ lmdeploy serve gradio ./internlm2_5-7b-chat-4bit --server-name {ip_addr} --serve
## 模型评测
-我们使用 [OpenCompass](https://opencompass.readthedocs.io/zh-cn/latest/index.html) 评测量化模型在各个维度上的能力
+我们使用 [OpenCompass](https://opencompass.readthedocs.io/zh-cn/latest/index.html) 评测量化模型在各个维度上的能力。方法请参考[此处](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/evaluation_lmdeploy.html)。
## 模型推理
diff --git a/docs/zh_cn/supported_models/supported_models.md b/docs/zh_cn/supported_models/supported_models.md
index 695103b52e..26930cf3ce 100644
--- a/docs/zh_cn/supported_models/supported_models.md
+++ b/docs/zh_cn/supported_models/supported_models.md
@@ -10,6 +10,7 @@
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | Yes |
+| Llama3.2 | 3B | LLM | Yes | Yes | Yes | Yes |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes |
| InternLM2.5 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -20,7 +21,6 @@
| Qwen2 | 1.5B - 72B | LLM | Yes | Yes | Yes | Yes |
| Mistral | 7B | LLM | Yes | Yes | Yes | - |
| Qwen-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
-| Qwen2-VL | 2B, 7B, 72B | MLLM | Yes | Yes | Yes | - |
| DeepSeek-VL | 7B | MLLM | Yes | Yes | Yes | Yes |
| Baichuan | 7B | LLM | Yes | Yes | Yes | Yes |
| Baichuan2 | 7B | LLM | Yes | Yes | Yes | Yes |
@@ -49,6 +49,7 @@ turbomind 引擎不支持 window attention。所以,对于应用了 window att
| Llama2 | 7B - 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3 | 8B, 70B | LLM | Yes | Yes | Yes | Yes | Yes |
| Llama3.1 | 8B, 70B | LLM | Yes | Yes | Yes | No | - |
+| Llama3.2 | 1B, 3B | LLM | Yes | Yes | Yes | No | - |
| Llama3.2-VL | 8B, 90B | MLLM | Yes | Yes | Yes | No | - |
| InternLM | 7B - 20B | LLM | Yes | Yes | Yes | Yes | - |
| InternLM2 | 7B - 20B | LLM | Yes | Yes | Yes | Yes | Yes |
diff --git a/lmdeploy/model.py b/lmdeploy/model.py
index f251ca18d2..26ab856bc2 100644
--- a/lmdeploy/model.py
+++ b/lmdeploy/model.py
@@ -772,7 +772,7 @@ def match(cls, model_path: str) -> Optional[str]:
return 'llama3'
-@MODELS.register_module(name='llama3_1')
+@MODELS.register_module(name=['llama3_1', 'llama3_2'])
class Llama3_1(Llama3):
"""Chat template of LLaMA3.1 model."""
diff --git a/lmdeploy/turbomind/supported_models.py b/lmdeploy/turbomind/supported_models.py
index 8ebb93fdf2..bdf129b019 100644
--- a/lmdeploy/turbomind/supported_models.py
+++ b/lmdeploy/turbomind/supported_models.py
@@ -84,9 +84,9 @@ def _is_head_dim_128(cfg):
if num_attn_head == 40:
# baichuan-13B, baichuan2-13B not supported by turbomind
support_by_turbomind = False
- elif arch == 'Qwen2ForCausalLM':
- # qwen2 0.5b size_per_head is 64, which hasn't been supported
- # by turbomind yet
+ elif arch in ['Qwen2ForCausalLM', 'LlamaForCausalLM']:
+ # the head_dim of qwen2-0.5B and llama3.2-1B is 64, which is
+ # not yet supported by turbomind
support_by_turbomind = _is_head_dim_128(cfg)
elif arch in ('ChatGLMModel', 'ChatGLMForConditionalGeneration'):
# chatglm1/2/3 is not working yet
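
For reference, the `_is_head_dim_128` gate referenced in this hunk derives the head dimension from the model config. A sketch of the arithmetic it encodes, using hypothetical standalone inputs (the real helper reads the HF `transformers` config object and sits earlier in this file):

```python
# Sketch of the head-dim gate now applied to both Qwen2 and Llama
# architectures. Field names follow HF config conventions.
def is_head_dim_128(hidden_size: int, num_attention_heads: int) -> bool:
    return hidden_size // num_attention_heads == 128

# qwen2-0.5B:  896 // 14 == 64   -> not supported by turbomind
# llama3.2-1B: 2048 // 32 == 64  -> not supported by turbomind
# llama3.2-3B: 3072 // 24 == 128 -> supported
assert not is_head_dim_128(896, 14)
assert not is_head_dim_128(2048, 32)
assert is_head_dim_128(3072, 24)
```

This is why the supported-models tables above list only the 3B variant under the TurboMind engine, while the second set of tables lists both 1B and 3B.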