
Releases: InternLM/lmdeploy

LMDeploy Release v0.6.4

09 Dec 12:08
14b64c7

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • disable prefix-caching for vl model by @grimoire in #2825
  • Fix gemma2 accuracy through the correct softcapping logic by @AllentDan in #2842
  • fix accessing before initialization by @lvhan028 in #2845
  • fix the logic to verify whether AutoAWQ has been successfully installed by @grimoire in #2844
  • check whether backend_config is None or not before accessing its attr by @lvhan028 in #2848
  • [ascend] convert kv cache to nd format in ascend graph mode by @tangzhiyi11 in #2853

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.6.3...v0.6.4

LMDeploy Release v0.6.3

16 Nov 04:31
0c80baa

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.6.2...v0.6.3

LMDeploy Release v0.6.2.post1

07 Nov 07:41
4fc9479

What's Changed

🐞 Bug fixes

🌐 Other

Full Changelog: v0.6.2...v0.6.2.post1

LMDeploy Release v0.6.2

29 Oct 06:42
522108c

Highlights

  • The PyTorch engine supports graph mode on the Ascend platform, doubling the inference speed (see the sketch after this list)
  • Support llama3.2-vision models in the PyTorch engine
  • Support Mixtral in the TurboMind engine, achieving 20+ RPS on the ShareGPT dataset with 2 A100-80G GPUs
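
A minimal sketch, not taken from the release notes, of how one might enable the PyTorch engine's graph mode on Ascend through the Python pipeline API. The model path is illustrative, and it assumes PytorchEngineConfig exposes device_type and eager_mode as in recent LMDeploy releases.

# Illustrative model path; device_type / eager_mode assumed available on PytorchEngineConfig.
from lmdeploy import pipeline, PytorchEngineConfig

backend_config = PytorchEngineConfig(
    device_type='ascend',  # run the PyTorch engine on Huawei Ascend instead of CUDA
    eager_mode=False,      # False enables graph mode; True falls back to eager execution
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
print(pipe(['Hello, Ascend!']))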

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.6.1...v0.6.2

LMDeploy Release v0.6.1

28 Sep 11:34
2e49fc3

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

🌐 Other

New Contributors

Full Changelog: v0.6.0...v0.6.1

LMDeploy Release v0.6.0

13 Sep 03:12
e2aa4bd

Highlights

  • Optimize W4A16 quantized model inference by implementing GEMM in the TurboMind engine
    • Add GPTQ-INT4 inference
    • Support CUDA architectures SM70 and above, i.e., V100 and newer
  • Refactor PytorchEngine
    • Employ CUDA graphs to boost inference performance (~30%)
    • Support more models on the Huawei Ascend platform
  • Upgrade GenerationConfig (see the sampling sketch after the examples below)
    • Support min_p sampling
    • Add do_sample=False as the default option
    • Remove EngineGenerationConfig and merge it into GenerationConfig
  • Support guided decoding
  • Distinguish between the name of the deployed model and the name of the model's chat template
    Before:
lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
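
A minimal sampling sketch for the upgraded GenerationConfig, assuming the Python pipeline API and an illustrative model path; it shows the new do_sample flag and min_p option that accompany the removal of EngineGenerationConfig.

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')  # illustrative model path

# do_sample now defaults to False (greedy decoding); set it to True so that
# temperature / top_p / top_k / min_p sampling take effect.
gen_config = GenerationConfig(
    do_sample=True,
    min_p=0.1,
    temperature=0.8,
    max_new_tokens=256,
)
print(pipe(['Explain W4A16 quantization in one sentence.'], gen_config=gen_config))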

Breaking Changes

  • TurboMind model converter. Please re-convert your models if you use this feature
  • EngineGenerationConfig is removed. Please use GenerationConfig instead
  • Chat template. Please use --chat-template to specify it

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue of missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from the model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of updating engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
  • fix cache position for pytorch engine by @RunningLeon in #2388
  • Fix /v1/completions batch order wrong by @AllentDan in #2395
  • Fix some issues encountered by modelscope and community by @irexyc in #2428
  • fix llama3 rotary in pytorch engine by @grimoire in #2444
  • fix tensors on different devices when deploying MiniCPM-V-2_6 with tensor parallelism by @irexyc in #2454
  • fix MultinomialSampling operator builder by @grimoire in #2460
  • Fix initialization of runtime_min_p by @irexyc in #2461
  • fix Windows compile error by @zhyncs in #2303
  • fix: follow up #2303 by @zhyncs in #2307

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0

LMDeploy Release v0.6.0a0

26 Aug 09:12
97b880b

Highlights

  • Optimize W4A16 quantized model inference by implementing GEMM in the TurboMind engine
    • Add GPTQ-INT4 inference (see the loading sketch after the examples below)
    • Support CUDA architectures SM70 and above, i.e., V100 and newer
  • Optimize the prefilling stage of PyTorchEngine inference
  • Distinguish between the name of the deployed model and the name of the model's chat template

Before:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json 

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
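
A minimal loading sketch for GPTQ-INT4 inference with the TurboMind engine; the checkpoint path is illustrative, and model_format='gptq' is assumed to be the switch introduced alongside this feature.

from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'your-org/your-model-gptq-int4',  # illustrative GPTQ-INT4 checkpoint
    backend_config=TurbomindEngineConfig(model_format='gptq'),  # assumed format switch
)
print(pipe(['Hello']))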

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue of missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from the model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of updating engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0a0

LMDeploy Release v0.5.3

07 Aug 03:38
a129a14

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.2...v0.5.3

LMDeploy Release v0.5.2.post1

26 Jul 12:22
fb6f8ea

What's Changed

🐞 Bug fixes

  • [Hotfix] missing parentheses when calculating the coefficient of llama3 rope, which caused the needle-in-a-haystack experiment to fail by @lvhan028 in #2157

🌐 Other

Full Changelog: v0.5.2...v0.5.2.post1

LMDeploy Release v0.5.2

26 Jul 08:07
7199b4e

Highlight

  • LMDeploy supports Llama 3.1 and its tool calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found here; a minimal request sketch follows below
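
A minimal request sketch against the OpenAI-compatible api_server started by lmdeploy serve api_server; the server address and the get_weather tool are illustrative assumptions, standing in for the Wolfram Alpha tool in the linked example.

from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# Hypothetical tool definition used only to keep the sketch self-contained.
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Get the current weather for a city',
        'parameters': {
            'type': 'object',
            'properties': {'city': {'type': 'string'}},
            'required': ['city'],
        },
    },
}]

resp = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{'role': 'user', 'content': "What's the weather like in Shanghai?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)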

What's Changed

🚀 Features

💥 Improvements

  • Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
  • Remove kv cache offline quantization by @AllentDan in #2097
  • Remove session_len and deprecated short names of the chat templates by @lvhan028 in #2105
  • clarify "n>1" in GenerationConfig hasn't been supported yet by @lvhan028 in #2108

🐞 Bug fixes

🌐 Other

Full Changelog: v0.5.1...v0.5.2