
build(ascend): add Dockerfile for ascend aarch64 910B #2278

Merged
merged 4 commits into InternLM:main from add_ascend_dockerfile
Aug 28, 2024

Conversation

CyCle1024
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and more likely to receive feedback. If you do not understand some of the items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Provide a Dockerfile for running the Ascend backend with the PyTorch engine. Currently, only a Dockerfile for the aarch64 platform is provided.

Modification

Add Dockerfile for ascend aarch64 910B

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of downstream repositories?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to stay compatible with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
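
For instance, a minimal build-and-run sketch (the Dockerfile name, image tag, and device/driver mounts below are illustrative assumptions, not taken from this PR):

# Build the Ascend aarch64 image (Dockerfile name and tag are placeholders).
docker build -t lmdeploy-ascend:910b -f Dockerfile_aarch64_ascend .

# Run it with the usual Ascend device and driver mounts (paths are the common
# Ascend defaults and may differ on your host).
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    lmdeploy-ascend:910b bash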

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@huyz-git

huyz-git commented Aug 12, 2024

I got this error when trying to import torch_dipu inside the container:

ImportError: /deeplink/deeplink.framework/dipu/torch_dipu/libtorch_dipu.so: undefined symbol: aclprofSetStampCallStack

The CANN used in the container is 8.0.RC3.alpha001

@CyCle1024
Collaborator Author

I got this error when trying to import torch_dipu inside the container:

ImportError: /deeplink/deeplink.framework/dipu/torch_dipu/libtorch_dipu.so: undefined symbol: aclprofSetStampCallStack

The CANN used in the container is 8.0.RC3.alpha001

deeplink.framework supports 8.0.RC1.alpha003; other versions have not been tested yet.
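
As a quick sanity check inside the container, you can re-run the failing import directly; this is just the import from the traceback above (nothing PR-specific), and it only succeeds when the installed CANN matches what deeplink.framework was built against:

# Re-run the import that failed in the traceback above.
python3 -c "import torch_dipu; print('torch_dipu imported OK')"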

@yunfwe

yunfwe commented Aug 15, 2024

After building the Docker image, I ran: lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch
and got the error shown in the screenshot below.
[error screenshot]
However, the triton library does not provide a pre-built package for aarch64, and building it from source also failed.

@CyCle1024
Collaborator Author

After building the Docker image, I ran lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch and got the error in the screenshot; the triton library does not provide a pre-built package for aarch64, and building it from source also failed.

Qwen2-7B-Instruct is not currently among the models supported on the Ascend platform, and api_server does not yet support a device_type argument for selecting the Ascend backend.

@CyCle1024
Collaborator Author

CyCle1024 commented Aug 15, 2024

After building the Docker image, I ran lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch and got the error in the screenshot; the triton library does not provide a pre-built package for aarch64, and building it from source also failed.

@yunfwe The currently supported models are llama2-7b, internlm2-7b, and mixtral-8x7b. You can refer to the following script for static inference; the chat functionality is still under development:

import deeplink_ext
import lmdeploy
from lmdeploy import PytorchEngineConfig

if __name__ == "__main__":
    backend_config = PytorchEngineConfig(tp=1, cache_max_entry_count=0.3,
                                         device_type="ascend")
    pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b",
                             backend_config=backend_config)
    question = ["上海有什么美食?"]
    response = pipe(question, request_output_len=128, do_preprocess=True)
    for idx, r in enumerate(response):
        print(f"Question: {question[idx]}")
        print(f"Answer: {r.text}")
        print()

@yunfwe

yunfwe commented Aug 15, 2024

@yunfwe The currently supported models are llama2-7b, internlm2-7b, and mixtral-8x7b; see the static inference script above. The chat functionality is still under development.

Thanks for the explanation.

@lvhan028
Collaborator

@RunningLeon may open another PR to add device_type in CLI
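
For reference, once such a PR lands, the invocation would presumably look something like the following; the --device flag name and value are assumptions about a future CLI, not something this PR provides, and the model is one of the currently supported ones listed above:

# Hypothetical future CLI: select the Ascend backend directly from api_server.
lmdeploy serve api_server internlm/internlm2-chat-7b --backend pytorch --device ascend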

@RunningLeon
Collaborator

@RunningLeon may open another PR to add device_type in CLI

OK

@huyz-git

huyz-git commented Aug 15, 2024

Using CANN version 8.0.RC1.alpha003, I can successfully run the container.
However, after I set the device_type parameter and ran the lmdeploy API server on the Ascend backend, inference was extremely slow compared to Ascend MindIE. Is this normal?

           Prefill performance (token/s)      Decode performance (token/s)
           batch_size = 1 / 10 / 50           batch_size = 1 / 10 / 100 / 200
           input = 1000 tokens, output = 1    input = 1 token, output = 100 tokens
lmdeploy   1238 / 1693 / 1837                 15 / 131 / 454 / 441
MindIE     11458 / 18956 / 20061              68 / 643 / 4435 / 4442

The model is Yi-1.5-6B-Chat.

@jinminxi104
Collaborator

Using CANN version 8.0.RC1.alpha003 I can successfully run the container. However, after I set the device_type parameter and ran the lmdeploy API server on the Ascend backend, inference was extremely slow compared to Ascend MindIE. Is this normal? (See the performance table above; the model is Yi-1.5-6B-Chat.)

The current version is slower than MindIE. It is based on eager mode and is not fully optimized (if you have a Huawei machine with an Intel CPU, you can get 3x the performance without any changes). MindIE is based on graph mode, so it shows better performance. We are working on graph mode and will release the graph-mode version for the 910B in lmdeploy by the end of October.

@dengyingxu

dengyingxu commented Aug 16, 2024

[error screenshot] Has anyone run into the error ValueError: xpu is not available, you should use device="cpu" instead? I am using RC1 with a 910B2C.

@CyCle1024
Collaborator Author

[error screenshot] Has anyone run into the error ValueError: xpu is not available, you should use device="cpu" instead? I am using RC1 with a 910B2C.

Could you attach your test script test_deploy.py?

pip3 install pathlib2 protobuf attrs attr scipy && \
pip3 install requests psutil absl-py && \
pip3 install torch==2.1.1 torchvision==0.16.1 --index-url=https://download.pytorch.org/whl/cpu && \
pip3 install transformers==4.38.0 && \
Collaborator

Do we have to specify the version of transformers?

Collaborator Author

Do we have to specify the version of transformers?

We have only tested this specific version. It might be possible to relax it to the 4.38.0-4.41.2 range, but we would need to test that first.
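
If that range is validated, the Dockerfile line could express it as a bound rather than a pin; a sketch, assuming the wider range actually passes testing:

# Untested assumption: only 4.38.0 is verified today; widen only after testing the range.
pip3 install "transformers>=4.38.0,<=4.41.2"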

Comment on lines +119 to +122
RUN echo -e "diff --git a/impl/ascend_npu/CMakeLists.txt b/impl/ascend_npu/CMakeLists.txt\n\
index e684c59..f1cd8d4 100755\n\
--- a/impl/ascend_npu/CMakeLists.txt\n\
+++ b/impl/ascend_npu/CMakeLists.txt\n\
Collaborator

What's this for?

Collaborator Author

It is a patch to suppress a warning; as you mentioned before, such warnings can be misread by users as errors.

index e684c59..f1cd8d4 100755\n\
--- a/impl/ascend_npu/CMakeLists.txt\n\
+++ b/impl/ascend_npu/CMakeLists.txt\n\
@@ -14,6 +14,11 @@ FetchContent_Declare(op_plugin\n\
Collaborator

Why not put this in a CMakeLists file?

Collaborator Author

This patch can't be merged into the upstream main branch, so we use this approach as a workaround.
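
The general pattern is to embed the diff text in the Dockerfile and apply it at build time, so nothing has to land in the upstream repository. A minimal sketch of that idea follows; the patch file name, repository path, and apply step are illustrative, not the exact lines from this Dockerfile:

# Illustrative only: write the inline diff to a temporary file and apply it before building.
RUN echo -e "diff --git a/impl/ascend_npu/CMakeLists.txt b/impl/ascend_npu/CMakeLists.txt\n\
...patch body...\n" > /tmp/silence-warning.patch && \
    cd /path/to/patched/repo && git apply /tmp/silence-warning.patch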

sed -i 's@http://mirrors.tuna.tsinghua.edu.cn@https://mirrors.tuna.tsinghua.edu.cn@g' /etc/apt/sources.list && \
apt clean && rm -rf /var/lib/apt/lists/*

RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7 \
Collaborator

Why install both gcc-7 and gcc-9?

Collaborator

Is gcov necessary?

Collaborator Author
@CyCle1024 CyCle1024 Aug 23, 2024

Why install both gcc-7 and gcc-9

The OS's built-in gcc is 9.4.0, but due to a strange compiler error we must use gcc 7.5.0 to build deeplink.framework. The update-alternatives command is a proper way to make gcc 7.5.0 the default while keeping gcc 9.4.0 available for the OS.

Collaborator Author

Is gcov necessary?

I think gcov is not strictly necessary; this piece of code just keeps the gcc toolchain versions consistent, since update-alternatives only switches between different versions of the same program. I am following the update-alternatives manual here.
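
For context, the standard update-alternatives pattern is to register each gcc version once, give gcc-7 the higher priority, and attach g++ and gcov as slave links so the whole toolchain switches together. A sketch of that pattern, with priorities and the exact slave set chosen for illustration rather than copied from the Dockerfile:

# Register gcc-7 with the higher priority so it becomes the default.
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 70 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-7 \
    --slave /usr/bin/gcov gcov /usr/bin/gcov-7
# Keep gcc-9 registered so the OS toolchain remains available.
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 60 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-9 \
    --slave /usr/bin/gcov gcov /usr/bin/gcov-9
# "update-alternatives --config gcc" switches between them interactively.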

@lvhan028 lvhan028 merged commit d04b37f into InternLM:main Aug 28, 2024
3 checks passed
@yunyu-Mr

Any image on Docker Hub?

@CyCle1024 CyCle1024 deleted the add_ascend_dockerfile branch October 17, 2024 03:59
@sisrfeng

sisrfeng commented Nov 4, 2024

The current version is slower than MindIE. [...] We are working on graph mode and will release the graph-mode version of the 910B in lmdeploy by the end of October.

Will LMDeploy become a competitor to MindIE? As a user of Ascend 910B, which inference and serving engine should I choose?
Related issue:
vllm-project/vllm#8054 (comment)

@jinminxi104
Collaborator

Any image on Docker Hub?

No, please use the Dockerfile (for some compliance reasons).

@jinminxi104
Collaborator

Will LMDeploy become a competitor to MindIE?

Yes. We have a graph mode and capture the graph via torch.dynamo.

@huyz-git

Will LMDeploy become a competitor to MindIE?

Yes. We have a graph mode and capture the graph via torch.dynamo.

I tested the performance of the graph mode:

[performance screenshot]

For a single request, the graph mode's speed is much closer to MindIE than the eager mode's. However, for batched requests, the graph mode is still far slower than MindIE.

Also, in the prefill stage with batched requests, the graph mode is even slower than the eager mode.

@jinminxi104
Collaborator

For a single request, the graph mode is much closer to MindIE than the eager mode; for batched requests it is still far slower than MindIE, and batched prefill is even slower than the eager mode.

Thanks for your testing. Assuming you have a Kunpeng CPU, here is my response (an Intel CPU would be a totally different story):

  • For the single-NPU test, the results meet our expectations. Large batch sizes hurt performance because the Kunpeng CPU performs poorly on detokenization; we are working on this issue.
  • For the 4-NPU test with small batch sizes, we will analyze the gap to MindIE.

Actually, I read my MindIE code (I am not sure it is the same code you have); it has simpler post-processing, no dynamic memory allocation, and no streaming output.
(Our latest release is lmdeploy 0.6.3 with dlinfer-ascend 0.1.2.)

@huyz-git

huyz-git commented Nov 22, 2024

Thanks for your testing. Assuming you have a Kunpeng CPU: large batch sizes hurt performance because the Kunpeng CPU performs poorly on detokenization, and we are working on this issue; for the 4-NPU test with small batch sizes, we will analyze the gap to MindIE.

Thanks for the reply.

During testing, I also found some strange behavior: the graph mode sometimes gets stuck for a while, with a single CPU core at 100% but nearly no NPU usage.

Specifically, after the inference server starts:

  • I test the decode phase first. I start with batch size 1, and the server then gets stuck for a while; I have to re-run the test to get the expected result.
  • Next I test batch size 10, and the server again gets stuck for a while; I also have to re-run that test.
  • After that, the tests with batch sizes 100 and 200 are normal, and the server no longer gets stuck.
  • When the decode-phase tests finish, I start testing the prefill phase. The situation is similar: I first test batch size 1 and the server gets stuck for a while.
  • After that, the tests with batch sizes 10 and 50 are normal.

This also happens in normal usage. After the server starts:

  • First I open a terminal and send a streaming request with curl, and the server gets stuck for a while.
  • After that request finishes, I re-send the request and it is processed normally.
  • Then I open two terminals and send a streaming request with a large output length. While that stream is running but not yet finished, I send another streaming request from the second terminal. The first stream now gets stuck; after a while it resumes, and then the second stream starts.

Eager mode does not show this behavior.

Is this a bug or expected behavior?

@jinminxi104
Collaborator

During testing, I also found some strange behavior: the graph mode sometimes gets stuck for a while, with a single CPU core at 100% but nearly no NPU usage. [...] Eager mode does not show this behavior. Is this a bug or expected behavior?

Sorry for my late response.
The "stuck for a while" period is the warmup, which only occurs in graph mode. During the warmup phase, the PyTorch code is compiled into calls to the Ascend toolkit; after warmup, we call the compiled function to boost performance.
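
One way to hide that cost from users is to send a small request for each batch size you care about right after the server starts, so the compilation happens before real traffic arrives. A sketch using the OpenAI-compatible endpoint that api_server exposes; the port is lmdeploy's default and the model name is a placeholder:

# Warm up the graph-mode server once before serving real traffic (port and model are assumptions).
curl -s http://127.0.0.1:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "internlm2", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 8}'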
