
build(ascend): add Dockerfile for ascend aarch64 910B #2278

Merged
merged 4 commits into InternLM:main from add_ascend_dockerfile
Aug 28, 2024

Conversation

CyCle1024
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and more likely to receive feedback. If you do not understand some of the items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Provide a Dockerfile for running the Ascend backend with the PyTorch engine. Currently, only a Dockerfile for the aarch64 platform is provided.

Modification

Add Dockerfile for ascend aarch64 910B

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of downstream repositories?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to stay compatible with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
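
For instance, a minimal build-and-run sketch (the Dockerfile name, image tag, and device/driver mounts below are illustrative assumptions, not taken from this PR):

# Build the Ascend aarch64 image (Dockerfile name and tag are placeholders).
docker build -t lmdeploy-ascend:910b -f Dockerfile_aarch64_ascend .

# Run it with the usual Ascend device and driver mounts (paths are the common
# Ascend defaults and may differ on your host).
docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    lmdeploy-ascend:910b bash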

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@huyz-git

huyz-git commented Aug 12, 2024

I got this error when trying to import torch_dipu inside the container:

ImportError: /deeplink/deeplink.framework/dipu/torch_dipu/libtorch_dipu.so: undefined symbol: aclprofSetStampCallStack

The CANN used in the container is 8.0.RC3.alpha001

@CyCle1024
Collaborator Author

I got this error when trying to import torch_dipu inside the container:

ImportError: /deeplink/deeplink.framework/dipu/torch_dipu/libtorch_dipu.so: undefined symbol: aclprofSetStampCallStack

The CANN used in the container is 8.0.RC3.alpha001

deeplink.framework supports 8.0.RC1.alpha003; other versions have not been tested yet.
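
As a quick sanity check inside the container, you can re-run the failing import directly; this is just the import from the traceback above (nothing PR-specific), and it only succeeds when the installed CANN matches what deeplink.framework was built against:

# Re-run the import that failed in the traceback above.
python3 -c "import torch_dipu; print('torch_dipu imported OK')"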

@yunfwe

yunfwe commented Aug 15, 2024

After building the Docker image, I ran: lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch
and got the error shown in the screenshot below.
[error screenshot]
However, the triton library does not provide a pre-built package for aarch64, and building it from source also failed.

@CyCle1024
Collaborator Author

After building the Docker image, I ran lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch and got the error in the screenshot; the triton library does not provide a pre-built package for aarch64, and building it from source also failed.

Qwen2-7B-Instruct is not currently among the models supported on the Ascend platform, and api_server does not yet support a device_type argument for selecting the Ascend backend.

@CyCle1024
Collaborator Author

CyCle1024 commented Aug 15, 2024

After building the Docker image, I ran lmdeploy serve api_server Qwen2-7B-Instruct --backend pytorch and got the error in the screenshot; the triton library does not provide a pre-built package for aarch64, and building it from source also failed.

@yunfwe The currently supported models are llama2-7b, internlm2-7b, and mixtral-8x7b. You can refer to the following script for static inference; the chat functionality is still under development:

import deeplink_ext
import lmdeploy
from lmdeploy import PytorchEngineConfig

if __name__ == "__main__":
    backend_config = PytorchEngineConfig(tp=1, cache_max_entry_count=0.3,
                                         device_type="ascend")
    pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b",
                             backend_config=backend_config)
    question = ["上海有什么美食?"]
    response = pipe(question, request_output_len=128, do_preprocess=True)
    for idx, r in enumerate(response):
        print(f"Question: {question[idx]}")
        print(f"Answer: {r.text}")
        print()

@yunfwe

yunfwe commented Aug 15, 2024

@yunfwe The currently supported models are llama2-7b, internlm2-7b, and mixtral-8x7b; see the static inference script above. The chat functionality is still under development.

Thanks for the explanation.

@lvhan028
Collaborator

@RunningLeon may open another PR to add device_type in CLI
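
For reference, once such a PR lands, the invocation would presumably look something like the following; the --device flag name and value are assumptions about a future CLI, not something this PR provides, and the model is one of the currently supported ones listed above:

# Hypothetical future CLI: select the Ascend backend directly from api_server.
lmdeploy serve api_server internlm/internlm2-chat-7b --backend pytorch --device ascend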

@RunningLeon
Collaborator

@RunningLeon may open another PR to add device_type in CLI

OK

@huyz-git

huyz-git commented Aug 15, 2024

Using CANN version 8.0.RC1.alpha003, I can successfully run the container.
However, after I set the device_type parameter and ran the lmdeploy API server on the Ascend backend, inference was extremely slow compared to Ascend MindIE. Is this normal?

           Prefill performance (token/s)      Decode performance (token/s)
           batch_size = 1 / 10 / 50           batch_size = 1 / 10 / 100 / 200
           input = 1000 tokens, output = 1    input = 1 token, output = 100 tokens
lmdeploy   1238 / 1693 / 1837                 15 / 131 / 454 / 441
MindIE     11458 / 18956 / 20061              68 / 643 / 4435 / 4442

The model is Yi-1.5-6B-Chat.

@jinminxi104
Collaborator

Using CANN version 8.0.RC1.alpha003 I can successfully run the container. However, after I set the device_type parameter and ran the lmdeploy API server on the Ascend backend, inference was extremely slow compared to Ascend MindIE. Is this normal? (See the performance table above; the model is Yi-1.5-6B-Chat.)

The current version is slower than MindIE. It is based on eager mode and is not fully optimized (if you have a Huawei machine with an Intel CPU, you can get 3x the performance without any changes). MindIE is based on graph mode, so it shows better performance. We are working on graph mode and will release the graph-mode version for the 910B in lmdeploy by the end of October.

@dengyingxu

dengyingxu commented Aug 16, 2024

[error screenshot] Has anyone run into the error ValueError: xpu is not available, you should use device="cpu" instead? I am using RC1 with a 910B2C.

@CyCle1024
Collaborator Author

[error screenshot] Has anyone run into the error ValueError: xpu is not available, you should use device="cpu" instead? I am using RC1 with a 910B2C.

Could you attach your test script test_deploy.py?

pip3 install pathlib2 protobuf attrs attr scipy && \
pip3 install requests psutil absl-py && \
pip3 install torch==2.1.1 torchvision==0.16.1 --index-url=https://download.pytorch.org/whl/cpu && \
pip3 install transformers==4.38.0 && \
Collaborator

Do we have to specify the version of transformers?

Collaborator Author

Do we have to specify the version of transformers?

We have only tested this specific version. It might be possible to relax it to the 4.38.0-4.41.2 range, but we would need to test that first.
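
If that range is validated, the Dockerfile line could express it as a bound rather than a pin; a sketch, assuming the wider range actually passes testing:

# Untested assumption: only 4.38.0 is verified today; widen only after testing the range.
pip3 install "transformers>=4.38.0,<=4.41.2"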

Comment on lines +119 to +122
RUN echo -e "diff --git a/impl/ascend_npu/CMakeLists.txt b/impl/ascend_npu/CMakeLists.txt\n\
index e684c59..f1cd8d4 100755\n\
--- a/impl/ascend_npu/CMakeLists.txt\n\
+++ b/impl/ascend_npu/CMakeLists.txt\n\
Collaborator

What's this for?

Collaborator Author

It is a patch to suppress a warning; as you mentioned before, such warnings can be misread by users as errors.

index e684c59..f1cd8d4 100755\n\
--- a/impl/ascend_npu/CMakeLists.txt\n\
+++ b/impl/ascend_npu/CMakeLists.txt\n\
@@ -14,6 +14,11 @@ FetchContent_Declare(op_plugin\n\
Collaborator

Why not put this in a CMakeLists file?

Collaborator Author

This patch can't be merged into the upstream main branch, so we use this approach as a workaround.
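
The general pattern is to embed the diff text in the Dockerfile and apply it at build time, so nothing has to land in the upstream repository. A minimal sketch of that idea follows; the patch file name, repository path, and apply step are illustrative, not the exact lines from this Dockerfile:

# Illustrative only: write the inline diff to a temporary file and apply it before building.
RUN echo -e "diff --git a/impl/ascend_npu/CMakeLists.txt b/impl/ascend_npu/CMakeLists.txt\n\
...patch body...\n" > /tmp/silence-warning.patch && \
    cd /path/to/patched/repo && git apply /tmp/silence-warning.patch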

sed -i 's@http://mirrors.tuna.tsinghua.edu.cn@https://mirrors.tuna.tsinghua.edu.cn@g' /etc/apt/sources.list && \
apt clean && rm -rf /var/lib/apt/lists/*

RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7 \
Collaborator

Why install both gcc-7 and gcc-9?

Collaborator

Is gcov necessary?

Collaborator Author
@CyCle1024 CyCle1024 Aug 23, 2024

Why install both gcc-7 and gcc-9

The OS's built-in gcc is 9.4.0, but due to a strange compiler error we must use gcc 7.5.0 to build deeplink.framework. The update-alternatives command is a proper way to make gcc 7.5.0 the default while keeping gcc 9.4.0 available for the OS.

Collaborator Author

Is gcov necessary?

I think gcov is not strictly necessary; this piece of code just keeps the gcc toolchain versions consistent, since update-alternatives only switches between different versions of the same program. I am following the update-alternatives manual here.
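
For context, the standard update-alternatives pattern is to register each gcc version once, give gcc-7 the higher priority, and attach g++ and gcov as slave links so the whole toolchain switches together. A sketch of that pattern, with priorities and the exact slave set chosen for illustration rather than copied from the Dockerfile:

# Register gcc-7 with the higher priority so it becomes the default.
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 70 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-7 \
    --slave /usr/bin/gcov gcov /usr/bin/gcov-7
# Keep gcc-9 registered so the OS toolchain remains available.
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 60 \
    --slave /usr/bin/g++ g++ /usr/bin/g++-9 \
    --slave /usr/bin/gcov gcov /usr/bin/gcov-9
# "update-alternatives --config gcc" switches between them interactively.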

@lvhan028 lvhan028 merged commit d04b37f into InternLM:main Aug 28, 2024
3 checks passed
@yunyu-Mr

Any image on Docker Hub?

@CyCle1024 CyCle1024 deleted the add_ascend_dockerfile branch October 17, 2024 03:59
@sisrfeng

sisrfeng commented Nov 4, 2024

The current version is slower than MindIE. [...] We are working on graph mode and will release the graph-mode version of the 910B in lmdeploy by the end of October.

Will LMDeploy become a competitor to MindIE? As a user of Ascend 910B, which inference and serving engine should I choose?
Related issue:
vllm-project/vllm#8054 (comment)

@jinminxi104
Collaborator

Any image on Docker Hub?

No, please use the Dockerfile (for some compliance reasons).

@jinminxi104
Collaborator

Will LMDeploy become a competitor to MindIE?

Yes. We have a graph mode and capture the graph via torch.dynamo.

@huyz-git

Will LMDeploy become a competitor to MindIE?

Yes. We have a graph mode and capture the graph via torch.dynamo.

I tested the performance of the graph mode:

[performance screenshot]

For a single request, the graph mode's speed is much closer to MindIE than the eager mode's. However, for batched requests, the graph mode is still far slower than MindIE.

Also, in the prefill stage with batched requests, the graph mode is even slower than the eager mode.

@jinminxi104
Collaborator

For a single request, the graph mode is much closer to MindIE than the eager mode; for batched requests it is still far slower than MindIE, and batched prefill is even slower than the eager mode.

Thanks for your testing. Assuming you have a Kunpeng CPU, here is my response (an Intel CPU would be a totally different story):

  • For the single-NPU test, the results meet our expectations. Large batch sizes hurt performance because the Kunpeng CPU performs poorly on detokenization; we are working on this issue.
  • For the 4-NPU test with small batch sizes, we will analyze the gap to MindIE.

Actually, I read my MindIE code (I am not sure it is the same code you have); it has simpler post-processing, no dynamic memory allocation, and no streaming output.
(Our latest release is lmdeploy 0.6.3 with dlinfer-ascend 0.1.2.)

@huyz-git

huyz-git commented Nov 22, 2024

Thanks for your testing. Assuming you have a Kunpeng CPU: large batch sizes hurt performance because the Kunpeng CPU performs poorly on detokenization, and we are working on this issue; for the 4-NPU test with small batch sizes, we will analyze the gap to MindIE.

Thanks for the reply.

During testing, I also found some strange behavior: the graph mode sometimes gets stuck for a while, with a single CPU core at 100% but nearly no NPU usage.

Specifically, after the inference server starts:

  • I test the decode phase first. I start with batch size 1, and the server then gets stuck for a while; I have to re-run the test to get the expected result.
  • Next I test batch size 10, and the server again gets stuck for a while; I also have to re-run that test.
  • After that, the tests with batch sizes 100 and 200 are normal, and the server no longer gets stuck.
  • When the decode-phase tests finish, I start testing the prefill phase. The situation is similar: I first test batch size 1 and the server gets stuck for a while.
  • After that, the tests with batch sizes 10 and 50 are normal.

This also happens in normal usage. After the server starts:

  • First I open a terminal and send a streaming request with curl, and the server gets stuck for a while.
  • After that request finishes, I re-send the request and it is processed normally.
  • Then I open two terminals and send a streaming request with a large output length. While that stream is running but not yet finished, I send another streaming request from the second terminal. The first stream now gets stuck; after a while it resumes, and then the second stream starts.

Eager mode does not show this behavior.

Is this a bug or expected behavior?

@jinminxi104
Collaborator

During testing, I also found some strange behavior: the graph mode sometimes gets stuck for a while, with a single CPU core at 100% but nearly no NPU usage. [...] Eager mode does not show this behavior. Is this a bug or expected behavior?

Sorry for my late response.
The "stuck for a while" period is the warmup, which only occurs in graph mode. During the warmup phase, the PyTorch code is compiled into calls to the Ascend toolkit; after warmup, we call the compiled function to boost performance.
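
One way to hide that cost from users is to send a small request for each batch size you care about right after the server starts, so the compilation happens before real traffic arrives. A sketch using the OpenAI-compatible endpoint that api_server exposes; the port is lmdeploy's default and the model name is a placeholder:

# Warm up the graph-mode server once before serving real traffic (port and model are assumptions).
curl -s http://127.0.0.1:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "internlm2", "messages": [{"role": "user", "content": "hi"}], "max_tokens": 8}'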
