torch_npu support aclnn and add op #2998

Merged: 5 commits, Jan 7, 2024

Collaborator

@momo609 momo609 commented Nov 29, 2023

Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects, like MMDet or MMCls.
  • CLA has been signed and all committers have signed the CLA in this PR.

@CLAassistant

CLAassistant commented Dec 28, 2023

CLA assistant check
All committers have signed the CLA.

@zhouzaida zhouzaida linked an issue Dec 28, 2023 that may be closed by this pull request
2 tasks
@chekistcccp

Testing fails with the error below. The environment is the same as in issue #3002.
In file included from /home/ma-user/work/mmcv/mmcv/ops/csrc/common/pytorch_npu_helper.hpp:26:0,
from /home/ma-user/work/mmcv/mmcv/ops/csrc/pytorch/npu/chamfer_distance_npu.cpp:1:
/home/ma-user/work/mmcv/mmcv/ops/csrc/pytorch/npu/chamfer_distance_npu.cpp: In function ‘void chamfer_distance_backward_npu(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor)’:
/home/ma-user/work/mmcv/mmcv/ops/csrc/common/pytorch_npu_util.hpp:555:40: error: ‘utils’ is not a member of ‘torch_npu’
at::TensorOptions(torch_npu::utils::get_npu_device_type());
^
/home/ma-user/work/mmcv/mmcv/ops/csrc/common/pytorch_npu_util.hpp:555:40: note: in definition of macro ‘EXEC_NPU_CMD’
at::TensorOptions(torch_npu::utils::get_npu_device_type());
^~~~~
/home/ma-user/work/mmcv/mmcv/ops/csrc/common/pytorch_npu_util.hpp:555:40: note: suggested alternatives:
at::TensorOptions(torch_npu::utils::get_npu_device_type());
^
/home/ma-user/work/mmcv/mmcv/ops/csrc/common/pytorch_npu_util.hpp:555:40: note: in definition of macro ‘EXEC_NPU_CMD’
at::TensorOptions(torch_npu::utils::get_npu_device_type()); \

setup.py Outdated
@@ -397,12 +397,21 @@ def get_mluops_version(file_path):
elif (os.getenv('FORCE_NPU', '0') == '1'):
print(f'Compiling {ext_name} only with CPU and NPU')
try:
import imp
Collaborator

The imp module is deprecated in favor of importlib. Please use importlib.
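
A minimal sketch of what this review comment asks for, assuming the `imp` call was being used to locate an installed package's directory (the thread does not show how `setup.py` uses the result). The helper name `find_package_dir` is hypothetical; `importlib.util.find_spec` is the standard-library replacement for `imp.find_module`.

```python
import importlib.util
import os
from typing import Optional


def find_package_dir(name: str) -> Optional[str]:
    """Return the directory of an installed top-level module or package.

    Hypothetical helper for illustration; returns None when the module
    cannot be found or has no location on disk (e.g. a built-in module).
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Packages expose their directories via submodule_search_locations.
    if spec.submodule_search_locations:
        return list(spec.submodule_search_locations)[0]
    # Plain modules only have an origin file; use its parent directory.
    if spec.origin and os.path.isfile(spec.origin):
        return os.path.dirname(spec.origin)
    return None
```

In a build script this could be called as `find_package_dir('torch_npu')`, with a `None` result treated as "package not installed". Unlike `imp.find_module`, `find_spec` does not open a file handle that has to be closed afterwards.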

[[0.0900, 0.4900, 0.4900, 0.0900], [0.0900, 0.4900, 0.4900, 0.0900],
[0.7200, 0.8500, 0.4900, 0.3600]],
device='cuda')
def torch_type_trans(dtype):
Collaborator

Suggested change
- def torch_type_trans(dtype):
+ def torch_to_np_type(dtype):

[[1.6, 9.99], [2.3, 9.99], [2.3, 10.39], [1.6, 10.39]]],
device='cuda',
requires_grad=True)
def chamfer_distance_forward_gloden(xyz1, xyz2, dtype):
Collaborator

Suggested change
- def chamfer_distance_forward_gloden(xyz1, xyz2, dtype):
+ def chamfer_distance_forward_groundtruth(xyz1, xyz2, dtype):
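
For readers unfamiliar with what such a ground-truth helper computes, here is a brute-force NumPy sketch of a chamfer-distance reference. It is not the body of the function under review, which this thread does not show. The name `chamfer_reference` is hypothetical, and the sketch assumes squared Euclidean distances and inputs shaped `(batch, points, dims)`.

```python
import numpy as np


def chamfer_reference(xyz1, xyz2):
    """Brute-force nearest-neighbour distances between two point sets.

    xyz1: array of shape (B, N, D); xyz2: array of shape (B, M, D).
    Returns (dist1, dist2, idx1, idx2), where dist1[b, i] is the squared
    distance from xyz1[b, i] to its nearest point in xyz2[b], and idx1
    holds the index of that point (likewise dist2/idx2 for xyz2 -> xyz1).
    """
    # Pairwise differences via broadcasting: shape (B, N, M, D).
    diff = xyz1[:, :, None, :] - xyz2[:, None, :, :]
    # Squared Euclidean distance for every pair: shape (B, N, M).
    sq = (diff ** 2).sum(axis=-1)
    dist1, idx1 = sq.min(axis=2), sq.argmin(axis=2)
    dist2, idx2 = sq.min(axis=1), sq.argmin(axis=1)
    return dist1, dist2, idx1, idx2
```

The pairwise tensor costs O(B·N·M·D) memory, which is acceptable for small test inputs but not for production-sized point clouds.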

(bs, ns, 2)).astype(torch_type_trans(dtype))
xyz1_npu = torch.tensor(xyz1, dtype=dtype).to(device)
xyz2_npu = torch.tensor(xyz2, dtype=dtype).to(device)
expected_output = chamfer_distance_forward_gloden(xyz1, xyz2, dtype)
Collaborator

Suggested change
- expected_output = chamfer_distance_forward_gloden(xyz1, xyz2, dtype)
+ expected_output = chamfer_distance_forward_groundtruth(xyz1, xyz2, dtype)

Comment on lines 62 to 64
(bs, ns, 2)).astype(torch_type_trans(dtype))
xyz2 = np.random.uniform(-10.0, 10.0,
(bs, ns, 2)).astype(torch_type_trans(dtype))
Collaborator

Suggested change
- (bs, ns, 2)).astype(torch_type_trans(dtype))
- xyz2 = np.random.uniform(-10.0, 10.0,
- (bs, ns, 2)).astype(torch_type_trans(dtype))
+ (bs, ns, 2)).astype(torch_to_np_type(dtype))
+ xyz2 = np.random.uniform(-10.0, 10.0,
+ (bs, ns, 2)).astype(torch_to_np_type(dtype))

Hello, while using mmocr I ran into mmcv operators that are not supported on NPU. Both `import mmcv` and `import mmcv.ops` work, and several mmocr models such as dpnet, master, and fcenet train normally, but models that rely on mmcv ops fail. Specifically: sdmgr (RuntimeError: roi_align_forward_impl: implementation for device xla:1 not found.), drrg (RuntimeError: roi_align_rotated_forward_impl: implementation for device xla:1 not found.), and mask-rcnn (RuntimeError: nms_impl: implementation for device xla:1 not found.). How can this be resolved?

@chekistcccp

The same error still occurs when testing in the same environment.

@chekistcccp

chekistcccp commented Dec 29, 2023

At mmcv/mmcv/ops/csrc/common/pytorch_npu_util.hpp:22, I suggest changing the include to
#include </usr/local/Ascend/ascend-toolkit/latest/runtime/include/acl/acl_base.h>
and at mmcv/mmcv/ops/csrc/common/pytorch_npu_util.hpp:23 to
#include </usr/local/Ascend/ascend-toolkit/latest/runtime/include/acl/acl_rt.h>
The build frequently reported that these files could not be found. After switching to absolute paths, the problem no longer occurs.

@momo609
Collaborator Author

momo609 commented Jan 2, 2024

Hello, which version and build date of torch_npu is your environment using?

@chekistcccp

PyTorch version is 1.11.0, CANN version is 6.3.2, Python environment is py_3.7, OS is euler_2.8.3-aarch64, and torch-npu version is 1.11.0.post1.dev20230719.
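
Version details like these can be gathered programmatically when filing a report. The following is a rough sketch, not part of the PR: the helper name `collect_versions` is hypothetical, and the distribution names `torch` and `torch-npu` are assumed to match how the packages were installed. CANN is not a Python package, so its version has to be read separately.

```python
import platform

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Python 3.7 needs the third-party importlib_metadata backport.
    import importlib_metadata as metadata


def collect_versions(dist_names):
    """Map each distribution name to its installed version, or None."""
    report = {"python": platform.python_version()}
    for name in dist_names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            report[name] = None
    return report
```

Calling `collect_versions(["torch", "torch-npu"])` returns a dictionary that can be pasted into an issue, with `None` marking anything that is not installed.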

@momo609
Collaborator Author

momo609 commented Jan 4, 2024

Upgrading CANN and torch should resolve the problem.

@chekistcccp

Which environment did you use for testing? I will try to set up the same one on my side.

@chekistcccp

I tried pytorch:2.0.1-CANN6.3.RC2-py39 with torch-npu 2.0.1rc1 and still get the same error.

@momo609
Collaborator Author

momo609 commented Jan 4, 2024

I am using CANN 7.1.0rc4 with the latest torch-npu 1.11.0. I recommend using recent torch_npu and CANN packages.

@chekistcccp

Are you using a dedicated image? I am working through JupyterLab and cannot upgrade the environment myself.

@chekistcccp

I cannot find CANN 7.1.0rc4 at the moment. Could you point me to where it is available?

@zhouzaida zhouzaida merged commit c7c02a7 into open-mmlab:main Jan 7, 2024
18 of 20 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug] Unable to compile and install MMCV on Huawei Ascend 910
6 participants