This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Windows CI CUDA Intermittent error C2993 #17935

Open
ChaiBapchya opened this issue Mar 30, 2020 · 6 comments

Comments

@ChaiBapchya
Contributor

Description

Intermittent failure seen in the compilation phase of the windows-gpu pipeline (WIN_GPU / WIN_GPU_MKLDNN).

Discovered in PR #17808

Related to pytorch/pytorch#25393

Error Message

The build intermittently fails with this error:

C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2993: 'T': illegal type for non-type template parameter '__formal'

Errors:

[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2993: 'T': illegal type for non-type template parameter '__formal'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): note: see reference to class template instantiation 'thrust::detail::allocator_traits_detail::has_value_type<T>' being compiled
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2065: 'U1': undeclared identifier
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2923: 'std::_Select<__formal>::_Apply': 'U1' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ')'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ';'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2238: unexpected token(s) preceding ';'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: ')'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2988: unrecognizable template declaration/definition
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: '<end Parse>'
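
Because the failure is intermittent, it can help to tell this specific error apart from genuine build breaks when scanning a CI log. The following is a minimal sketch, not part of the MXNet CI scripts: the function names `count_msvc_errors` and `is_known_thrust_failure` are hypothetical, and the heuristic (C2993 reported from thrust's `allocator_traits.h`) is an assumption based only on the log lines quoted above.

```python
import re
from collections import Counter

# Matches MSVC diagnostics such as "error C2993:" and captures the code.
MSVC_ERROR = re.compile(r"\berror (C\d{4}):")


def count_msvc_errors(log_text):
    """Return a Counter mapping each MSVC error code to its number of occurrences."""
    return Counter(MSVC_ERROR.findall(log_text))


def is_known_thrust_failure(log_text):
    """Heuristic (an assumption): some line reports C2993 from thrust's allocator_traits.h."""
    return any(
        "allocator_traits.h" in line and "error C2993:" in line
        for line in log_text.splitlines()
    )


# Two lines copied from the log in this issue.
sample = r"""
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2993: 'T': illegal type for non-type template parameter '__formal'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2065: 'U1': undeclared identifier
"""

print(count_msvc_errors(sample))
print(is_known_thrust_failure(sample))
```

The other errors in the log (C2065, C2923, and so on) all point at the same line 42, so they look like follow-on diagnostics of the first C2993 rather than separate problems.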

Entire stack trace:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/15/nodes/39/log/?start=0

To Reproduce

On the Windows CI AMI, clone the repo and run:
py -3 ci/build_windows.py -f WIN_GPU

What have you tried to solve it?

  1. Used CUDA 10.2 instead of 9.2
  2. Updated VS2019
  3. Added the cmake flag /Zc:__cplusplus

The only thing found to work so far:
Introduced max retries = 5
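
For readers unfamiliar with this kind of workaround, here is a minimal sketch of a retry wrapper around a build command. It is not the actual change made to the MXNet CI: `run_with_retries` and its `runner` parameter are hypothetical names, and it assumes "max retries = 5" means at most five attempts in total.

```python
import subprocess
import sys


def run_with_retries(cmd, max_retries=5, runner=subprocess.run):
    """Run cmd until it exits with code 0, trying at most max_retries times.

    Returns the number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_retries + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt}/{max_retries} failed "
            f"with exit code {result.returncode}",
            file=sys.stderr,
        )
    raise RuntimeError(f"command failed after {max_retries} attempts: {cmd}")


# Example call (not executed here), using the command from "To Reproduce":
# run_with_retries(["py", "-3", "ci/build_windows.py", "-f", "WIN_GPU"])
```

A blanket retry like this also re-runs builds that fail for real reasons, so it only masks the intermittent error at the cost of extra CI time.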

@ChaiBapchya
Contributor Author

@mxnet-label-bot add [ci, windows]

@ChaiBapchya changed the title from "Windows CI CUDA Intermitted error C2993" to "Windows CI CUDA Intermittent error C2993" on Apr 2, 2020
@leezu
Contributor

leezu commented Apr 4, 2020

Created an upstream issue: NVIDIA/thrust#1090

@leezu
Contributor

leezu commented May 1, 2020

@vexilligera did you test whether the error also occurs with more recent versions of thrust? I suggest we try installing thrust 1.9.8 on Windows CI, which is the version that will ship with CUDA 11.

We already do that on Ubuntu CI:

https://github.com/apache/incubator-mxnet/blob/76fa58373636c57fee1e4e6cd7960723b39f455f/ci/docker/Dockerfile.build.ubuntu#L144-L150

@leezu
Contributor

leezu commented May 1, 2020

There is another suggested fix at pytorch/pytorch#25393 (comment)

cc @vexilligera

@leezu
Contributor

leezu commented May 9, 2020

Seems to be an nvcc bug: NVIDIA/thrust#1090 (comment)

@alliepiper

This is indeed an nvcc bug. There is no known workaround at the moment, but the next release of the CUDA toolkit will contain a fix.

Ref NVIDIA/thrust#1090.
