[cuda] Add CUDA version check #3249

Merged: 4 commits merged into taichi-dev:master on Oct 24, 2021
Conversation

@0xzhang (Collaborator) commented Oct 22, 2021

Related issue = fixes #3195

Issue
CUDADriver has two behaviours:

  1. Load the library;
  2. Load the CUDA functions.

Previously, an error was thrown when loading the CUDA functions failed, because there was no CUDA version check: the code assumed that loading the CUDA functions would always succeed.

After this fix, the behaviour of with_cuda is consistent with with_opengl: if CUDA is not supported, the arch falls back to cpu directly instead of throwing an error as it did before.

Implementation
I implemented it this way because:

  1. Version checking should be done after the library has been successfully loaded.
  2. Version checking should be done before the many CUDA functions are loaded.
  3. A lower-version library can load successfully, while the version check fails and the functions cannot be loaded.
  4. driver_get_version() is one of the CUDA functions in cuda_driver_functions.inc.h, so I extracted this function from that file.
  5. detected() is called twice. The first time, it checks the status of library loading while CUDADriver is being constructed. The second time, it checks whether the CUDA driver is ready. Therefore I added a member flag cuda_version_valid initialized to true. While CUDADriver is being constructed this flag is meaningless; after the instance is constructed, the return value of detected() depends on the actual CUDA driver status. (A sketch of this flow follows below.)
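
Below is a minimal sketch of that flow (illustrative only; load_library(), load_functions() and the exact call sites are placeholders, not the PR's actual code). The library is loaded first, driver_get_version() is resolved and called before the rest of cuda_driver_functions.inc.h, and detected() combines the load status with the version flag:

// Sketch only; the real logic lives in Taichi's CUDADriver.
class CUDADriverSketch {
 public:
  CUDADriverSketch() {
    loaded_ = load_library();      // 1. try to load the CUDA driver library
    if (!loaded_)
      return;                      // detected() already returns false here
    int version = 0;
    driver_get_version(&version);  // 2. query the version before loading
                                   //    the remaining CUDA functions
    if (version < 10000) {         // 3. driver older than CUDA 10.0
      cuda_version_valid = false;
      return;
    }
    load_functions();              // 4. safe to load the full function set
  }

  bool detected() const {
    // False while the library is missing; otherwise follows the version flag.
    return loaded_ && cuda_version_valid;
  }

 private:
  bool load_library() { return false; }        // placeholder
  void driver_get_version(int *v) { *v = 0; }  // placeholder
  void load_functions() {}                     // placeholder
  bool loaded_{false};
  bool cuda_version_valid{true};  // starts true, as described in point 5 above
};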

netlify bot commented Oct 22, 2021

✔️ Deploy Preview for jovial-fermat-aa59dc canceled.

🔨 Explore the source changes: 5c4ac74

🔍 Inspect the deploy log: https://app.netlify.com/sites/jovial-fermat-aa59dc/deploys/61744aaaa833ef000799ae9d

@0xzhang (Collaborator, Author) commented Oct 22, 2021

arch_table = {
    cuda: _ti_core.with_cuda,
    metal: _ti_core.with_metal,
    opengl: _ti_core.with_opengl,
    cc: _ti_core.with_cc,
    vulkan: lambda: _ti_core.with_vulkan(),
    wasm: lambda: True,
    cpu: lambda: True,
}
with_arch = arch_table.get(arch, lambda: False)
try:
    return with_arch()
except Exception as e:
    arch = _ti_core.arch_name(arch)
    _ti_core.warn(
        f"{e.__class__.__name__}: '{e}' occurred when detecting "
        f"{arch}, consider add `export TI_WITH_{arch.upper()}=0` "
        f" to environment variables to depress this warning message.")
    return False

For the current archs, the env variables TI_ENABLE_METAL/TI_ENABLE_OPENGL/TI_ENABLE_CUDA are available. Env variables are not considered in the cc and vulkan implementations; they only use the C++ macros TI_WITH_CC and TI_WITH_VULKAN to decide the return value.

Currently, the error info in __init__.py only works for metal, opengl and cuda. Support for the env variables TI_ENABLE_VULKAN/TI_ENABLE_CC for vulkan and cc may need to be implemented; a sketch follows below.
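
For illustration, here is a minimal sketch (not Taichi's actual code; with_vulkan_sketch and the way TI_WITH_VULKAN is consulted are assumptions) of how a backend availability check could combine the compile-time macro with a runtime TI_ENABLE_VULKAN check:

#include <cstdlib>
#include <cstring>

// Sketch: built-in support is gated by the TI_WITH_VULKAN macro at compile
// time, and can additionally be disabled at runtime via TI_ENABLE_VULKAN=0.
bool with_vulkan_sketch() {
#ifdef TI_WITH_VULKAN
  const char *env = std::getenv("TI_ENABLE_VULKAN");
  if (env != nullptr && std::strcmp(env, "0") == 0) {
    return false;  // built with Vulkan, but disabled via the environment
  }
  return true;  // built with Vulkan and not disabled
#else
  return false;  // built without Vulkan support
#endif
}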

@k-ye (Member) left a comment

Awesome! I only got a few nits

@@ -1004,7 +1004,7 @@ def is_arch_supported(arch):
         arch = _ti_core.arch_name(arch)
         _ti_core.warn(
             f"{e.__class__.__name__}: '{e}' occurred when detecting "
-            f"{arch}, consider add `export TI_WITH_{arch.upper()}=0` "
+            f"{arch}, consider add `TI_ENABLE_{arch.upper()}=0` "
Suggested change
-            f"{arch}, consider add `TI_ENABLE_{arch.upper()}=0` "
+            f"{arch}, consider adding `TI_ENABLE_{arch.upper()}=0` "

@@ -1004,7 +1004,7 @@ def is_arch_supported(arch):
         arch = _ti_core.arch_name(arch)
         _ti_core.warn(
             f"{e.__class__.__name__}: '{e}' occurred when detecting "
-            f"{arch}, consider add `export TI_WITH_{arch.upper()}=0` "
+            f"{arch}, consider add `TI_ENABLE_{arch.upper()}=0` "
             f" to environment variables to depress this warning message.")
// not your fault :-)

Suggested change
-            f" to environment variables to depress this warning message.")
+            f" to environment variables to suppress this warning message.")

  // CUDA versions should >= 10.
  if (version < 10000) {
    cuda_version_valid = false;
    TI_WARN(
Let's show a bit more information here:

TI_WARN("The Taichi CUDA backend requires at least CUDA 10.0, got {}", version);

@@ -119,6 +121,8 @@ class CUDADriver {
   std::unique_ptr<DynamicLoader> loader_;
 
   std::mutex lock_;
 
+  bool cuda_version_valid = true;
@k-ye (Member) commented on Oct 22, 2021:
FYI, our CPP guide is at https://docs.taichi.graphics/lang/articles/contribution/cpp_style

Suggested change
-  bool cuda_version_valid = true;
+  bool cuda_version_valid_{false};

@0xzhang (Collaborator, Author) replied on Oct 22, 2021:

Here I intentionally initialize it with true. 😂 It's tricky, but I think this is a way to solve the problem.

The return value of detected() depends on the CUDA version and the library loading status together.

  • If the library fails to load, cuda_version_valid stays true, but detected() always returns false.
  • If the library loads successfully, cuda_version_valid is set according to the actual status, and the return value depends on cuda_version_valid.

It's all because detected() is first called while the singleton CUDADriver instance is being constructed. 😂

Maybe I could decouple detected() from the CUDADriver constructor; then the value of cuda_version_valid would feel safer.

@k-ye (Member) replied on Oct 23, 2021:

Thanks for the detailed explanation!

The fact that detect() is called during initialization makes things a bit tricky -- Its value is supposed to be constant throughout the lifetime of CUDADriver. To improve the situation a bit, we can do this:

  1. Cache the result of TI_ENABLE_CUDA in another bool member
  2. Don't call detect() inside the ctor.

So the pseudo code can be something like this:

class CUDADriver {
 public:
  CUDADriver() {
    disabled_by_env_ = (get_environ_config("TI_ENABLE_CUDA", 1) == 0);
    if (disabled_by_env_) {
      // return directly
    }
    loader_ = ...;
    if (!loader_->loaded()) {
      // return directly
    }
    // Read version from the loader
    if (version < 10000) {
      cuda_version_valid_ = false;
      // return directly
    }
    cuda_version_valid_ = true;
  }

  bool detected() {
    return !disabled_by_env_ && cuda_version_valid_ && loader_->loaded();
  }

 private:
  bool disabled_by_env_{false};
  bool cuda_version_valid_{false};
};

@0xzhang (Collaborator, Author) replied on Oct 23, 2021:

Thanks very much!!
I changed the implementation.

@k-ye (Member) left a comment:

LGTM, thanks!

@k-ye merged commit 7787245 into taichi-dev:master on Oct 24, 2021
Successfully merging this pull request may close these issues.

Error info about export TI_WITH_CUDA=0 maybe inappropriate.