Fix cquery-ing with cuda targets #209

Merged 8 commits on Jan 25, 2024

Conversation

mvukov
Contributor

@mvukov mvukov commented Jan 3, 2024

This makes it possible to do the following:

cd examples
bazel cquery //if_cuda/... --@rules_cuda//cuda:enable=False

when one wants to run cquery (e.g. within bazel-diff) in an env without cuda (e.g. a CI).

I read the README of the if_cuda example. It states that it's OK for bazel to error out if one wants to build a cuda target when rules_cuda is disabled. I'd say that's a drastic measure. IMO folks should mark their bazel targets as I did in this PR.

Collaborator

@cloudhan cloudhan left a comment


Otherwise, looks good to me.

cuda/private/toolchain_configs/dummy.bzl Outdated
@cloudhan
Collaborator

cloudhan commented Jan 4, 2024

It somehow doesn't work with bzlmod:

> bazel cquery //if_cuda/... --@rules_cuda//cuda:enable=False --enable_bzlmod
ERROR: no such package '@@[unknown repo 'platforms' requested from @@]//': The repository '@@[unknown repo 'platforms' requested from @@]' could not be resolved: No repository visible as '@platforms' from main repository
ERROR: Analysis of target '//if_cuda:kernel' failed; build aborted: no such package '@@[unknown repo 'platforms' requested from @@]//': The repository '@@[unknown repo 'platforms' requested from @@]' could not be resolved: No repository visible as '@platforms' from main repository
ERROR: Build did NOT complete successfully

@mvukov
Contributor Author

mvukov commented Jan 4, 2024

  • Fixed bzlmod for the disabled toolchain.
  • Updated docs

Also, handled the case when cuda is enabled but the cuda toolkit cannot be found. The previous behavior was that bazel build would fail because some targets are not present in @local_cuda//*, which is true because @local_cuda//:BUILD was empty. This was not obvious or ergonomic unless you dug into the code and found out that the BUILD file was empty. I refactored the code such that when we want to build a cuda target, e.g.

bazel build //if_cuda:kernel --@rules_cuda//cuda:enable=true

but there is no cuda toolkit available we get

ERROR: Target //if_cuda:kernel is incompatible and cannot be built, but was explicitly requested.
Dependency chain:
    //if_cuda:kernel (9867ef)   <-- target platform (@local_config_platform//:host) didn't satisfy constraint @platforms//:incompatible
FAILED: Build did NOT complete successfully (37 packages loaded, 138 targets configured)

This is IMO more readable, with the addition of the docs for the requires_cuda helper.

Thinking a bit further, it makes more sense to use globally @rules_cuda//cuda:is_enabled_and_cuda_found instead of @rules_cuda//cuda:is_enabled. But I'd like to hear your opinion on this. I'm open to a different name for is_enabled_and_cuda_found :) Maybe we can just rename is_enabled -> _is_enabled and is_enabled_and_cuda_found -> is_enabled?

@mvukov
Contributor Author

mvukov commented Jan 4, 2024

@cloudhan PTAL.

@@ -0,0 +1,16 @@
"""private defs"""

def requires_cuda():
Contributor Author

Taken and adapted from https://github.com/tensorflow/runtime, there used to be rules_cuda repo in there.
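
For readers unfamiliar with the pattern, here is a minimal sketch of how such a helper can be written in a user's own repository. It is not the code from this PR: the file name and the function name `requires_cuda_sketch` are hypothetical, and it assumes the `@rules_cuda//cuda:is_enabled` config setting mentioned above and the `@platforms` repository are visible to the caller. `@platforms//:incompatible` is the standard constraint that no platform satisfies.

```starlark
# my_repo/cuda_helpers.bzl (hypothetical file, for illustration only)

def requires_cuda_sketch():
    """Returns a value suitable for the target_compatible_with attribute.

    The target is compatible when the rules are enabled. Otherwise Bazel
    treats it as incompatible and skips it in wildcard patterns such as //... .
    """
    return select({
        # Assumed label: the config_setting discussed in this thread.
        "@rules_cuda//cuda:is_enabled": [],
        # No platform provides this constraint, so the target is skipped.
        "//conditions:default": ["@platforms//:incompatible"],
    })
```

A target would then set `target_compatible_with = requires_cuda_sketch()`. An explicitly requested incompatible target still fails to build, which matches the error messages quoted in this conversation.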


cuda_library(
    name = "kernel",
    srcs = ["kernel.cu"],
    hdrs = ["kernel.h"],
    target_compatible_with = requires_cuda(),
Contributor Author

Just a side note: users could still get not-so-obvious errors from bazel if the target compatibility is not set.

cuda/private/defs.bzl Outdated
@cloudhan
Collaborator

cloudhan commented Jan 4, 2024

Thinking a bit further, it makes more sense to use globally @rules_cuda//cuda:is_enabled_and_cuda_found instead of @rules_cuda//cuda:is_enabled. But I'd like to hear your opinion on this.

I'd prefer not to add it, because to me it mixes up the target and the host. The enable flag generally means that the bazel build invocation generates targets that are able to run cuda code (so you enable it). The toolchain part, however, should be regarded as whether the host where bazel runs is able to produce that target.

In the case of the toolchain being misconfigured or not configured, it is a host_compatible_with thing (the attribute does not exist!).

@jsharpe Any comment?

@mvukov
Contributor Author

mvukov commented Jan 4, 2024

I looked into how "@platforms//:incompatible" is defined and made a change in d72a951.

bazel build //if_cuda:kernel   --@rules_cuda//cuda:enable=false
ERROR: Target //if_cuda:kernel is incompatible and cannot be built, but was explicitly requested.
Dependency chain:
    //if_cuda:kernel (6b3a99)   <-- target platform (@local_config_platform//:host) didn't satisfy constraints [@rules_cuda//cuda:cuda_must_be_enabled, @rules_cuda//cuda:cuda_must_be_found]
FAILED: Build did NOT complete successfully (35 packages loaded, 139 targets configured)

The second constraint is a bit misleading because we didn't even look for it. Open to suggestions.

bazel build //if_cuda:kernel   --@rules_cuda//cuda:enable=true
ERROR: Target //if_cuda:kernel is incompatible and cannot be built, but was explicitly requested.
Dependency chain:
    //if_cuda:kernel (9867ef)   <-- target platform (@local_config_platform//:host) didn't satisfy constraint @rules_cuda//cuda:cuda_must_be_found
FAILED: Build did NOT complete successfully (0 packages loaded, 137 targets configured)

For reference: ⬆️ is the case where cuda is enabled but the cuda toolkit is not found.
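
One way to picture the change in d72a951, as a sketch under assumptions rather than the actual diff: instead of returning `@platforms//:incompatible`, the helper returns a dedicated constraint value that no platform provides, so the error message names the unmet requirement. The labels are taken from the error output above. The helper name `requires_enabled_sketch` is hypothetical.

```starlark
# Hypothetical sketch; not the code from d72a951.

def requires_enabled_sketch():
    """Marks a target incompatible using a self-describing constraint."""
    return select({
        # Assumed label: the config_setting discussed earlier in this thread.
        "@rules_cuda//cuda:is_enabled": [],
        # A constraint_value that no platform lists makes the target
        # incompatible, and its label is what the error message reports.
        "//conditions:default": ["@rules_cuda//cuda:cuda_must_be_enabled"],
    })
```

Because select() values over lists can be concatenated with `+`, two such helpers can be combined into a single `target_compatible_with` value.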

@cloudhan PTAL.

@mvukov
Contributor Author

mvukov commented Jan 16, 2024

@cloudhan friendly reminder: PTAL. I implemented changes you requested.

@cloudhan
Collaborator

Sorry for the delay, was dealing with the CI update for Bazel 7 support.

@cloudhan
Collaborator

Let me play around with it to see if we can simplify it a little bit.

@cloudhan
Collaborator

cloudhan commented Jan 17, 2024

To be honest, what is dragging me down in these changes is the latter maneuver that tries to hide error info

... No matching toolchains found for types @rules_cuda//cuda:toolchain_type.
To debug, rerun with --toolchain_resolution_debug='@rules_cuda//cuda:toolchain_type'

It doesn't change the situation (that the toolchain is not found), but adds another layer of indirection.

IIRC the initial change fixed the cquery-ing error and looked good...

@mvukov
Contributor Author

mvukov commented Jan 17, 2024

I don't know what to do about your last comment. Can you propose an alternative? As-is, folks can't use cquery when there is no cuda toolchain present. This PR fixes that situation and eventually gives a more user-friendly failure message IMO.

@mvukov
Contributor Author

mvukov commented Jan 17, 2024

Are you unhappy with the requires_cuda() macro logic, while all the rest seems OK? If so, I can remove that macro and keep it in my monorepo.

@cloudhan
Collaborator

The _requires_cuda_found() part (i.e., the constraint that the cuda toolkit is found). It hides the unresolved toolchain error behind a custom constraint. Thus the builtin No matching toolchains found for types @rules_cuda//cuda:toolchain_type error is replaced with didn't satisfy constraints ... @rules_cuda//cuda:cuda_must_be_found

Better to split it into a separate PR? I might ask some bazel toolchain experts what the best practice is...

cuda/BUILD.bazel Outdated
Comment on lines 94 to 107

constraint_setting(name = "cuda_must_be_enabled_setting")

constraint_value(
    name = "cuda_must_be_enabled",
    constraint_setting = ":cuda_must_be_enabled_setting",
)

constraint_setting(name = "cuda_must_be_found_setting")

constraint_value(
    name = "cuda_must_be_found",
    constraint_setting = ":cuda_must_be_found_setting",
)
Collaborator

If you split this PR, please also address the naming.

In the cuda namespace (//cuda:), you don't repeat cuda; then you can see that //cuda:must_be_enabled and //cuda:must_be_found are extremely confusing. I'd suggest renaming these to rules_are_enabled... and valid_toolchain_is_configured or something like that.

Contributor Author

Fixed naming.

Comment on lines 18 to 19
* CUDA is enabled and
* CUDA toolchain is found.
Collaborator

    * rules are enabled and
    * toolchain is found.

Contributor Author

Fixed.

the conditions are not satisfied. Incompatible targets are excluded
from bazel target wildcards and fail to build if requested explicitly.
"""
return _requires_is_enabled() + _requires_cuda_found()
Collaborator

And here.

Contributor Author

Fixed.

@mvukov
Contributor Author

mvukov commented Jan 19, 2024

The _requires_cuda_found() part (i.e., the constraint that the cuda toolkit is found). It hides the unresolved toolchain error behind a custom constraint. Thus the builtin No matching toolchains found for types @rules_cuda//cuda:toolchain_type error is replaced with didn't satisfy constraints ... @rules_cuda//cuda:cuda_must_be_found

So, with this PR, the unresolved toolchain error is gone, because a dummy toolchain is actually registered. If we remove target_compatible_with = requires_cuda(), from //if_cuda:kernel, we now get:

bazel build //if_cuda:kernel 
Starting local Bazel server and connecting to it...
ERROR: /home/milan/.cache/bazel/_bazel_milan/d4ba7c041b1396ebf4926b6ac8cc612b/external/rules_cuda/cuda/BUILD.bazel:90:11: every rule of type label_flag implicitly depends upon the target '@local_cuda//:cuda_runtime', but this target could not be found because of: no such target '@local_cuda//:cuda_runtime': target 'cuda_runtime' not declared in package '' defined by /home/milan/.cache/bazel/_bazel_milan/d4ba7c041b1396ebf4926b6ac8cc612b/external/local_cuda/BUILD (Tip: use `query "@local_cuda//:*"` to see all the targets in that package)
ERROR: Analysis of target '//if_cuda:kernel' failed; build aborted: 
FAILED: Build did NOT complete successfully (40 packages loaded, 139 targets configured)

if the cuda toolchain is not found. Moreover, cquery still doesn't work:

bazel cquery //if_cuda:*
ERROR: /home/milan/.cache/bazel/_bazel_milan/d4ba7c041b1396ebf4926b6ac8cc612b/external/rules_cuda/cuda/BUILD.bazel:90:11: every rule of type label_flag implicitly depends upon the target '@local_cuda//:cuda_runtime', but this target could not be found because of: no such target '@local_cuda//:cuda_runtime': target 'cuda_runtime' not declared in package '' defined by /home/milan/.cache/bazel/_bazel_milan/d4ba7c041b1396ebf4926b6ac8cc612b/external/local_cuda/BUILD (Tip: use `query "@local_cuda//:*"` to see all the targets in that package)
ERROR: Analysis of target '//if_cuda:kernel' failed; build aborted: 
FAILED: Build did NOT complete successfully (0 packages loaded, 2 targets configured)

So, for cquery to work, one must use target_compatible_with (IMO, this is good practice).
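
As an illustration of that practice, the fragment below marks a cuda target and shows a dependent target. It is a sketch, not the file from the repository: it assumes `requires_cuda` is exported from `@rules_cuda//cuda:defs.bzl` and that `cc_binary` is available without an explicit load, and the `main` target and its source file are hypothetical.

```starlark
# BUILD.bazel (illustrative fragment, not the file from the repo)
load("@rules_cuda//cuda:defs.bzl", "cuda_library", "requires_cuda")

cuda_library(
    name = "kernel",
    srcs = ["kernel.cu"],
    hdrs = ["kernel.h"],
    # Skipped by wildcard patterns such as //if_cuda/... when the
    # conditions checked by requires_cuda() are not met.
    target_compatible_with = requires_cuda(),
)

cc_binary(
    name = "main",  # hypothetical dependent target
    srcs = ["main.cpp"],
    # No explicit annotation is needed here: a target that depends on an
    # incompatible target is itself treated as incompatible.
    deps = [":kernel"],
)
```

With this in place, a wildcard cquery over the package skips both targets instead of failing during analysis.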

I updated this PR so that cquery works and we get, IMO, more dev-friendly errors than "@local_cuda//:cuda_runtime is missing".

WDYT? Let's first sort this out and then I can address your remarks.

@cloudhan
Collaborator

cloudhan commented Jan 19, 2024

OK, I see your point now. The situation is caused by the ":disabled-local-toolchain" relying on target_settings = [":cuda_is_disabled"]. I tried to make it always available as a fallback case, but then the toolchain will ask for a compiler_executable from the toolchain config.

The compiler_executable is a string (mirrored from CcToolchainInfo), whereas it should have been a Label so that we could build a dummy one from dummy.cc. I am not sure if this is possible, though...

Let's adopt the current change to fix the issue. Please address the naming, though.

@mvukov
Contributor Author

mvukov commented Jan 25, 2024

@cloudhan PTAL, I addressed your comments.

@cloudhan cloudhan merged commit 3f24292 into bazel-contrib:main Jan 25, 2024
14 checks passed
anakinxc referenced this pull request in secretflow/spu Aug 14, 2024
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [rules_cuda](https://github.com/bazel-contrib/rules_cuda) | http_archive | patch | `v0.2.1` -> `v0.2.2` |

---

### Release Notes

<details>
<summary>bazel-contrib/rules_cuda (rules_cuda)</summary>

### [`v0.2.2`](https://github.com/bazel-contrib/rules_cuda/releases/tag/v0.2.2)

[Compare Source](https://github.com/bazel-contrib/rules_cuda/compare/v0.2.1...v0.2.2)

#### `WORKSPACE` code

```starlark
load("@&#8203;bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")
http_archive(
    name = "rules_cuda",
    sha256 = "b066750579f33e93e9dc55b8ee2067b525d863c1ddcf09b47a6332c39f0701fb",
    strip_prefix = "rules_cuda-v0.2.2",
    urls = ["https://github.com/bazel-contrib/rules_cuda/releases/download/v0.2.2/rules_cuda-v0.2.2.tar.gz"],
)

load("@&#8203;rules_cuda//cuda:repositories.bzl", "register_detected_cuda_toolchains", "rules_cuda_dependencies")
rules_cuda_dependencies()
register_detected_cuda_toolchains()
```

#### What's Changed

- Filter attrs properly for cuda_test by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/180](https://github.com/bazel-contrib/rules_cuda/pull/180)
- Fix cuda_test by [@hofbi](https://github.com/hofbi) in
[https://github.com/bazel-contrib/rules_cuda/pull/181](https://github.com/bazel-contrib/rules_cuda/pull/181)
- Add v0.2.1 to docs by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/184](https://github.com/bazel-contrib/rules_cuda/pull/184)
- Pass through `--sysroot` to host compiler by
[@lalten](https://github.com/lalten) in
[https://github.com/bazel-contrib/rules_cuda/pull/185](https://github.com/bazel-contrib/rules_cuda/pull/185)
- Add cuda_binary macro by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/186](https://github.com/bazel-contrib/rules_cuda/pull/186)
- Move non glob files out of glob by
[@hofbi](https://github.com/hofbi) in
[https://github.com/bazel-contrib/rules_cuda/pull/192](https://github.com/bazel-contrib/rules_cuda/pull/192)
- Disallow autoupdate nccl by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/193](https://github.com/bazel-contrib/rules_cuda/pull/193)
- eliminate cpu architecture constraint for clang by
[@dmellosanjay](https://github.com/dmellosanjay) in
[https://github.com/bazel-contrib/rules_cuda/pull/208](https://github.com/bazel-contrib/rules_cuda/pull/208)
- Check nvcc version before adding `--dopt on` flags by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/212](https://github.com/bazel-contrib/rules_cuda/pull/212)
- Add alwayslink to cuda_binary and cuda_test macros by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/210](https://github.com/bazel-contrib/rules_cuda/pull/210)
- Add additional tests for LTS releases by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/215](https://github.com/bazel-contrib/rules_cuda/pull/215)
- Fix cquery-ing with cuda targets by
[@mvukov](https://github.com/mvukov) in
[https://github.com/bazel-contrib/rules_cuda/pull/209](https://github.com/bazel-contrib/rules_cuda/pull/209)
- Propose new solution for know issue (nvcc filesystem race condition)
by [@hofbi](https://github.com/hofbi) in
[https://github.com/bazel-contrib/rules_cuda/pull/216](https://github.com/bazel-contrib/rules_cuda/pull/216)
- Fix a typo in `if_cuda` doc by
[@rygx](https://github.com/rygx) in
[https://github.com/bazel-contrib/rules_cuda/pull/222](https://github.com/bazel-contrib/rules_cuda/pull/222)
- Ignore MODULE.bazel.lock file by
[@rygx](https://github.com/rygx) in
[https://github.com/bazel-contrib/rules_cuda/pull/224](https://github.com/bazel-contrib/rules_cuda/pull/224)
- ci: avoid nvcc /tmp race condition by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/232](https://github.com/bazel-contrib/rules_cuda/pull/232)
- ci: disable doc test workflow cache to avoid excessive space wasting
by [@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/231](https://github.com/bazel-contrib/rules_cuda/pull/231)
- feat: Add features for compiling with -arch=all or -arch=all-major by
[@jsharpe](https://github.com/jsharpe) in
[https://github.com/bazel-contrib/rules_cuda/pull/245](https://github.com/bazel-contrib/rules_cuda/pull/245)
- Fix spelling by [@Vertexwahn](https://github.com/Vertexwahn)
in
[https://github.com/bazel-contrib/rules_cuda/pull/250](https://github.com/bazel-contrib/rules_cuda/pull/250)
- Change example rules_cuda version by
[@Vertexwahn](https://github.com/Vertexwahn) in
[https://github.com/bazel-contrib/rules_cuda/pull/249](https://github.com/bazel-contrib/rules_cuda/pull/249)
- Document how to use rules_cuda with Bzlmod by
[@Vertexwahn](https://github.com/Vertexwahn) in
[https://github.com/bazel-contrib/rules_cuda/pull/252](https://github.com/bazel-contrib/rules_cuda/pull/252)
- Do not assume libcupti.so location by
[@tyb0807](https://github.com/tyb0807) in
[https://github.com/bazel-contrib/rules_cuda/pull/253](https://github.com/bazel-contrib/rules_cuda/pull/253)
- ci: use absolute path for XDG_CACHE_HOME as github actions and bazel
doesn't resolve `~` automatically by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/260](https://github.com/bazel-contrib/rules_cuda/pull/260)
- ci: cover major bazel releases in utilities tests by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/262](https://github.com/bazel-contrib/rules_cuda/pull/262)
- test: workaround label resolving with bzlmod by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/263](https://github.com/bazel-contrib/rules_cuda/pull/263)
- fix(bzlmod): allow both root module and our module to call
cuda.local_toolchain by
[@cloudhan](https://github.com/cloudhan) in
[https://github.com/bazel-contrib/rules_cuda/pull/264](https://github.com/bazel-contrib/rules_cuda/pull/264)

#### New Contributors

- [@dmellosanjay](https://github.com/dmellosanjay) made their
first contribution in
[https://github.com/bazel-contrib/rules_cuda/pull/208](https://github.com/bazel-contrib/rules_cuda/pull/208)
- [@mvukov](https://github.com/mvukov) made their first
contribution in
[https://github.com/bazel-contrib/rules_cuda/pull/209](https://github.com/bazel-contrib/rules_cuda/pull/209)
- [@rygx](https://github.com/rygx) made their first
contribution in
[https://github.com/bazel-contrib/rules_cuda/pull/222](https://github.com/bazel-contrib/rules_cuda/pull/222)
- [@Vertexwahn](https://github.com/Vertexwahn) made their first
contribution in
[https://github.com/bazel-contrib/rules_cuda/pull/250](https://github.com/bazel-contrib/rules_cuda/pull/250)
- [@tyb0807](https://github.com/tyb0807) made their first
contribution in
[https://github.com/bazel-contrib/rules_cuda/pull/253](https://github.com/bazel-contrib/rules_cuda/pull/253)

**Full Changelog**:
bazel-contrib/rules_cuda@v0.2.1...v0.2.2

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR was generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View the
[repository job log](https://developer.mend.io/github/secretflow/spu).


Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>