[ITensors] Fix broken broadcast operation on GPU #1532
Hi @emstoudenmire and @abouco, I was able to make a minimal example of the issue. The code that is causing the issue is here: Line 417 in 1ef12d6
@kmp5VT I remember we previously faced issues with the GPU compiler struggling with certain ITensor broadcast operations that were written in terms of anonymous functions/closures, for example see the changes to … It seems like you were able to do pretty modest rewrites of some of the ITensor broadcast functions to circumvent those issues, but I don't remember the details. Perhaps something similar can be done here. One of my suggestions at the time of #1194 was that you can always rewrite a closure as a callable struct (see the Julia docs on closures: https://docs.julialang.org/en/v1/devdocs/functions/#Closures), which may be easier for the GPU compiler to compile, if there isn't a simpler way to refactor the code to help the GPU compiler.
@mtfishman Thank you for the suggestion, I know we had made a PR to fix broadcasts before and it was helpful to have that as a reference. I tried the pattern we used in #1194 but the GPU compiler still complains that the values |
I don't think we should split it into two lines because that is less performant (I expect what you wrote is roughly 2 times slower than the original version), so you will have to find another approach, whether that is making a struct that implements the closure or something else.
Worst case, we can have two code branches, one for CPU that leaves the code as it is and one for GPU that splits the broadcast call into two calls, but hopefully there is a way to rewrite it so that it involves just one broadcast call and works on CPU and GPU.
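To make the "struct that implements the closure" idea concrete, here is a minimal sketch of how the update `T .= β .* T .+ α .* A` could be expressed through a callable struct while still using a single fused broadcast. The names `ScaleAdd` and `scale_add!` are hypothetical and chosen only for illustration; they are not part of ITensors or NDTensors, and whether this form avoids the compiler error on a given GPU backend would still need to be checked.

```julia
# Hypothetical callable struct standing in for the closure
# `(t, a) -> β * t + α * a`. The coefficients are stored as typed fields.
struct ScaleAdd{Tα,Tβ}
  α::Tα
  β::Tβ
end

# Calling the struct on one pair of elements computes β * t + α * a.
(f::ScaleAdd)(t, a) = f.β * t + f.α * a

# In-place update written as a single broadcast over T and A,
# so the array is still traversed only once.
function scale_add!(T, A, α, β)
  f = ScaleAdd(α, β)
  T .= f.(T, A)
  return T
end

T = randn(10)
A = randn(10)
expected = 3.0 .* T .+ 2.0 .* A  # reference result, computed before T is overwritten
scale_add!(T, A, 2.0, 3.0)
```

Because the coefficients live in concretely typed fields, the function object is a plain bits type whenever `α` and `β` are, which is the property GPU kernels generally require of their arguments.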
Here is a demonstration that your new version will be slower (by roughly a factor of 1.5 in this case):

    using BenchmarkTools: @btime

    # Fused version: one broadcast loop over T and A.
    function f1!(T, A, α, β)
      T .= β .* T .+ α .* A
      return T
    end

    # Split version: two separate broadcast loops.
    function f2!(T, A, α, β)
      T .= β .* T
      T .+= α .* A
      return T
    end

    T = randn(10^8)
    A = randn(10^8)
    α = 2.0
    β = 3.0
    T1 = copy(T)
    @btime f1!($T1, $A, $α, $β)
    T2 = copy(T)
    @btime f2!($T2, $A, $α, $β)

which outputs:

    27.130 ms (0 allocations: 0 bytes)
    42.336 ms (0 allocations: 0 bytes)

It doesn't use more memory since it reuses the existing allocated memory, but it is slower since it isn't fusing all of the operations into a single loop, so T is traversed twice instead of once.
@mtfishman This makes sense as every element in |
@kmp5VT it looks like the test failures are caused by Julia 1.11, which was just released; our tests automatically started testing against it. Let's fix those issues in a separate PR from this one, and this and other open PRs can be based on those fixes once they are merged. Julia 1.10 became the LTS (https://discourse.julialang.org/t/julia-v1-11-0-has-been-released-and-v1-10-is-now-lts/121064), so we can also change that to the oldest Julia version we support and just test against Julia 1.10 and 1.11, but that should probably be done in a PR of its own since it is something we will want to announce to users (I don't think it needs a breaking release of ITensor packages to change that, but I have to think about that...).
Oh I see, I was running on my Mac with Metal and see that the tests are passing. I am making an issue as a reminder to update our tests.
Okay I can try to do this as well in another PR. I can also open an issue as a reminder.
This is very useful information, thanks! |
@mtfishman yes, the original issue is fixed now. It looks like there is a test here in the ITensor test suite, but the issue was related to using GPUs with ITensor. Currently we are using GPUs in the ITensors suite so I am not really sure where to put a test. Do you have a suggestion?
Thanks for the reminder that we're not running the
Co-authored-by: Matt Fishman <mtfishman@users.noreply.github.com>
Note that the original issue may be related to more general issues that closures can have in Julia:
i.e. closures can have some type stability issues in certain cases, and certain GPU backends may have trouble compiling functions with type instabilities. Just pointing that out to potentially give some explanation and broader context for the original issue. Using a |
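As a general illustration of the kind of closure type instability mentioned here (not a reproduction of the original ITensors issue), the sketch below shows a captured variable that is assigned more than once, which Julia stores in a `Core.Box`, next to a variant that rebinds the variable in a `let` block before capturing it. The function names `scale_boxed` and `scale_let` are hypothetical.

```julia
# `c` is assigned twice and then captured by the closure, so Julia stores
# it in a `Core.Box` and the closure body is no longer type stable.
function scale_boxed(xs, flip)
  c = 1.0
  if flip
    c = -c
  end
  return map(x -> c * x, xs)
end

# Rebinding `c` in a `let` block gives the closure a fresh variable that
# is assigned exactly once, which avoids the box.
function scale_let(xs, flip)
  c = 1.0
  if flip
    c = -c
  end
  return let c = c
    map(x -> c * x, xs)
  end
end

xs = [1.0, 2.0, 3.0]
boxed = scale_boxed(xs, true)
unboxed = scale_let(xs, true)
```

Both functions return the same values; the difference shows up when inspecting them with `@code_warntype scale_boxed(xs, true)`, where the captured variable appears as a `Core.Box`.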
Description
Aleix has found an issue that arises when using VectorInterface with GPU-based ITensors. The problem is that the broadcast call
a .= a .* l .+ b .* k
generates an anonymous function that cannot be compiled on GPU. One way I could see to fix this is to call
permute_dims!!
Alternatively, I could split the function into two broadcasts. Right now I have a patch in VectorInterface. Below is pseudocode that can be added to ITensors and tested using JLArrays once we have that integrated.
Simple Example
Checklist:
- broadcast function in the ITensorVectorInterfaceExt library, which can be tested later with JLArrays