Port storage handling + wrapper materialization from CUDA.jl. #468
Conversation
I've opened a PR that migrates AMDGPU to GPUArrays' storage handling: JuliaGPU/AMDGPU.jl#416. There's one suggestion.

pxl-th@Leleka:~/.julia/dev/AMDGPU$ JULIA_DEBUG=AMDGPU amdjl
julia> using AMDGPU
julia> x = ROCArray(ones(Float32, 16));
┌ Debug: Allocating 64 bytes from ROCMemoryPool @ 0x00000000025531a0: Segment global, Flags (coarsegrained), Size 11.984 GiB (11.984 GiB max allocation), Runtime Alloc: true (4.000 KiB granularity, 4.000 KiB alignment), All Accessible: false
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:260
┌ Debug: Allocation phase 1: HSA_STATUS_SUCCESS
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:295
julia> y = reshape(x, 4, 4);
julia> AMDGPU.unsafe_free!(x)
julia> AMDGPU.unsafe_free!(y)
┌ Debug: Freed 64 bytes @ Ptr{Nothing} @0x00007fd820200000 ptr from pool
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:366

Currently, AMDGPU has this implemented.
That's on purpose. It would be very dangerous if

If
Indeed. However, if some underlying function (from a library you do not control) derives from your input argument, you are left at the mercy of the GC. Hence, AMDGPU began to free immediately.

function whatever(a)
b = view(a, :)
# use b
return something
end
a = ...
something = whatever(a)
synchronize()
unsafe_free!(a)
For example, this will not free

julia> x = AMDGPU.rand(Float32, 16, 2);
julia> m = mean(x; dims=1);
julia> y = x .- m;
julia> AMDGPU.unsafe_free!(m)
Well, the previous situation was unsafe, where you could derive a
Yeah, invalidly so, since there was still a reachable

Finally, all this shouldn't really cause an OOM if AMDGPU.jl calls
It does. AMDGPU has an allocation/retry mechanism.
Here's where that happens: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/5e08dcc5159325dfef5a48c2668b06fd6fef0469/src/core/runtime/amd_gpu_agent.cpp#L1329

Performing allocations only on the Julia side does not lead to crashes:

for i in 1:1_000_000
    ROCArray{Float32}(undef, 1024 * 1024)
end

But if there are kernel launches that use a non-zero amount of scratch memory, then at some point an OOM will happen on the scratch-allocation side and not on the Julia side.
IMO, users should be aware if they are still using derived arrays before calling
Well, then AMDGPU.jl should check before launching kernels that sufficient memory is available. Or something like that; I'm playing armchair expert here. I agree that the whole interaction between GPU allocations and the GC is suboptimal, especially in the context of external allocations, but that's hardly an argument to make the proposed API far too eager for most use cases (and IMO just wrong).
I disagree. Users should not have to think about that.
Again, I'm sympathetic to the view, but I think it would make

@jpsamaroo Would it be possible to avoid these crashes on external allocations differently? Can you cheaply check for the free memory? At some point, in CUDA.jl we kept track of allocated memory so that we could check & free cheaply (in our case, checked on every allocation).
But they do have to think about it when performing operations on derived arrays that modify all instances sharing the underlying data.
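The hazard described here can be illustrated with plain Base semantics, which GPU array wrappers mirror (a minimal CPU-side sketch, not GPU code):

```julia
a = ones(Float32, 4)
v = reshape(a, 2, 2)  # derived array sharing a's underlying data
v[1] = 0f0            # writing through the wrapper...
a[1] == 0f0           # ...is visible in every array sharing that data (true)
```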
HSA only added an API for checking the amount of free memory in ROCm 5.3+ (and it also accounts for currently allocated scratch memory). Alternatively, AMDGPU also keeps a count of all allocated memory, but that count will be optimistic.
But those operations can be read-only, in which case you can safely

I wouldn't be opposed to adding, say, a keyword argument to
Having a way to force it via a flag would be good.
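The flag discussed here might look as follows. This is purely a hypothetical sketch of the proposed API shape: the `force` keyword and both helper functions are illustrations, not actual AMDGPU.jl or GPUArrays.jl API.

```julia
# Hypothetical sketch only: neither `force` nor these helpers exist as such.
function unsafe_free!(x::ROCArray; force::Bool=false)
    if force
        free_storage!(x)      # hypothetical: release the buffer immediately,
                              # even if derived arrays still reference it
    else
        release_reference!(x) # hypothetical: free only once the refcount drops to zero
    end
    return
end
```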
I think we can go as-is with this PR. And this PR removes quite a few lines, so it's good to have :)

Updated, and updated the back-ends. @pxl-th Anything that needs to happen here to make AMDGPU.jl work?

I'm almost sure there isn't, but I'll double-check tomorrow.

I'll merge and tag already so that I can move forward, but do let me know if there are any issues.

All works fine, thanks!

Tagged as 9.0!
This PR ports 3 pieces of functionality from CUDA.jl:

- reference-counted storage handling (ArrayStorage in CUDA.jl) so that we can create array objects that share data;

Motivation

The goal of these changes is to make CUDA.jl's array wrapper optimization available for all back-ends. It consists of avoiding common array wrappers, as they complicate dispatch (necessitating complex Union-typed arguments). It's not great that we have to do this, but sadly that's the reality with Julia's array types; extending every method to use a Dense array alias that covers contiguous views/reshapes/reinterprets is just not practical, leading to ambiguities and slow method insertion times. Essentially, the following Union will now just be represented by the array type itself:
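The code block that followed this sentence appears to have been lost in extraction. The Union in question is presumably of the following shape, modeled on CUDA.jl's `DenseCuArray` alias (a hedged sketch, not the PR's exact code):

```julia
# Sketch: the kind of Union back-ends previously had to dispatch on to
# accept contiguous wrappers. With wrapper materialization, view/reshape/
# reinterpret of a contiguous CuArray return a CuArray again, so the plain
# array type suffices.
Union{
    CuArray{T,N},
    SubArray{T,N,<:CuArray},                 # contiguous views
    Base.ReshapedArray{T,N,<:CuArray},
    Base.ReinterpretArray{T,N,S,<:CuArray} where S,
} where {T,N}
```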
API changes
To make this possible, array back-ends will need to wrap their data in a DataRef and call GPUArrays.unsafe_free! instead of the actual finalizer, which is now registered with the DataRef. To get a hold of the underlying data, call getindex, i.e., managed_buf[]. To create a new reference that shares the underlying data, call copy(managed_buf).
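A code example originally accompanied this description. As a hedged sketch of the described API (`DataRef`, `getindex`, `copy`, and `unsafe_free!` come from the PR; `allocate_buffer`, `free_buffer`, and `bytesize` are placeholder back-end names):

```julia
using GPUArrays

buf = allocate_buffer(bytesize)                   # placeholder back-end allocation
managed_buf = GPUArrays.DataRef(free_buffer, buf) # finalizer is registered with the DataRef

managed_buf[]             # getindex: access the underlying data
ref2 = copy(managed_buf)  # new reference sharing the same data

GPUArrays.unsafe_free!(managed_buf)  # decrements the reference count
GPUArrays.unsafe_free!(ref2)         # last reference gone: free_buffer(buf) runs
```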
The second big part is a new interface method, GPUArrays.derive(::Type{T}, N::Int, a::AbstractGPUArray, osize::Dims, offset::Int). This method is used by the view/reshape/reinterpret functionality to create a derived array, so it will need to call copy(::DataRef). Also note the new offset parameter, which will need to be added to the array structure, keeping track of the offset (in number of elements) for the purpose of implementing views. A simple implementation of this method could be:

Fixes JuliaGPU/Metal.jl#149
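The simple implementation that was shown at this point was lost in extraction. A hedged reconstruction for a hypothetical back-end array type, following the signature given above (the field names `data` and `offset` and the inner constructor are assumptions):

```julia
# Hedged sketch of a back-end's `derive`: field names and constructor are assumptions.
function GPUArrays.derive(::Type{T}, N::Int, a::ROCArray, osize::Dims, offset::Int) where T
    ref = copy(a.data)  # share (and refcount) the underlying storage
    # convert the parent's offset into elements of the new element type T
    offset = (a.offset * Base.elsize(a)) ÷ sizeof(T) + offset
    ROCArray{T,N}(ref, osize; offset)
end
```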
cc @jpsamaroo