
Port storage handling + wrapper materialization from CUDA.jl. #468

Merged: 7 commits, Sep 1, 2023

Conversation

@maleadt (Member) commented Apr 11, 2023

This PR ports 3 pieces of functionality from CUDA.jl:

  • a buffer refcount type (ArrayStorage in CUDA.jl) so that we can create array objects that share data;
  • using that, functionality to represent simple views/reshapes/reinterprets without using Base wrappers;
  • and, unrelatedly, reusable functionality for reporting on unsupported element types (see the sketch below).
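
For illustration, the element-type check could look something like the following minimal sketch. It relies on Base.allocatedinline; the name check_eltype is hypothetical, and the actual GPUArrays entry point may differ:

function check_eltype(T::Type)
    # GPU buffers can only hold data that Julia stores inline
    Base.allocatedinline(T) || throw(ArgumentError(
        "GPU arrays only support element types that are stored inline; got $T"))
    return
end

check_eltype(Float32)          # passes
check_eltype(Vector{Float32})  # throws ArgumentError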

Motivation

The goal of these changes is to make CUDA.jl's array-wrapper optimization available to all back-ends. The optimization consists of avoiding common array wrappers, as they complicate dispatch (necessitating complex Union-typed method arguments). It's not great that we have to do this, but sadly that's the reality of Julia's array types: extending every method to use a Dense array alias that covers contiguous views/reshapes/reinterprets is just not practical, leading to ambiguities and slow method-insertion times. Essentially, the following Union will now just be represented by the array type itself:

DenseJLArray (alias for Union{JLArray{T, N}, Base.ReinterpretArray{T, N, S, A} where {A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, Base.ReshapedArray{T, N, A} where A<:Union{Base.ReinterpretArray{T, N, S, A} where {T, N, A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, SubArray{T, N, A, I, true} where {A<:Union{Base.ReinterpretArray{T, N, S, A} where {T, N, A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, Base.ReshapedArray{T, N, A} where {T, N, A<:Union{Base.ReinterpretArray{T, N, S, A} where {T, N, A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}}, JLArray}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}} where {T, N})

API changes

To make this possible, array back-ends will need to wrap their data in a DataRef and call GPUArrays.unsafe_free! instead of the actual finalizer, which is now registered with the DataRef:

buf = ...                               # allocate the back-end buffer
managed_buf = DataRef(buf) do buf       # the actual finalizer is registered here
  free(buf)
end
obj = new(managed_buf)
finalizer(GPUArrays.unsafe_free!, obj)  # the object finalizer only drops a reference

To get a hold of the underlying data, call getindex, i.e., managed_buf[]. To create a new reference that shares the underlying data, call copy(managed_buf).
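
For illustration, a minimal sketch of these operations, continuing from the buf/free placeholders above (and assuming unsafe_free! can also be called on the DataRef itself):

raw = managed_buf[]                  # getindex: the underlying buffer
shared = copy(managed_buf)           # a new reference to the same data
GPUArrays.unsafe_free!(shared)       # drops one reference; the data stays alive
GPUArrays.unsafe_free!(managed_buf)  # last reference gone: free(buf) runs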

The second big part is a new interface method, GPUArrays.derive(::Type{T}, N::Int, a::AbstractGPUArray, dims::Dims, offset::Int). This method is used by the view/reshape/reinterpret functionality to create a derived array, so it will need to call copy(::DataRef) on the array's storage. Also note the new offset parameter, which will need to be added to the array structure to keep track of the offset (in number of elements) for the purpose of implementing views. A simple implementation of this method could be:

function GPUArrays.derive(::Type{T}, N::Int, a::MyGPUArray, dims::Dims, offset::Int) where {T}
    data = copy(a.data)   # share the underlying buffer, bumping the refcount
    # re-express the existing offset (in elements of eltype(a)) in elements of T
    offset = (a.offset * Base.elsize(a)) ÷ sizeof(T) + offset
    MyGPUArray{T,N}(data, dims; offset)
end
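
For reference, a hedged sketch of how the shared wrapper functionality might then materialize, e.g., a contiguous reshape through this method (the actual implementation lives in GPUArrays.jl itself):

function Base.reshape(a::MyGPUArray{T}, dims::Dims{N}) where {T,N}
    prod(dims) == length(a) || throw(DimensionMismatch("new dimensions must be consistent with array length"))
    # same element type, same data, new shape; no additional element offset
    GPUArrays.derive(T, N, a, dims, 0)
end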

Fixes JuliaGPU/Metal.jl#149

cc @jpsamaroo

maleadt changed the title from "Port functionality from CUDA.jl." to "Port storage handling + wrapper materialization from CUDA.jl." on Apr 12, 2023
@pxl-th (Member) commented Apr 25, 2023

I've opened a PR that migrates AMDGPU to GPUArrays' storage handling: JuliaGPU/AMDGPU.jl#416
All seems to work fine.

There's one suggestion: it'd be good to be able to free buffers immediately when the user calls unsafe_free!.
With the current implementation, you need to call unsafe_free! on all derived arrays.

pxl-th@Leleka:~/.julia/dev/AMDGPU$ JULIA_DEBUG=AMDGPU amdjl

julia> using AMDGPU

julia> x = ROCArray(ones(Float32, 16));
┌ Debug: Allocating 64 bytes from ROCMemoryPool @ 0x00000000025531a0: Segment global, Flags (coarsegrained), Size 11.984 GiB (11.984 GiB max allocation), Runtime Alloc: true (4.000 KiB granularity, 4.000 KiB alignment), All Accessible: false
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:260
┌ Debug: Allocation phase 1: HSA_STATUS_SUCCESS
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:295

julia> y = reshape(x, 4, 4);

julia> AMDGPU.unsafe_free!(x)

julia> AMDGPU.unsafe_free!(y)
┌ Debug: Freed 64 bytes @ Ptr{Nothing} @0x00007fd820200000 ptr from pool
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:366

AMDGPU currently has this implemented: if the user calls AMDGPU.unsafe_free!, it frees the buffer immediately, and the ROCArray finalizers call AMDGPU._safe_free!, which just decreases the refcount if the array was previously freed by AMDGPU.unsafe_free!.

@maleadt (Member, Author) commented Apr 26, 2023

> It'd be good to be able to free buffers immediately when the user calls unsafe_free!.
> With the current implementation, you need to call unsafe_free! on all derived arrays.

That's on purpose. It would be very dangerous if unsafe_free! freed all instances of an array. The idea is that you use unsafe_free! like this:

function whatever(a)
  b = view(a, :)
  # use b
  unsafe_free!(b)
  # return something
end

If unsafe_free! destroyed a too, it would be impossible to use it in library functions.

@pxl-th (Member) commented Apr 26, 2023

> If unsafe_free! destroyed a too, it would be impossible to use it in library functions.

Indeed. However, if some underlying function (from a library you do not control) derives an array from your input argument, you are left at the mercy of the GC.
In practice this creates a noticeable delay, enough to trigger the infamous OOM error on AMDGPU, which crashes the runtime.
I'm observing this again with this transition.

Hence, AMDGPU began to free immediately.
For that reason, inside functions I was avoiding calling unsafe_free! on input arguments and on arrays derived from them.
If the user performs stream synchronization before calling unsafe_free!, it should be safe to free immediately:

function whatever(a)
  b = view(a, :)
  # use b
  return something
end

a = ...
something = whatever(a)
synchronize()
unsafe_free!(a)

@pxl-th (Member) commented Apr 26, 2023

For example, this will not free m because of the broadcast, and you are not really in control of that:

julia> x = AMDGPU.rand(Float32, 16, 2);

julia> m = mean(x; dims=1);

julia> y = x .- m;

julia> AMDGPU.unsafe_free!(m)

@maleadt (Member, Author) commented Apr 26, 2023

> In practice this creates a noticeable delay, enough to trigger the infamous OOM error on AMDGPU, which crashes the runtime.
> I'm observing this again with this transition.

Well, the previous situation was unsafe, where you could derive a SubArray while freeing the parent array. So from a correctness POV, this is a strict improvement.

> Hence, AMDGPU began to free immediately.

Yeah, invalidly so, since there was still a reachable SubArray out there.

Finally, all this shouldn't really cause an OOM if AMDGPU.jl calls GC.gc(...) when allocation fails.

@pxl-th (Member) commented Apr 26, 2023

> Finally, all this shouldn't really cause an OOM if AMDGPU.jl calls GC.gc(...) when allocation fails.

It does; AMDGPU has an allocation/retry mechanism.
But not all allocations come from the Julia side: at kernel launch time, HSA allocates scratch memory, and if it fails to do so (when there's no memory left), it crashes.

@pxl-th (Member) commented Apr 26, 2023

Here's where that happens: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/5e08dcc5159325dfef5a48c2668b06fd6fef0469/src/core/runtime/amd_gpu_agent.cpp#L1329

Performing allocations only on the Julia side does not lead to crashes.

for i in 1:1_000_000
    ROCArray{Float32}(undef, 1024 * 1024)
end

But if there are kernel launches that use a non-zero amount of scratch memory, then at some point the OOM will happen on the scratch-allocation side and not on the Julia side.

@pxl-th (Member) commented Apr 26, 2023

IMO, users should be aware of whether they are still using derived arrays before calling unsafe_free!.
And if they passed an array to a function which derives from it, then they should synchronize before freeing.

@maleadt (Member, Author) commented Apr 26, 2023

> At kernel launch time, HSA allocates scratch memory, and if it fails to do so (when there's no memory left), it crashes.

Well, then AMDGPU.jl should check before launching kernels that sufficient memory is available. Or something like that, I'm playing armchair expert here.

I agree that the whole interaction between GPU allocations and the GC is suboptimal, especially in the context of external allocations, but that's hardly an argument to make a proposed API way too overeager for most use cases (and IMO just wrong).

> IMO, users should be aware of whether they are still using derived arrays before calling unsafe_free!.

I disagree. unsafe_free! is semantically a local operation that applies to the current instance of an object, and nothing more. If you want it to kill all memory, write your own version that peeks inside and frees the contained buffer, and/or drops the refcount accordingly.
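
For example, a purely hypothetical sketch of such an eager variant, reusing the MyGPUArray layout from the PR description (this is not part of the proposed API):

function force_free!(a::MyGPUArray)
    free(a.data[])  # immediately release the buffer, for *all* aliases
    # NOTE: any other array sharing this buffer now points at freed memory,
    # and its DataRef must be prevented from freeing it a second time.
    return
end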

> And if they passed an array to a function which derives from it, then they should synchronize before freeing.

I disagree. Users should not have to think about that.

@maleadt (Member, Author) commented Apr 26, 2023

Again, I'm sympathetic to the view, but I think it would make unsafe_free! just too dangerous. Currently it's relatively safe to apply it to objects you control (either because you allocated them, or because you derived them from an existing array), and that wouldn't be the case if it applied to all memory related to the object. And sure, the current behavior happens to work like that, but I'd consider that a bug.

@jpsamaroo Would it be possible to avoid these crashes on external allocations differently? Can you cheaply check for the free memory? At some point in CUDA.jl, we kept track of allocated memory so that we could check & free cheaply (in our case, we checked on every allocation).

@pxl-th (Member) commented Apr 26, 2023

> I disagree. Users should not have to think about that.

But they do have to think when they are performing mutating operations on derived arrays, which modify all instances that share the underlying data.
Extending unsafe_free! to behave the same way is natural (though of course this is subjective).

> Can you cheaply check for the free memory?

HSA only added an API for checking the amount of free memory in ROCm 5.3+ (the reported figure also includes currently allocated scratch memory).

Alternatively, AMDGPU also keeps a count of all allocated memory, but that will be optimistic: it does not account for other processes, and it makes no distinction between pools & regions (or how much memory is available in each), just lumping them together (which we could change). And we'd need to guess at some kind of boundary.

@maleadt (Member, Author) commented Apr 26, 2023

> But they do have to think when they are performing mutating operations on derived arrays, which modify all instances that share the underlying data.

But those operations can be read-only, in which case you can safely unsafe_free! without synchronizing.

I wouldn't be opposed to adding, say, a keyword argument to unsafe_free! that makes it behave recursively (or force=true or something). But I'd prefer for the default to err on the side of correctness.

@pxl-th (Member) commented Apr 26, 2023

Having a way to force it via a flag would be good.

@pxl-th (Member) commented Jun 27, 2023

I think we can go as-is with this PR.
AMDGPU now has (on the pxl-th/hiprtc branch for now) a much more robust alloc/retry mechanism.
In most places where unsafe_free! was used, it is no longer needed.

And this PR removes quite a few lines, so it's good to have :)

@maleadt (Member, Author) commented Aug 31, 2023

Updated, and updated back-ends.

@pxl-th Anything that needs to happen here to make AMDGPU.jl work?

@pxl-th (Member) commented Aug 31, 2023

> Updated, and updated back-ends.
>
> @pxl-th Anything that needs to happen here to make AMDGPU.jl work?

I'm almost sure there isn't, but I'll double-check tomorrow.

@maleadt (Member, Author) commented Sep 1, 2023

I'll merge and tag already so that I can move forwards, but do let me know if there are any issues.

maleadt merged commit 752a953 into master on Sep 1, 2023
maleadt deleted the tb/dev branch on September 1, 2023 at 09:55
@pxl-th (Member) commented Sep 1, 2023

All works fine, thanks!
Can we tag a release, or are there blockers left?

@maleadt (Member, Author) commented Sep 1, 2023

Tagged as 9.0!
