
Port storage handling + wrapper materialization from CUDA.jl. #468

Merged: 7 commits, Sep 1, 2023

Conversation

@maleadt (Member) commented Apr 11, 2023

This PR ports 3 pieces of functionality from CUDA.jl:

  • a buffer refcount type (ArrayStorage in CUDA.jl) so that we can create array objects that share data;
  • using that, functionality to represent simple views/reshapes/reinterprets without using Base wrappers;
  • and, unrelatedly, reusable functionality for reporting on unsupported element types (see the sketch below).
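
For illustration, the element-type check could look something like the following minimal sketch. It relies on Base.allocatedinline; the name check_eltype is hypothetical, and the actual GPUArrays entry point may differ:

function check_eltype(T::Type)
    # GPU buffers can only hold data that Julia stores inline
    Base.allocatedinline(T) || throw(ArgumentError(
        "GPU arrays only support element types that are stored inline; got $T"))
    return
end

check_eltype(Float32)          # passes
check_eltype(Vector{Float32})  # throws ArgumentError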

Motivation

The goal of these changes is to make CUDA.jl's array-wrapper optimization available to all back-ends. The optimization consists of avoiding common array wrappers, as they complicate dispatch (necessitating complex Union-typed method arguments). It's not great that we have to do this, but sadly that's the reality of Julia's array types: extending every method to use a Dense array alias that covers contiguous views/reshapes/reinterprets is just not practical, leading to ambiguities and slow method-insertion times. Essentially, the following Union will now just be represented by the array type itself:

DenseJLArray (alias for Union{JLArray{T, N}, Base.ReinterpretArray{T, N, S, A} where {A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, Base.ReshapedArray{T, N, A} where A<:Union{Base.ReinterpretArray{T, N, S, A} where {T, N, A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, SubArray{T, N, A, I, true} where {A<:Union{Base.ReinterpretArray{T, N, S, A} where {T, N, A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, Base.ReshapedArray{T, N, A} where {T, N, A<:Union{Base.ReinterpretArray{T, N, S, A} where {T, N, A<:Union{SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}, S}, SubArray{T, N, A, I, true} where {T, N, A<:JLArray, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}, JLArray}}, JLArray}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}}} where {T, N})

API changes

To make this possible, array back-ends will need to wrap their data in a DataRef and call GPUArrays.unsafe_free! instead of the actual finalizer, which is now registered with the DataRef:

buf = ...                               # allocate the back-end buffer
managed_buf = DataRef(buf) do buf       # the actual finalizer is registered here
  free(buf)
end
obj = new(managed_buf)
finalizer(GPUArrays.unsafe_free!, obj)  # the object finalizer only drops a reference

To get a hold of the underlying data, call getindex, i.e., managed_buf[]. To create a new reference that shares the underlying data, call copy(managed_buf).
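
For illustration, a minimal sketch of these operations, continuing from the buf/free placeholders above (and assuming unsafe_free! can also be called on the DataRef itself):

raw = managed_buf[]                  # getindex: the underlying buffer
shared = copy(managed_buf)           # a new reference to the same data
GPUArrays.unsafe_free!(shared)       # drops one reference; the data stays alive
GPUArrays.unsafe_free!(managed_buf)  # last reference gone: free(buf) runs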

The second big part is a new interface method, GPUArrays.derive(::Type{T}, N::Int, a::AbstractGPUArray, dims::Dims, offset::Int). This method is used by the view/reshape/reinterpret functionality to create a derived array, so it will need to call copy(::DataRef) on the array's storage. Also note the new offset parameter, which will need to be added to the array structure to keep track of the offset (in number of elements) for the purpose of implementing views. A simple implementation of this method could be:

function GPUArrays.derive(::Type{T}, N::Int, a::MyGPUArray, dims::Dims, offset::Int) where {T}
    data = copy(a.data)   # share the underlying buffer, bumping the refcount
    # re-express the existing offset (in elements of eltype(a)) in elements of T
    offset = (a.offset * Base.elsize(a)) ÷ sizeof(T) + offset
    MyGPUArray{T,N}(data, dims; offset)
end
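
For reference, a hedged sketch of how the shared wrapper functionality might then materialize, e.g., a contiguous reshape through this method (the actual implementation lives in GPUArrays.jl itself):

function Base.reshape(a::MyGPUArray{T}, dims::Dims{N}) where {T,N}
    prod(dims) == length(a) || throw(DimensionMismatch("new dimensions must be consistent with array length"))
    # same element type, same data, new shape; no additional element offset
    GPUArrays.derive(T, N, a, dims, 0)
end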

Fixes JuliaGPU/Metal.jl#149

cc @jpsamaroo

maleadt changed the title from "Port functionality from CUDA.jl." to "Port storage handling + wrapper materialization from CUDA.jl." on Apr 12, 2023
@pxl-th (Member) commented Apr 25, 2023

I've opened a PR that migrates AMDGPU to GPUArrays' storage handling: JuliaGPU/AMDGPU.jl#416
All seems to work fine.

There's one suggestion: it'd be good to be able to free buffers immediately when the user calls unsafe_free!.
With the current implementation, you need to call unsafe_free! on all derived arrays.

pxl-th@Leleka:~/.julia/dev/AMDGPU$ JULIA_DEBUG=AMDGPU amdjl

julia> using AMDGPU

julia> x = ROCArray(ones(Float32, 16));
┌ Debug: Allocating 64 bytes from ROCMemoryPool @ 0x00000000025531a0: Segment global, Flags (coarsegrained), Size 11.984 GiB (11.984 GiB max allocation), Runtime Alloc: true (4.000 KiB granularity, 4.000 KiB alignment), All Accessible: false
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:260
┌ Debug: Allocation phase 1: HSA_STATUS_SUCCESS
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:295

julia> y = reshape(x, 4, 4);

julia> AMDGPU.unsafe_free!(x)

julia> AMDGPU.unsafe_free!(y)
┌ Debug: Freed 64 bytes @ Ptr{Nothing} @0x00007fd820200000 ptr from pool
└ @ AMDGPU.Runtime.Mem ~/.julia/dev/AMDGPU/src/runtime/memory.jl:366

AMDGPU currently has this implemented: if the user calls AMDGPU.unsafe_free!, it frees the buffer immediately, and the ROCArray finalizers call AMDGPU._safe_free!, which just decreases the refcount if the array was previously freed by AMDGPU.unsafe_free!.

@maleadt (Member, Author) commented Apr 26, 2023

> It'd be good to be able to free buffers immediately when the user calls unsafe_free!.
> With the current implementation, you need to call unsafe_free! on all derived arrays.

That's on purpose. It would be very dangerous if unsafe_free! freed all instances of an array. The idea is that you use unsafe_free! like this:

function whatever(a)
  b = view(a, :)
  # use b
  unsafe_free!(b)
  # return something
end

If unsafe_free! destroyed a too, it would be impossible to use it in library functions.

@pxl-th (Member) commented Apr 26, 2023

> If unsafe_free! destroyed a too, it would be impossible to use it in library functions.

Indeed. However, if some underlying function (from a library you do not control) derives an array from your input argument, you are left at the mercy of the GC.
In practice this creates a noticeable delay, enough to trigger the infamous OOM error on AMDGPU, which crashes the runtime.
I'm observing this again with this transition.

Hence, AMDGPU began to free immediately.
For that reason, inside functions I was avoiding calling unsafe_free! on input arguments and on arrays derived from them.
If the user performs stream synchronization before calling unsafe_free!, it should be safe to free immediately:

function whatever(a)
  b = view(a, :)
  # use b
  return something
end

a = ...
something = whatever(a)
synchronize()
unsafe_free!(a)

@pxl-th (Member) commented Apr 26, 2023

For example, this will not free m because of the broadcast, and you are not really in control of that:

julia> x = AMDGPU.rand(Float32, 16, 2);

julia> m = mean(x; dims=1);

julia> y = x .- m;

julia> AMDGPU.unsafe_free!(m)

@maleadt (Member, Author) commented Apr 26, 2023

> In practice this creates a noticeable delay, enough to trigger the infamous OOM error on AMDGPU, which crashes the runtime.
> I'm observing this again with this transition.

Well, the previous situation was unsafe, where you could derive a SubArray while freeing the parent array. So from a correctness POV, this is a strict improvement.

> Hence, AMDGPU began to free immediately.

Yeah, invalidly so, since there was still a reachable SubArray out there.

Finally, all this shouldn't really cause an OOM if AMDGPU.jl calls GC.gc(...) when allocation fails.

@pxl-th (Member) commented Apr 26, 2023

> Finally, all this shouldn't really cause an OOM if AMDGPU.jl calls GC.gc(...) when allocation fails.

It does; AMDGPU has an allocation/retry mechanism.
But not all allocations come from the Julia side: at kernel launch time, HSA allocates scratch memory, and if it fails to do so (when there's no memory left), it crashes.

@pxl-th (Member) commented Apr 26, 2023

Here's where that happens: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/5e08dcc5159325dfef5a48c2668b06fd6fef0469/src/core/runtime/amd_gpu_agent.cpp#L1329

Performing allocations only on the Julia side does not lead to crashes.

for i in 1:1_000_000
    ROCArray{Float32}(undef, 1024 * 1024)
end

But if there are kernel launches that use a non-zero amount of scratch memory, then at some point the OOM will happen on the scratch-allocation side and not on the Julia side.

@pxl-th (Member) commented Apr 26, 2023

IMO, users should be aware of whether they are still using derived arrays before calling unsafe_free!.
And if they passed an array to a function which derives from it, then they should synchronize before freeing.

@maleadt (Member, Author) commented Apr 26, 2023

> At kernel launch time, HSA allocates scratch memory, and if it fails to do so (when there's no memory left), it crashes.

Well, then AMDGPU.jl should check before launching kernels that sufficient memory is available. Or something like that, I'm playing armchair expert here.

I agree that the whole interaction between GPU allocations and the GC is suboptimal, especially in the context of external allocations, but that's hardly an argument to make a proposed API way too overeager for most use cases (and IMO just wrong).

> IMO, users should be aware of whether they are still using derived arrays before calling unsafe_free!.

I disagree. unsafe_free! is semantically a local operation that applies to the current instance of an object, and nothing more. If you want it to kill all memory, write your own version that peeks inside and frees the contained buffer, and/or drops the refcount accordingly.
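
For example, a purely hypothetical sketch of such an eager variant, reusing the MyGPUArray layout from the PR description (this is not part of the proposed API):

function force_free!(a::MyGPUArray)
    free(a.data[])  # immediately release the buffer, for *all* aliases
    # NOTE: any other array sharing this buffer now points at freed memory,
    # and its DataRef must be prevented from freeing it a second time.
    return
end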

> And if they passed an array to a function which derives from it, then they should synchronize before freeing.

I disagree. Users should not have to think about that.

@maleadt (Member, Author) commented Apr 26, 2023

Again, I'm sympathetic to the view, but I think it would make unsafe_free! just too dangerous. Currently it's relatively safe to apply it to objects you control (either because you allocated them, or because you derived them from an existing array), and that wouldn't be the case if it applied to all memory related to the object. And sure, the current behavior happens to work like that, but I'd consider that a bug.

@jpsamaroo Would it be possible to avoid these crashes on external allocations differently? Can you cheaply check for the free memory? At some point in CUDA.jl, we kept track of allocated memory so that we could check & free cheaply (in our case, we checked on every allocation).

@pxl-th (Member) commented Apr 26, 2023

> I disagree. Users should not have to think about that.

But they do have to think when they are performing mutating operations on derived arrays, which modify all instances that share the underlying data.
Extending unsafe_free! to behave the same way is natural (though of course this is subjective).

> Can you cheaply check for the free memory?

HSA only added an API for checking the amount of free memory in ROCm 5.3+ (the reported figure also includes currently allocated scratch memory).

Alternatively, AMDGPU also keeps a count of all allocated memory, but that will be optimistic: it does not account for other processes, and it makes no distinction between pools & regions (or how much memory is available in each), just lumping them together (which we could change). And we'd need to guess at some kind of boundary.

@maleadt (Member, Author) commented Apr 26, 2023

> But they do have to think when they are performing mutating operations on derived arrays, which modify all instances that share the underlying data.

But those operations can be read-only, in which case you can safely unsafe_free! without synchronizing.

I wouldn't be opposed to adding, say, a keyword argument to unsafe_free! that makes it behave recursively (or force=true or something). But I'd prefer for the default to err on the side of correctness.

@pxl-th (Member) commented Apr 26, 2023

Having a way to force it via a flag would be good.

@pxl-th (Member) commented Jun 27, 2023

I think we can go as-is with this PR.
AMDGPU now has (on the pxl-th/hiprtc branch for now) a much more robust alloc/retry mechanism.
In most places where unsafe_free! was used, it is no longer needed.

And this PR removes quite a few lines, so it's good to have :)

@maleadt (Member, Author) commented Aug 31, 2023

Updated, and updated back-ends.

@pxl-th Anything that needs to happen here to make AMDGPU.jl work?

@pxl-th (Member) commented Aug 31, 2023

> Updated, and updated back-ends.
>
> @pxl-th Anything that needs to happen here to make AMDGPU.jl work?

I'm almost sure there isn't, but I'll double-check tomorrow.

@maleadt (Member, Author) commented Sep 1, 2023

I'll merge and tag already so that I can move forwards, but do let me know if there are any issues.

maleadt merged commit 752a953 into master on Sep 1, 2023
maleadt deleted the tb/dev branch on September 1, 2023 at 09:55
@pxl-th (Member) commented Sep 1, 2023

All works fine, thanks!
Can we tag a release, or are there blockers left?

@maleadt (Member, Author) commented Sep 1, 2023

Tagged as 9.0!
