Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388
Comments
Overall I'm worried about modifying the buffer mapping mechanisms this late, when it was perhaps the single most difficult part of the API to find a good design for and reach consensus on. We discussed the slight inefficiencies that UMA has with the current mapping mechanism multiple times in F2F meetings, and #605 suggests a UMA feature could be done to optimize this later (later could be now); I think a UMA optional feature would make more sense. The proposal is interesting, but it has some issues. Side note: there is also Multiple buffers per …
I was going to have comments, but @Kangz covered everything I was going to say and more.
This isn't true. This proposal requires scratch space, certainly, but so does …
This is a good point! I suppose this proposal only makes sense for read/write buffers (which we don't have today, but which I think have a natural path forward).
It's relatively rare for an application to actually need read/write mapping. Sure, we could add them for this use case, but applications would still need to know which one to use and explicitly switch between them based on whether the adapter is UMA or not.
I'm not saying I'm for or against any proposal here, only voicing that I agree that what's in the API today does not fully satisfy workflows with dynamic data moving across host/device, and that this will be a performance issue in real-world usage. In compute workloads, getting data back from the device is a major part of the upload -> compute -> download flow, and until we have a …
👍 IMO having the implementation manage the ring buffer with writeBuffer/readBuffer and incurring a copy is acceptable if the alternative is managing exclusively-upload or exclusively-download buffers in user code (I found the spec detail that indicates a buffer cannot be both, so user staging pools need double the memory for bidirectional transfer). This way, multiple libraries trying to perform upload/download are not each keeping around large GPUBuffers for this purpose.
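As an illustration of the double-memory point above (this sketch is not from the thread; names are illustrative): because a mappable buffer cannot carry both MAP_READ and MAP_WRITE usage, a user-managed staging pool needs one buffer per direction for bidirectional transfer.

```ts
// One pooled slot for bidirectional transfer holds 2 * size bytes, since
// upload and download staging must be separate buffers under today's rules.
function createStagingPair(device: GPUDevice, size: number) {
  const upload = device.createBuffer({
    size,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  });
  const download = device.createBuffer({
    size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  return { upload, download };
}
```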
This is no worse than the possibility of an OOM when trying to …
The point here wasn't that …
I agree with the concerns about managing temporary buffers for mapping expressed by @Kangz. Their lifetime is attached to mapping, and it's worse than the ring buffer we currently have for writes. I also agree with @litherum that it would be good to be able to avoid copies on systems that can do that. As for the queue argument for mapping, this correlates with #1977 (comment). It's probably needed.
I'm happy to help by writing an optional feature for UMA that allows …
It would be pretty unfortunate if authors had to opt in to avoid using 2x memory on UMA machines.
I don't think anyone is disagreeing about that, but if we're going to avoid it, we're going to need a proposal that works. I don't think we're getting any closer to one.
WebGPU meeting minutes 2022-02-23
So here's the proposal for the extension: what I wrote above, with the addition that if readonly …
WebGPU meeting minutes 2022-03-16
As discussed in the meeting, moving to post-V1 polish since the only proposal so far is an optional feature.
GPU Web 2023-06-07/08 (Pacific time)
The current restriction on buffers created with map flags also causes problems on NUMA architectures. Small, frequently updated uniform buffers can be stored in system memory without a significant impact on performance. In addition, with the advent of Resizable BAR and SAM, it is possible to write data directly to VRAM using the CPU (we can even write textures directly to VRAM and later change the access pattern from linear to swizzled, for better bandwidth).
Background
WebGPU currently has 2 buffer upload facilities: `GPUBuffer.mapAsync()` and `GPUQueue.writeBuffer()`.

For `GPUBuffer.mapAsync()`, there is currently the restriction that a mappable buffer cannot be used for anything else, other than COPY. This means that, in order to be useful, an application has to allocate 2 buffers - one for mapping and one for using. And, if an application wants to round-trip data through a shader, it has to allocate 3 buffers - one for the upload, one for the download, and one for the shader. Therefore, in order to use `mapAsync()`, an application needs to double (or triple) its memory use and add one or two extra copy operations. On a UMA system, neither the extra allocation nor the copy is necessary, which means there's both a performance and a memory cost to using `mapAsync()` on those systems. What's more, because the application is explicitly writing this code, there's not really anything we can do to optimize out the extra buffer allocation / copy operation.
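For concreteness, here is a minimal sketch of that upload path as an application has to write it today. The function and variable names are illustrative, and sizes are assumed to be multiples of 4 as WebGPU requires:

```ts
// Today's upload path: a MAP_WRITE staging buffer plus an explicit copy into
// the real STORAGE buffer, even on UMA hardware.
async function uploadToStorageBuffer(
  device: GPUDevice,
  storageBuffer: GPUBuffer, // usage: STORAGE | COPY_DST
  data: ArrayBuffer,
): Promise<void> {
  // Extra allocation: a buffer that exists only to be mapped.
  const staging = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  });
  await staging.mapAsync(GPUMapMode.WRITE);
  new Uint8Array(staging.getMappedRange()).set(new Uint8Array(data));
  staging.unmap();

  // Extra copy: staging -> storage, scheduled on the queue.
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(staging, 0, storageBuffer, 0, data.byteLength);
  device.queue.submit([encoder.finish()]);
}
```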
On the other hand, `GPUQueue.writeBuffer()` is associated with a particular point in the queue's timeline, and therefore can be called even when the destination buffer is in use by the GPU. This means that the implementation of `writeBuffer()` is required to copy the data to an intermediate, invisible buffer under the hood, even on UMA systems, and then schedule a copy operation on the queue to copy the data from the intermediate buffer to the final destination. This extra allocation and extra copy operation don't necessarily need to exist on UMA systems. (`GPUQueue.writeBuffer()` is a good API in general because of its simple semantics and ease of use, but it does have this drawback.)
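The same upload via `writeBuffer()` is a single call (again with illustrative names); the staging allocation and copy still happen, they are just hidden inside the implementation:

```ts
// The destination must have COPY_DST usage; data.byteLength must be a
// multiple of 4.
function uploadWithWriteBuffer(
  device: GPUDevice,
  storageBuffer: GPUBuffer,
  data: ArrayBuffer,
): void {
  // Safe to call even while the GPU is still using storageBuffer: the
  // implementation stages the data internally and performs the copy at this
  // point in the queue's timeline - on every architecture, including UMA.
  device.queue.writeBuffer(storageBuffer, 0, data);
}
```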
It would be valuable if we could combine the best parts of `GPUBuffer.mapAsync()` and `GPUQueue.writeBuffer()` into something which doesn't require an extra allocation or copy on UMA systems. This kind of combination would have to be something that isn't UMA-specific, but would work on both UMA and non-UMA systems, and UMA systems would be able to avoid the extra allocations/copies under the hood.

Goals
- The asynchronous nature of `GPUBuffer.mapAsync()` would be valuable, because that allows the implementation to not have to stash any data due to the destination buffer being busy.
- The mapping facility of `GPUBuffer.mapAsync()` would be valuable, because it allows the array buffer to be backed directly by GPU memory, thereby potentially avoiding another copy on UMA systems.
- The queue association of `GPUQueue.writeBuffer()` would be valuable, because non-UMA systems would need to schedule an internal copy to the destination, and specifying the queue gives them a place to do that.

Proposal
I think the most natural solution to this would be: give `mapAsync()` an extra `GPUQueue` argument (`getMappedRange()` and `unmap()` will implicitly use this queue). We could also say that the queue is optional, and if it's unspecified, the device's default queue will be used instead.

That's it!
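To make the shape of the change concrete, here is a rough sketch of what a write mapping could look like with the extra argument, assuming a `device: GPUDevice` in scope. Both the trailing queue parameter on `mapAsync()` and the `STORAGE | MAP_WRITE` usage combination are hypothetical - neither exists in WebGPU today - so the cast below is only there to get past current type definitions:

```ts
// Hypothetical API sketch - not valid against today's WebGPU.
const buffer = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.MAP_WRITE, // not allowed today
});

// Proposed: pass the queue that the mapping is associated with.
await (buffer as any).mapAsync(GPUMapMode.WRITE, 0, 1024, device.queue);
new Float32Array(buffer.getMappedRange()).fill(1.0);

// On non-UMA systems, unmap() would schedule the hidden temporary-to-
// destination copy on the queue passed to mapAsync(); on UMA systems the
// mapping would be backed directly by the buffer's memory, so no copy occurs.
buffer.unmap();
```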
On UMA systems:
- For a read mapping, `mapAsync()` would just ignore its `GPUQueue` argument.
- For a write mapping, `mapAsync()` would use its `GPUQueue` argument to schedule a `clearBuffer()` command of the relevant region of the buffer. After the clear operation is complete, the map promise would be resolved.

On non-UMA systems:
- For a read mapping, `mapAsync()` would schedule a copy from the source (storage) buffer to a temporary buffer using the specified `GPUQueue`, and the map operation would proceed just like normal on the temporary buffer. This is exactly what an author would have had to do themselves.
- For a write mapping, `mapAsync()` would just stash the queue, map a temporary buffer, and wait for `unmap()` to be called. When `unmap()` is called, it would schedule a copy on the stashed queue from the temporary buffer to the destination buffer. This is exactly what an author would have had to do themselves.

It's important to note that this proposal doesn't restrict the amount of control a WebGPU author has. If an author wants to allocate their own map/copy buffer and explicitly copy the data to/from it on its way to its final destination (as they would do today), they can still do that, and no invisible under-the-hood temporary buffers would be allocated.
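For comparison, this is roughly the read-back code an author has to write today - and it is also what a non-UMA implementation could do internally under this proposal, per the non-UMA read case above. Names are illustrative and `size` is assumed to be a multiple of 4:

```ts
// Today's download path: a MAP_READ readback buffer plus an explicit copy out
// of the STORAGE buffer.
async function readBackStorageBuffer(
  device: GPUDevice,
  storageBuffer: GPUBuffer, // usage: STORAGE | COPY_SRC
  size: number,
): Promise<ArrayBuffer> {
  // Extra allocation: a buffer that exists only for the readback.
  const readback = device.createBuffer({
    size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });

  // Extra copy: storage -> readback, scheduled on the queue.
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(storageBuffer, 0, readback, 0, size);
  device.queue.submit([encoder.finish()]);

  // Resolves once the copy has completed on the GPU.
  await readback.mapAsync(GPUMapMode.READ);
  const result = readback.getMappedRange().slice(0);
  readback.unmap();
  return result;
}
```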
This proposal also has a natural path forward for read/write mapping.