
Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388

Open
litherum opened this issue Dec 7, 2021 · 21 comments
Labels: api WebGPU API, large

@litherum
Contributor

litherum commented Dec 7, 2021

Background

WebGPU currently has 2 buffer upload facilities: GPUBuffer.mapAsync() and GPUQueue.writeBuffer().

For GPUBuffer.mapAsync(), there is currently the restriction that a mappable buffer cannot be used as anything else, other than COPY. This means that, in order to be useful, an application has to allocate 2 buffers - one for mapping and one for using. And, if an application wants to round-trip data through a shader, it has to allocate 3 buffers - one for the upload, one for the download, and one for the shader. Therefore, in order to use mapAsync(), an application needs to double (or triple) its memory use and add one or two extra copy operations. On a UMA system, neither the extra allocation nor the copy is necessary, which means there's both a performance and a memory cost to using mapAsync() on those systems. What's more, because the application writes this code explicitly, there's not really anything the implementation can do to optimize out the extra buffer allocation / copy operation.
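For concreteness, here's a minimal sketch of the staging pattern this forces today (standard WebGPU; assumes a `device` and a `data` Uint8Array are already in scope):

```js
// Today's mapAsync() upload path: the mappable buffer may only have
// MAP_WRITE | COPY_SRC usage, so a second, GPU-usable buffer is required.
const staging = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});
const storage = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

await staging.mapAsync(GPUMapMode.WRITE);
new Uint8Array(staging.getMappedRange()).set(data);
staging.unmap();

// The extra copy, needed even on UMA where one buffer would have sufficed.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(staging, 0, storage, 0, data.byteLength);
device.queue.submit([encoder.finish()]);
```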

On the other hand, GPUQueue.writeBuffer() is associated with a particular point in the queue's timeline, and therefore can be called even when the destination buffer is in use by the GPU. This means that the implementation of writeBuffer() is required to copy the data to an intermediate invisible buffer under the hood, even on UMA systems, and then schedule a copy operation on the queue to move the data from the intermediate buffer to the final destination. This extra allocation and extra copy operation don't necessarily need to exist on UMA systems. (GPUQueue.writeBuffer() is a good API in general because of its simple semantics and ease of use, but it does have this drawback.)
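By contrast, the writeBuffer() path is a single call, with the staging allocation and queue copy hidden inside the implementation (same assumed `storage` and `data` as above):

```js
// One call; the implementation copies `data` into an internal staging
// allocation and schedules the staging -> storage copy on the queue,
// even on UMA systems where neither would be needed.
device.queue.writeBuffer(storage, /* bufferOffset */ 0, data);
```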

It would be valuable if we could combine the best parts of GPUBuffer.mapAsync() and GPUQueue.writeBuffer() into something which doesn't require an extra allocation or copy on UMA systems. This kind of combination couldn't be UMA-specific; it would have to work on both UMA and non-UMA systems, with UMA systems able to avoid the extra allocations/copies under the hood.

Goals

  1. The "async" part of GPUBuffer.mapAsync() would be valuable, because that allows the implementation to not have to stash any data due to the destination buffer being busy.
  2. The "map" part of GPUBuffer.mapAsync() would be valuable because it allows the array buffer to be backed directly by GPU memory, thereby potentially avoiding another copy on UMA systems.
  3. The "queue" part of GPUQueue.writeBuffer() would be valuable, because non-UMA systems would need to schedule an internal copy to the destination, and specifying the queue gives them a place to do that.

Proposal

I think the most natural solution to this would be:

  1. Give mapAsync() an extra GPUQueue argument. (getMappedRange() and unmap() will implicitly use this queue). We could also say that the queue is optional, and if it's unspecified, the device's default queue will be used instead.
  2. Relax the requirement that the only other usage a mappable buffer can have is COPY

That's it!
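As a concrete illustration, here is a hypothetical sketch of what the proposed API shape might look like. The exact signature is invented for illustration (today mapAsync() takes an offset and size, not a queue), and the usage combination below is disallowed by the current spec:

```js
// Hypothetical sketch of the proposal; not valid WebGPU today.
const buffer = device.createBuffer({
  size: 1024,
  // Proposal point 2: a mappable buffer with a non-COPY usage.
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE,
});

// Proposal point 1: mapAsync() takes a queue (defaulting to device.queue).
await buffer.mapAsync(GPUMapMode.WRITE, device.queue);
new Float32Array(buffer.getMappedRange()).fill(1.0);
// On UMA this wrote GPU memory directly; on non-UMA, unmap() would schedule
// the staging -> buffer copy on the stashed queue.
buffer.unmap();
```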

  • On a UMA system, you'd be able to map the destination (storage) buffer directly - No copies, no extra allocations, it's living the UMA dream.
    • For reading, mapAsync() would just ignore its GPUQueue argument.
    • For writing, mapAsync() would use its GPUQueue argument to schedule a clearBuffer() command of the relevant region of the buffer. After the clear operation is complete, the map promise would be resolved.
  • On a non-UMA system:
    • For reading, mapAsync() would schedule a copy from the source (storage) buffer to a temporary buffer using the specified GPUQueue, and the map operation would proceed as normal on the temporary buffer. This is exactly what an author would otherwise have had to do themselves.
    • For writing, mapAsync() would just stash the queue, map a temporary buffer, and wait for unmap() to be called. When unmap() is called, it would schedule a copy on the stashed queue from the temporary buffer to the destination buffer. Again, this is exactly what an author would otherwise have had to do themselves.

It's important to note that this proposal doesn't restrict the amount of control a WebGPU author has. If an author wants to allocate their own map/copy buffer and explicitly copy the data to/from it on its way to its final destination (as they would do today; see the sketch below), they can still do that, and no invisible under-the-hood temporary buffers would be allocated.
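For reference, here's the explicit readback pattern an author writes today (a minimal sketch, assuming `device`, `storageBuffer`, and `size` are in scope):

```js
// Allocate the mappable readback buffer (MAP_READ | COPY_DST only).
const readback = device.createBuffer({
  size,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});

// Schedule the copy from the storage buffer on the queue.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(storageBuffer, 0, readback, 0, size);
device.queue.submit([encoder.finish()]);

// Map and read the result once the copy has completed.
await readback.mapAsync(GPUMapMode.READ);
const result = new Float32Array(readback.getMappedRange()).slice();
readback.unmap();
```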

This proposal also has a natural path forward for read/write mapping.

@litherum litherum added this to the V1.0 milestone Dec 7, 2021
@litherum litherum changed the title Cannot upload to UMA storage buffer without an unnecessary copy and unnecessary memory use Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use Dec 7, 2021
@Kangz
Contributor

Kangz commented Dec 7, 2021

Overall I'm worried about modifying the buffer mapping mechanisms this late, when it was perhaps the single most difficult part of the API to find a good design for and reach consensus on. I think a UMA optional feature would make a

We discussed the slight inefficiencies that UMA has with the current mapping mechanism multiple times in F2F, and #605 suggests a UMA feature could be done to optimize this later (later could be now). The proposal is interesting but has some issues.

Side note: there is also mappedAtCreation, which was added so that the initial upload of data into buffers can be made perfectly efficient on UMA systems (up to the copies necessary for process separation).
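For example (standard WebGPU; `vertices` is an assumed Float32Array):

```js
// The buffer is created already mapped, so initial data can be written
// directly, with no separate staging buffer and no queue copy.
const buffer = device.createBuffer({
  size: vertices.byteLength,
  usage: GPUBufferUsage.VERTEX,
  mappedAtCreation: true,
});
new Float32Array(buffer.getMappedRange()).set(vertices);
buffer.unmap(); // the buffer is now ready for GPU use
```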

Multiple buffers per GPUBuffer

It breaks down the model that one WebGPU buffer == one underlying API buffer. This is a pretty useful model to keep because it makes it very clear to the developer what the memory cost of things is. The fact that you can have temporary staging for mappedAtCreation, and potentially shmem wrapped in the ArrayBuffer given to JS, is already very difficult for developers to reason about in terms of cost.

Cost of consistency for MAP_WRITE

Currently, MAP_WRITE buffers give you an ArrayBuffer that contains the current content of the buffer. Since the buffer can only be written by JavaScript, no copies are ever needed to update its content; it's just the ArrayBuffer wrapping shmem shared between the GPU and Web processes. If the GPU can write to the buffer, then we need to copy data from the UMA buffer to the shmem (or even worse, from VRAM to a readback buffer to the shmem).

On the other hand, if we say that mapAsync(MAP_WRITE) always zeroes the buffer, then in most cases the CPU has to zero the buffer, since the UMA/readback buffer isn't shared with the Web process. Either way, there's a memset(0) or a memcpy from the UMA buffer to the shmem.

Consistency for MAP_READ

What happens when JavaScript writes into a buffer that's mapped for reading? Assuming you are able to create an MTLBuffer from a shmem FD to reduce the number of copies as much as possible, the writes that JavaScript did suddenly become visible to the GPU, while in all other configurations, JavaScript writing to the buffer doesn't have any visible effect for the GPU.

Relatively small gains, and a feature proposal

The gains you get with the proposal you suggested seem small: if you have a large amount of data to initialize buffers with, then you can use mappedAtCreation, which is the optimal path. If you need to modify part of a buffer while it's in use, then you have to schedule a copy, because mapping is an ownership transfer of the full buffer (I tried to figure out how to do sub-range mapping efficiently but gave up).

So the cases this helps are when you need to upload data to a buffer after creation, while it's not currently in use by the GPU. This should be a fraction of the actual buffer transfers. It still might be worth speccing as an optional feature, but not modifying the core buffer mapping spec.

The optional "UMA" feature could:

  • Lift the restriction for MAP_WRITE to allow any other read-only usages.
  • MAP_READ already allows all the write-only usages. But maybe more are added in the future so the extension would also lift that? Or it allows it with any other usages, but assumes there is always a UMA -> shmem copy happening in the GPU process (so that JS writes are never made visible to the GPU).
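A rough sketch of how such an optional feature might surface to applications. Everything here is hypothetical: neither the feature name nor the relaxed usage validation exists in the spec:

```js
const adapter = await navigator.gpu.requestAdapter();
// 'unified-memory' is a hypothetical feature name, for illustration only.
if (adapter.features.has('unified-memory')) {
  const device = await adapter.requestDevice({
    requiredFeatures: ['unified-memory'],
  });
  // With the feature enabled, MAP_WRITE could be combined with a read-only
  // usage like UNIFORM, which core WebGPU validation rejects.
  const uniforms = device.createBuffer({
    size: 256,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.UNIFORM,
  });
}
```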

@kainino0x
Contributor

I was going to have comments but @Kangz covered everything I was going to say and more.

@litherum
Contributor Author

litherum commented Dec 8, 2021

It breaks down the model that one WebGPU buffer == one underlying API buffer.

This isn't true. This proposal requires scratch space, certainly, but so does writeBuffer(). It's no worse.

@litherum
Contributor Author

litherum commented Dec 8, 2021

What happens when Javascript writes into the buffer mapped for reading?

This is a good point! I suppose this proposal only makes sense for read/write buffers (which we don't have today, but I think has a natural path forward).

@kainino0x
Contributor

It's relatively rare for an application to actually need read/write mapping. Sure we could add them for this use-case, but applications would still need to know which one to use and explicitly switch between them based on whether the adapter is UMA or not.

@Kangz
Contributor

Kangz commented Dec 9, 2021

This isn't true. This proposal requires scratch space, certainly, but so does writeBuffer(). It's no worse.

writeBuffer() is quite explicitly backed by an implementation-managed ring buffer. But that ring buffer is not tied to a GPUBuffer; it's only extra GPUDevice memory, and it doesn't need to be persistent. The implementation can destroy the ring buffers under memory pressure, while extra backings for a GPUBuffer would have to stay; otherwise you could get an OOM trying to map the buffer.
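To make the distinction concrete, here's a conceptual sketch of such an implementation-managed staging ring (illustrative only, not any browser's actual code):

```js
// Device-wide scratch memory for writeBuffer(): not tied to any GPUBuffer,
// transient, and discardable under memory pressure.
class StagingRing {
  constructor(size) {
    this.size = size;
    this.head = 0;
    this.backing = new ArrayBuffer(size); // stands in for shmem/staging memory
  }
  // Reserve byteLength bytes, wrapping at the end. A real implementation
  // would wait on GPU fences before reusing a wrapped-over region.
  allocate(byteLength) {
    if (byteLength > this.size) throw new Error('allocation too large');
    if (this.head + byteLength > this.size) this.head = 0;
    const offset = this.head;
    this.head += byteLength;
    return new Uint8Array(this.backing, offset, byteLength);
  }
  // Under memory pressure the whole ring can be dropped and recreated later;
  // a persistent per-GPUBuffer staging backing couldn't be, or a later map
  // of that buffer might OOM.
  destroy() { this.backing = null; }
}
```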

@benvanik

benvanik commented Dec 10, 2021

I'm not saying I'm for or against any proposal here, only voicing that I agree that what's in the API today does not fully satisfy workflows with dynamic data moving across host/device, and that this will be a performance issue in real-world usage. In compute workloads, getting data back from the device is a major part of the upload -> compute -> download flow, and until we have a GPUQueue.readBuffer (🙏 please!) this results in non-trivial complexity and bloat in user code.

But it is not tied to a GPUBuffer, it's only GPUDevice extra memory.

👍 IMO having the implementation manage the ringbuffer with writeBuffer/readBuffer and incurring a copy is acceptable if the alternative is managing exclusively upload or download buffers in user code (as I found the spec detail that indicates a buffer cannot be both, resulting in user staging pools needing double the memory for bidi transfer). This way multiple libraries trying to perform upload/download are not each keeping around large GPUBuffers for this purpose.
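For what it's worth, a GPUQueue.readBuffer symmetric with writeBuffer() might look something like this (entirely hypothetical; no such method exists in WebGPU):

```js
// Hypothetical API: resolves once the internal readback copy has completed,
// writing the bytes into the caller-provided `result` buffer.
const result = new ArrayBuffer(size);
await device.queue.readBuffer(storageBuffer, /* bufferOffset */ 0, result);
```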

@litherum
Contributor Author

you could get an OOM trying to map the buffer.

This is no worse than the possibility of an OOM when trying to writeBuffer(), though...

@Kangz
Contributor

Kangz commented Dec 13, 2021

Sure, that's a possibility as well, although implementations could stall to free staging space if they really wanted to.

The point here wasn't that writeBuffer can or cannot OOM; it's that you want to give developers a way to do transfers without the possibility of triggering an unfixable OOM. Buffer mapping can do that, since OOM only happens at buffer creation. If you choose to make staging/readback buffers transient, then you lose this control in the application, because you can OOM on mapAsync as well. It is possible to decide to do that, but we need to be cognizant of all the tradeoffs we'd be making. In this whole comment thread I suggest it is a bad idea for many reasons, including this OOM issue.

@kvark
Contributor

kvark commented Dec 14, 2021

I agree with the concerns about managing temporary buffers for mapping expressed by @Kangz. Their lifetime is attached to the mapping, and it's worse than the ring buffer we currently have for writes.

I also agree with @litherum that it would be good to be able to avoid copies on systems that can do that.
An optional feature for UMA architectures seems like the right way to proceed. It would basically lift the restriction on usages for buffers, allowing MAP_READ+MAP_WRITE+anything else.

As for the queue argument for mapping, this correlates with #1977 (comment). It's probably needed.

@Kangz
Contributor

Kangz commented Dec 14, 2021

I'm happy to help by writing an optional feature for UMA that allows MAP_READ + WRITE.

@litherum
Contributor Author

It would be pretty unfortunate if authors had to opt-in to avoid using 2x memory on UMA machines.

@kainino0x
Contributor

I don't think anyone is disagreeing about that, but if we're going to avoid it we're going to need a proposal that works. I don't think we're getting any closer to one.

@kdashg
Contributor

kdashg commented Mar 2, 2022

WebGPU meeting minutes 2022-02-23
  • KN: nobody satisfied with current state, but nobody has a better idea. Everyone's resigned to this fate except Myles. :)
  • KG: one thing that has changed since it was first discussed - more common today than 2 years ago to get adapters that let you map CPU read and host read/device use - used to be UMA archs only, and some AMD cards - but has changed now.
  • KN: right. Intel doesn't even have some of these options (host-coherent + device-coherent?). Think we could do this on Intel regardless.
  • KG: if something we can't support - don't want to fragment the ecosystem by making you write 2 paths. If things have changed - still do need that.
  • KN: we don't have solution for doing this underneath the hood of the application. Can do it with an extension. Would like to. Maybe we should do it for 1.0. Would need separate code path for application.
  • KR: WebGL doesn't have the ability to optimize for this and performance in this area is basically fine. I think WebGPU will also perform fine in general without this optimization, and since applications will have to add a new code path to take advantage of it, think this should be pushed out to post-V1.

@Kangz
Contributor

Kangz commented Mar 16, 2022

So here's the proposal for the extension: what I wrote above

The optional "UMA" feature could:

 - Lift the restriction for MAP_WRITE to allow any other read-only usages.
 - MAP_READ already allows all the write-only usages. But maybe more are added in the future so the extension would also lift that? Or it allows it with any other usages, but assumes there is always a UMA -> shmem copy happening in the GPU process (so that JS writes are never made visible to the GPU).

With the addition that if readonly ArrayBuffers become a thing, then we can lift all restrictions on MAP_READ (except MAP_WRITE? not sure), by making the ArrayBuffer returned by mappings for reading be readonly.

@kdashg
Contributor

kdashg commented Mar 16, 2022

WebGPU meeting minutes 2022-03-16
  • Myles has been writing lots of webgpu patches instead of thinking about this; can we defer a week?
  • CW: UMA storage buffers - I made a proposal a couple lines long. We can have a UMA extension - enable map() with writable buffers with READ_ONLY usage. And vice versa.
  • CW: If later we have read-only ArrayBuffers we can have other functionality.
  • CW: would be nice if we were able to say - you can have any usage and it just works - but not possible on D3D, and isn't best thing to do. Lots of complexity, e.g. with discrete GPU and also cross-process. Can have a memory "thing" which spans all 3 items - GPU, GPU process, renderer process. Also need consistent behaviors on all systems. Writes to JS have to be visible to JS for readable buffers. Complicated.
  • CW: that's why I think only way for proper UMA support is via an extension.
  • MM: would this extension also be present on discrete cards? And the extension would say, your writes might not be present if you read from it?
  • CW: no, behavior should be consistent always, regardless of extension being enabled. That's the main goal.
  • MM: so app needs: if (uma) { … } else { … }?
  • CW: yes. App can get best behavior on UMA and desktop - there are cases today where you can take the optimal path: buffer mapped at creation, or updating a buffer in pipelined fashion during GPU execution.
  • CW: case not handled: big buffer, need to change data after creation, but not always used by the GPU. Don't know when apps would do this. Useful to think of UMA extension because it helps that case. We should already be pretty optimal in most cases though.
  • MM: think argument makes sense. Not 100% sure I agree. First statement about 2 buffer upload mechanisms is false though - there's a third, mappedAtCreation. That would work when streaming data from CPU to GPU. Not the other way around though.
  • MM: backward direction is definitely less common. Need to do more research.
  • MM: other thing - we should try to describe somewhere that mappedAtCreation's expected to be more performant than creating buffer and mapping it.
  • CW: should be in non-normative text at least. Brandon made a best practices doc on uploading data with WebGPU. writeBuffer - but mappedAtCreation's pretty good, too.
  • BJ: that doc's in flux - please suggest improvements.
  • MM: committed to our repo?
  • BJ: not yet. Not a good time.
  • MM: link to it please?
  • BJ: will do. https://github.com/toji/webgpu-best-practices/blob/main/buffer-uploads.md
  • CW: think everyone wants to make UMA work amazingly well. But it's amazingly hard while keeping consistent behavior from the JS side, keeping D3D constraints in mind, a single source for GPUBuffer, etc. Optimizations you want to do in the browser later, too. Happy to discuss details with people. Wish we had a better story for UMA, but I can't find one.
  • MM: believe you, just don't think we should say it's impossible.
  • CW: also happy to discuss offline more. Maybe in office hours.

@Kangz
Contributor

Kangz commented Apr 25, 2022

As discussed in the meeting, moving to post-V1 polish since the only proposal so far is an optional feature.

@ErichDonGubler
Member

@Kangz: You appear to have a broken sentence here:

I think a UMA optional feature would make a

@kainino0x
Contributor

@Kangz: You appear to have a broken sentence here:

I think a UMA optional feature would make a

I think the gist was "UMA would make sense to put in an optional feature"

@kdashg
Contributor

kdashg commented Jun 8, 2023

GPU Web 2023-06-07/08 (Pacific time)
  • Recap the design constraints for this problem
  • MM: wanted to touch base before going off and doing a bunch of engineering
    • We're interested in UMA working well
    • Interested in a potential solution where the same code would "do the right thing" on UMA and non-UMA
    • This group posited that that was not possible
    • I think it might be
    • Want to nail down what the original objections were
  • KR: from our side we need enga@ and cwallez@ present for the conversation. Would like to advance this on the Github issue or mailing list.
  • Postpone for a week?
  • KG: I can try to synthesize
  • KG: on non-UMA archs you sometimes need 2 copies, and on UMA you can get to 1 copy. How to pipeline, prioritizing bandwidth/latency, is where Corentin and I ran aground trying to find a single API to do both.
  • KG: My position - if you try to figure out the API for these things, you'll either prove us wrong or right, and that's great
  • MM: that's reassuring. Think we're in a different situation now than 2021. Now we have 2 ways of getting data on the card. I'd be coming back with a 3rd way. Adding a 3rd way isn't great for the platform, but if an app cares about the tradeoffs, we'd have more options for them.
  • Continue this next week.

@MikhailGorobets

The current restriction on buffers created with map flags also causes problems on NUMA (non-UMA) architectures. Small, frequently updated uniform buffers can be stored in system memory without significant impact on performance. In addition, with the advent of Resizable BAR and SAM, it is possible to write data directly to VRAM using the CPU (we can even write textures directly to VRAM and later change the access pattern from linear to swizzled, for better bandwidth).

@kainino0x kainino0x modified the milestones: Polish post-V1, Milestone 2? Aug 15, 2023
@Kangz Kangz modified the milestones: Polish post-V1, Milestone 1 Sep 26, 2023
@kainino0x kainino0x added the api WebGPU API label Apr 30, 2024