
Add Reference / Release for all objects #15

Merged · 1 commit into webgpu-native:main from ref-release · May 24, 2023

Conversation

@Kangz (Collaborator) commented Oct 17, 2019

No description provided.

@Kangz requested a review from kvark on October 17, 2019.
@grovesNL (Member)

Is this blocked by #9?

@Kangz (Collaborator, Author) commented Oct 17, 2019

It is. We discussed it a bunch face-to-face; I remember thinking I should do the pull request after that, but I don't remember the details of the discussion. That was a crazy week :)

@Kangz changed the base branch from master to main on July 2, 2020.
@pythonesque commented Aug 12, 2021

I'm close to finishing a change that switches wgpu-rs to use reference counting; at that point, both implementations should have very similar reference-semantics requirements. I have worked to remove Arc wherever I can and to do as little synchronization as possible internally, and I've also worked to keep the lifetime requirements as cheap as reasonably possible in terms of overhead. I also believe that future optimizations, if applicable, will mostly be about avoiding internal reference counting during command buffer submission, which isn't going to be exposed to the user. So, here are my thoughts:

In my implementation, the boxed / owned objects will end up being (from what I can tell so far, and I am close to done): adapter, queue, render / command / bundle encoders, shader modules, and command buffers. These should not have exposed retain/release. Everything else should.

The rationale:

Shader modules are only used read-only, and are not retained after pipeline layout creation (so they need no reference counting at all, not even internal reference counting).

Adapters, according to the spec, should only be used prior to device initialization and then not used again; we can take advantage of that. On native, adapters will be implicitly destroyed when they are used to create a device, therefore they don't need reference counting.

Command encoders must already be externally synchronized and are implicitly destroyed on finish. I believe this is how it works today in wgpu-rs, but if it doesn't, this is how it should work--you can't finish a command encoder more than once. So they don't need reference counting.

Command buffers, similarly, can only be used once, and you can't do anything with them except submit to a queue. So they will be implicitly destroyed on queue submission (the implementation may keep them alive for various reasons but this doesn't need to be exposed to the user, and might not be done with reference counting). Therefore, they also don't need reference counting.

Finally, queues. Queues unfortunately do need some sharing and internal synchronization; however, this is only needed for the "raw" queue. This is because surface presentation does not take an explicit queue, so the surface must hold a reference to the queue internally, and the Vulkan spec requires queues to be externally synchronized for both presentation and submission. However, this should ideally only be needed when actually running the native present / submit commands. The only case where tracking or user information needs to be updated on a queue is buffer mapping at initialization, which I would like to explicitly take a queue (there is one other place that uses the implicit device queue, which is command encoder creation and which IMO should be performed on the queue rather than the device, but that's not relevant here as it only needs read-only access to the inner queue).

So, tl;dr: I believe there is a very realistic path forward to where WebGPU external (not internal) queues don't need reference counting, so we should not add reference counting to them. The other main consequence of this from a C perspective is that operations that explicitly use a queue should be externally synchronized; otherwise it's basically the same as it is currently.
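
To make the proposed split concrete, here is a minimal usage sketch from the caller's side, assuming entry-point names follow this PR's Reference / Release convention; the exact signatures, descriptor fields, and nullability of descriptors are illustrative assumptions, not the agreed header:

```c
#include <webgpu/webgpu.h>  // assumed header path

// Sketch of the proposed model: refcounted resources (Buffer, Texture, BindGroup, ...)
// expose Reference/Release, while encoders and command buffers are single-use and
// consumed implicitly by Finish/Submit.
static void record_and_submit(WGPUDevice device, WGPUQueue queue) {
    WGPUBufferDescriptor bufDesc = {
        .size = 256,
        .usage = WGPUBufferUsage_Vertex | WGPUBufferUsage_CopyDst,
    };
    WGPUBuffer buffer = wgpuDeviceCreateBuffer(device, &bufDesc);
    wgpuBufferReference(buffer);   // hypothetical: a second owner keeps the buffer alive

    WGPUCommandEncoder enc = wgpuDeviceCreateCommandEncoder(device, NULL /* default descriptor, assumed nullable */);
    // ... record passes that reference `buffer` ...
    WGPUCommandBuffer cmd = wgpuCommandEncoderFinish(enc, NULL);
    // Under the proposal, `enc` is implicitly destroyed by Finish and `cmd` is
    // implicitly destroyed by Submit, so neither needs Reference/Release.
    wgpuQueueSubmit(queue, 1, &cmd);

    wgpuBufferRelease(buffer);     // hypothetical: drop one of the two references
    wgpuBufferRelease(buffer);     // drop the last one; the buffer may now be freed
}
```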

@pythonesque commented Aug 12, 2021

Additional note from discussion with @kvark , who asked to really confirm there are no future compatibility constraints: it is possible that we could avoid reference counting on QuerySet, ComputePipeline, and RenderBundle, even though we can't do it right now, so we should also not provide inc/dec methods on those. And I haven't gotten to Surface yet, so hold off on adding inc/dec methods for that until we confirm whether an Arc-free implementation is possible.

Essentially, anything that is referenced by a RenderBundle is going to have to be reference counted, because there's no conceivable future where render bundle lifetimes can be tracked by anything sane that's not tracing GC (which is a no-go for C) or reference counting. This is because they're expected to live across multiple queue submissions, so there's no convenient way to figure out when they can be destroyed. This includes anything referenced by either a BindGroup or RenderPass, so basically any resource that hasn't already been explicitly mentioned (except maybe Surface--I still need to work that one out; but in any case Surface/Swapchain didn't even exist in this original proposal).

However, for ordinary command encoders, one can imagine WebGPU changing to add APIs allowing avoiding reference counting and tracking by submission index instead. QuerySets are only directly referenced by ordinary command encoders, so it is conceivable that in the future we would not need refcounting there. Similarly, render bundles can't (yet?) reference other render bundles, so it's also conceivable for them. And of course render bundles can't reference compute pipelines, so they also conceivably don't need refcounting (but... it seems pretty likely that a ComputeBundle will be added at some point, so that one seems like a technicality). So that's why if we really want to be future compatibility safe, we shouldn't have inc/dec for them.

@kainino0x (Collaborator)

For non-refcounted types, how do they get freed? Would they have a separate free method or would they simply have a single-shot release method with no reference method?

Also, if we plan to omit ref/release methods for some types now and that might change post-1.0, it needs to be non-breaking to add them later.

@kainino0x (Collaborator)

All in all, having an inconsistent memory management interface across the objects seems unfortunate. I understand the reasoning behind why some objects need internal refcounting and others don't, but it's not strictly necessary that the external refcounting and the internal refcounting be the same thing. (I believe Dawn uses the same refcount field for internal and external references, but that doesn't have to be the case.) We can have internal refcounting without refcounting existing in the C API at all.

@pythonesque

> For non-refcounted types, how do they get freed? Would they have a separate free method or would they simply have a single-shot release method with no reference method?

Single-shot release.

> Also, if we plan to omit ref/release methods for some types now and that might change post-1.0, it needs to be non-breaking to add them later.

It would be (that's the intent of what I've been talking about).

> All in all, having an inconsistent memory management interface across the objects seems unfortunate. I understand the reasoning behind why some objects need internal refcounting and others don't, but it's not strictly necessary that the external refcounting and the internal refcounting be the same thing. (I believe Dawn uses the same refcount field for internal and external references, but that doesn't have to be the case.) We can have internal refcounting without refcounting existing in the C API at all.

This is true, which is why I'm trying to differentiate between "internal refcounting that's an implementation detail" and "internal reference counting that's pretty much inevitable for any C API." For the latter, not exposing the refcounting is just strictly worse for performance, especially because we already need an allocation for every object for other reasons in C. But if people would prefer the simplicity of uniform resource management for every object, and don't want refcounting exposed, that's fine too.
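
As a hedged illustration of the "strictly worse for performance" point: if the C API only offers a single-shot release, an application that wants shared ownership has to layer its own counter (and usually its own allocation) on top of whatever the implementation already keeps internally. The wrapper below is purely hypothetical application code, and the single-shot release name is an assumption:

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <webgpu/webgpu.h>  // assumed header path

// App-side wrapper that duplicates refcounting the implementation may already
// be doing internally for WGPUBuffer.
typedef struct {
    WGPUBuffer handle;
    atomic_int refs;
} AppBuffer;

static AppBuffer *app_buffer_wrap(WGPUBuffer handle) {
    AppBuffer *b = malloc(sizeof *b);
    b->handle = handle;
    atomic_init(&b->refs, 1);
    return b;
}

static void app_buffer_retain(AppBuffer *b) {
    atomic_fetch_add_explicit(&b->refs, 1, memory_order_relaxed);
}

static void app_buffer_release(AppBuffer *b) {
    if (atomic_fetch_sub_explicit(&b->refs, 1, memory_order_acq_rel) == 1) {
        wgpuBufferRelease(b->handle);  // assumed single-shot release entry point
        free(b);
    }
}
```

If the header exposed the implementation's own Reference/Release instead, the extra allocation and the second atomic counter would be unnecessary.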

@pythonesque commented Aug 12, 2021

And I also don't think it's super random what resources would have refcounting exposed (I'm including ComputePipeline even though strictly speaking it doesn't have a render bundle equivalent yet).

These intrinsically need refcounting (because they're transitively referenced by render bundles):

  • Device
  • ComputePipeline, RenderPipeline, PipelineLayout, BindGroupLayout
  • BindGroup, Buffer, TextureView, Texture, Sampler

And here's what doesn't intrinsically need refcounting (everything else):

  • Adapter, Surface, Queue
  • QuerySet
  • CommandEncoder, RenderBundle, RenderBundleEncoder, RenderPassEncoder, ComputePassEncoder, CommandBuffer

Subjectively, this all makes a lot of sense, and the only weird one is QuerySet (I'm not actually sure why QuerySet can't be used in a render bundle, but I'm assuming there's a good reason?). I don't think I would be terribly confused about what needs reference counting and what doesn't.
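
In header terms, the split above might look roughly like the following; the names are modeled on this PR's Reference / Release convention, and every declaration here is a hypothetical sketch rather than the agreed API:

```c
// Intrinsically refcounted (transitively reachable from a render bundle):
WGPU_EXPORT void wgpuBufferReference(WGPUBuffer buffer);
WGPU_EXPORT void wgpuBufferRelease(WGPUBuffer buffer);
WGPU_EXPORT void wgpuTextureReference(WGPUTexture texture);
WGPU_EXPORT void wgpuTextureRelease(WGPUTexture texture);
WGPU_EXPORT void wgpuBindGroupReference(WGPUBindGroup bindGroup);
WGPU_EXPORT void wgpuBindGroupRelease(WGPUBindGroup bindGroup);
// ... Device, Sampler, TextureView, pipelines and layouts likewise ...

// Not intrinsically refcounted: single-shot release only, no Reference.
WGPU_EXPORT void wgpuCommandEncoderRelease(WGPUCommandEncoder commandEncoder);
WGPU_EXPORT void wgpuQuerySetRelease(WGPUQuerySet querySet);
WGPU_EXPORT void wgpuRenderBundleRelease(WGPURenderBundle renderBundle);
```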

@kainino0x (Collaborator)

With the single-shot release design where we can easily add refcounting to an object later, I am relatively comfortable making it not totally uniform across the API. The difference between "refcounted" and "refcounted with a maximum count of 1" is nice and small.
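
A quick sketch of why this is non-breaking, under the same hypothetical names as above: a caller that never calls Reference and releases each object exactly once behaves identically whether Release is a single-shot free or a decrement to zero.

```c
#include <webgpu/webgpu.h>  // assumed header path

// Valid under both semantics: exactly one release per created object, no Reference calls.
static void use_sampler_once(WGPUDevice device) {
    WGPUSamplerDescriptor desc = {0};  // fields elided; a real descriptor sets filters, address modes, etc.
    WGPUSampler sampler = wgpuDeviceCreateSampler(device, &desc);
    // ... bind and use the sampler ...
    wgpuSamplerRelease(sampler);  // single-shot free today, refcount 1 -> 0 later: same observable result
}
```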

Anyway, I'd like to let Corentin have the actual opinions when he's back and has time to look at this.

@kainino0x (Collaborator)

> For the latter, not exposing the refcounting is just strictly worse for performance, especially because we already need an allocation for every object for other reasons in C.

Just so I'm clear: What makes it strictly worse? Is it that if the client application wants to use refcounting, they'll have to use their own on top of the internal one?

@grovesNL (Member)

For now, maybe we could try to limit the number of functions we add on top of the browser API? We could start with single-shot release/free/drop for all objects and consider exposing reference/retain functionality later.

If we wait to expose reference, we could look at some real use cases to better understand refcounting design trade-offs, e.g. typical performance improvements of shared refcounts for expected usage of WebGPU, effects on developer experience, whether many libraries would benefit from this, whether the increased API surface is worth it, etc.

@pythonesque commented Aug 15, 2021

> Just so I'm clear: What makes it strictly worse? Is it that if the client application wants to use refcounting, they'll have to use their own on top of the internal one?

Yes, that.

And just to be clear--I'd be totally fine not exposing these initially since it's backwards compatible to do so later. I am more saying, for the particular types I listed, there isn't really any conceivable technical decision we could make that would constrain how we could optimize the C interface, so exposing them wouldn't be a backwards compatibility hazard.

Of course as you've both mentioned, there are other concerns besides backwards compatibility here (like API surface) so something like user studies (e.g. to see how many end up wrapping the types in refcounts themselves) would be beneficial. I know a huge percentage of all wgpu-rs users do this in Rust, but a lot of that is due to parallel rendering which AFAIK isn't sound in Dawn at the moment.

@Kangz (Collaborator, Author) commented Aug 17, 2021

I think there's value in having a uniform API surface for memory management, with as few weird special cases as possible. Metal has some things that are autoreleased and some that aren't, and it has been a constant pain to deal with. Special-casing encoders might be fine though.

Since most of the objects are refcounted in wgpu, especially the high-frequency ones, what is the concern with just adding refcounting to all of them? I understand minimalism is good, but I think it is dwarfed by consistency for developers. And the additional code and overhead for objects like QuerySet, ShaderModule and friends is really tiny.

> Of course as you've both mentioned, there are other concerns besides backwards compatibility here (like API surface) so something like user studies (e.g. to see how many end up wrapping the types in refcounts themselves) would be beneficial. I know a huge percentage of all wgpu-rs users do this in Rust, but a lot of that is due to parallel rendering which AFAIK isn't sound in Dawn at the moment.

Refcounting has been super useful in Chromium for at least Device and Texture to handle the mess of lifetimes in Chrome's guts. Without refcounting we would have had to box the handles in the equivalent of an Rc (and most likely even Arc).

@Kangz mentioned this pull request on Aug 23, 2021.
@litherum

The last comment on this is almost 3 months old. A resolution here would be helpful.

@Kangz (Collaborator, Author) commented Nov 18, 2021

To try to make forward progress, I updated this PR to add reference / release for all objects except encoders, which get a single delete function. I think there is value in having CommandBuffer / RenderBundle support multiple refs because they are reusable (or potentially reusable in the future), so they don't require single ownership.

@kvark (Collaborator) commented Nov 22, 2021

@litherum this issue is as old as the project itself!

What users require today is being able to delete objects. #100 adds the ability to drop objects, which addresses this need. I think the semantics of this operation are non-controversial; only the naming is. And the naming is debated because it may or may not imply refcounting, which we haven't yet found consensus on.

Our long-standing position has been that ref-counting is an implementation detail, and we'd not want it exposed to users. Some objects are refcounted in our implementation (see @pythonesque's elaboration above), some aren't. And even for those that are, it's easier to build a solid implementation when we have full control of those reference/release operations internally.

So we think that we should proceed with #100 or a variation of it, since it has semantics we all agree on, and it addresses an immediate need. Then there is a big transition that needs to happen internally in wgpu, led by @pythonesque , before we can conclude the refcounting story.

@Kangz (Collaborator, Author) commented Nov 22, 2021

And that's why we're waiting. IMO it's misleading to add "drop" to the API when we plan to revisit it in the future; better to say "please check what your local WebGPU implementation exposes". @pythonesque described that most objects in wgpu need to be refcounted, including the most performance-sensitive ones, so I don't see why we can't make the others refcounted just for convenience. Not everyone has the luxury of a language with a borrow checker that lets them pass references to stack variables everywhere.

@kvark (Collaborator) commented Nov 22, 2021

We would like to have a webgpu header that is actually usable, one that both wgpu-native and Dawn implement. The earlier, the better. Having a way to release objects is a requirement for this. So I don't think "that's why we're waiting" is the best way forward. If you don't like the "drop" method, we could agree on it being "release", just without the corresponding acquire.

@litherum commented Nov 30, 2021

"drop" means "I'm done with it; feel free to delete it whenever you feel like deleting it"? If you "drop" something while it has an outstanding callback, will the callback be called?

@kvark (Collaborator) commented Nov 30, 2021

For us "drop" essentially means "this handle is done". That is, the object itself may very well live longer, issue pending callbacks, etc, but the user will not use this handle directly any more.

@austinEng (Collaborator)

> issue pending callbacks

Oh interesting, our implementation calls all the callbacks with some failure/unknown status when "drop" happens. I don't remember right now whether there was a reason behind this. @Kangz ?

@benvanik commented Dec 3, 2021

Also hopeful for some progress here soonish - I'm trying to use webgpu.h and finding the need for this when doing anything beyond one-shot hello-world demos where unbounded growth is fine.

@kainino0x (Collaborator) commented Dec 4, 2021

"drop" means "I'm done with it; feel free to delete it whenever you feel like deleting it"? If you "drop" something while it has an outstanding callback, will the callback be called?

All callbacks must be called "eventually". This is really important because of manual memory management: an outstanding callback should be able to keep memory alive and free it on callback (even if the callback "fails"). (Related note in the last paragraph below.)


After a short discussion with Austin, he has me agreeing that the liveness semantics in this C API should match the JS API, so that e.g. blink::GPUBuffer holds exactly one WGPUBuffer and can drop it when GC'd.

> Oh interesting, our implementation calls all the callbacks with some failure/unknown status when "drop" happens. I don't remember right now whether there was a reason behind this.

Initially this seemed surprising to me, but iiuc the difference just boils down to when the callback gets "failed" (called with failed status): inside drop() (== refcount reaching 0), inside destroy(), or inside ProcessEvents(). IMO in all cases it should happen in ProcessEvents. But even though it's deferred, that doesn't mean we actually have to keep the originating object alive, e.g.:

  • A WGPURequestDeviceCallback should be successfully handled even if the WGPUAdapter reference has already been dropped. This should be a common pattern in JS, and a semi common pattern in C.
  • A WGPUBufferMapCallback should be successfully handled even if the WGPUBuffer reference has already been dropped. This isn't a common pattern in any case, but should work in JS, so it should work in C. In JS, you would keep the mapping alive as long as any ArrayBuffer returned from getMappedRange is alive (no choice here because otherwise GC becomes observable). In C there is no destructor on the raw pointer returned by wgpuBufferGetMappedRange, so doing this would constitute a programming error resulting in a memory leak.

IMO it should never be "valid" (defined behavior) for any wgpu function to be called with any WGPU object argument that has been dropped by the client program. (In Dawn for example I would like a runtime debug ASSERT for this (because due to implementation details the pointer technically stays valid as long as there are internal refs) - this way ownership errors are reliably caught in Debug+ASAN builds.)

Therefore without a ref to WGPUInstance you can't call wgpuInstanceProcessEvents, so you wouldn't be able to get any callbacks to be called. It doesn't really matter whether any objects have internal refs back to WGPUInstance.

Since we guarantee all callbacks will be called, upon dropping the last ref to WGPUInstance all callbacks should be failed. (This is better than debug asserting there are no pending callbacks, because it's not easy to know whether there are any.)
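
A hedged sketch of the first bullet above, written against the callback-style entry points of the webgpu.h of this era (signatures may differ slightly; the adapter release name is the one this PR proposes): the callback owns its userdata and frees it whenever it eventually runs, even though the adapter reference is dropped immediately after the request.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <webgpu/webgpu.h>  // assumed header path

// Guaranteed to be called eventually (possibly with a failure status), so it
// can safely own and free `userdata`.
static void on_device(WGPURequestDeviceStatus status, WGPUDevice device,
                      char const *message, void *userdata) {
    char *tag = userdata;
    if (status != WGPURequestDeviceStatus_Success) {
        fprintf(stderr, "[%s] requestDevice failed: %s\n", tag,
                message ? message : "(no message)");
    }
    (void)device;  // a real program would store or release the device here
    free(tag);     // safe because the callback is guaranteed to run eventually
}

static void request_device_and_drop_adapter(WGPUAdapter adapter) {
    char *tag = strdup("main-device-request");
    wgpuAdapterRequestDevice(adapter, NULL /* default descriptor, assumed nullable */,
                             on_device, tag);
    // Per the semantics above, dropping the adapter reference here must not
    // prevent the callback from eventually being delivered (e.g. from
    // wgpuInstanceProcessEvents), nor force it to fail.
    wgpuAdapterRelease(adapter);  // release name per this PR's proposal
}
```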

@kainino0x (Collaborator)

Oh yeah, and a further thing: since I think C drop should be 1:1 with JS object garbage collection, there should be no observable effects from dropping any object. For example:

  • Dropping an object shouldn't result in callback failure
  • Dropping a device shouldn't destroy a buffer which would result in map callback failure

Dropping instances may be a special case. It's impossible in JS so it doesn't present issues implementing the JS API on top of the C API. And if you drop the instance you can't get any callbacks so relatively little is observable anyway. (The return values of wgpuBufferGetMappedRange and create/finish calls are examples of things that are still observable.)

@kainino0x (Collaborator) commented Dec 4, 2021

> In JS, you would keep the mapping alive as long as any ArrayBuffer returned from getMappedRange is alive (no choice here because otherwise GC becomes observable).

Which would be implemented by having the ArrayBuffer shared-own its underlying memory, either indirectly by shared-owning a WGPUBuffer, or by some other mechanism (like shared-owning the IPC shared memory).

@Kangz (Collaborator, Author) commented Dec 6, 2021

+1 to what Kai explained. Ideally webgpu.h implements the JS semantics where you can't observe GC happening, so callbacks don't get called on user-side drop, but only when they would have normally been called. (JS calling .destroy can cancel some callbacks, but that's well defined and doesn't expose GC.)

The Unknown status was also meant to deal with the network transparency of WebGPU. If the GPU process gets disconnected, the callbacks should still be called, but with Unknown. Though maybe it could just be DeviceLost for both an actual device loss and the GPU process crashing.

@litherum

Are there any updates? It's been about 4 months.

@Kangz (Collaborator, Author) commented Mar 30, 2022

FYI @jimblandy, this is the issue I mentioned.

@Kangz (Collaborator, Author) commented May 22, 2023

@kainino0x @cwfitzgerald PTAL again at the last commit there!

@kainino0x removed the request for review from kvark on May 22, 2023.
@kainino0x (Collaborator) left a review comment:

Last commit LGTM; one comment on an earlier one. EDIT: wrong PR for that comment.

@kainino0x approved these changes on May 22, 2023.
@kainino0x (Collaborator)

Cleanly rebased past #182, landing now.

@kainino0x merged commit 2451303 into webgpu-native:main on May 24, 2023.
@Kangz deleted the ref-release branch on May 24, 2023.
Labels: lifetimes (Lifetimes of object and memory allocations)