Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device valid #4163

Merged
merged 45 commits into from
Sep 27, 2023
Merged

Device valid #4163

merged 45 commits into from
Sep 27, 2023

Conversation

bradwerth
Copy link
Contributor

Checklist

  • Run cargo clippy.
  • Run cargo clippy --target wasm32-unknown-unknown if applicable.
  • Add change to CHANGELOG.md. See simple instructions inside file.

Connections
None.

Description
This implements a simple version of "lose the device", which marks a Device as invalid, causing all subsequent operations to return an error. Nothing actually calls this yet.

Testing
This adds an integration test device_lose_then_more that attempts various operations on a lost device, which are all expected to fail.

bradwerth and others added 30 commits September 11, 2023 16:44
This tracks Device validity explicity, but it could be changed to instead
change Device Storage to Error when the Device is invalid. This also could
go further with standardizing the error reporting for an invalid Device.
This tracks device validity explicitly, and short-circuits almost all
operations on devices when they are invalid. This doesn't replace the
storage of the device with an Error, because it's important for
poll_devices to be able to monitor the devices (even if invalid) for
emptiness so they can be cleaned up.
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Connor Fitzgerald <connorwadefitzgerald@gmail.com>
Co-authored-by: Connor Fitzgerald <connorwadefitzgerald@gmail.com>
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.47 to 1.0.48.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](dtolnay/thiserror@1.0.47...1.0.48)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [serde](https://github.com/serde-rs/serde) from 1.0.186 to 1.0.188.
- [Release notes](https://github.com/serde-rs/serde/releases)
- [Commits](serde-rs/serde@v1.0.186...v1.0.188)

---
updated-dependencies:
- dependency-name: serde
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…fx-rs#4115)

Rename `wgpu_hal::vulkan::Instance::required_extensions` to
`desired_extensions`, to match its behavior. Document the function to
clarify its role.
Due to an issue with Mesa versions less than 21.2 presentation on Vulkan
was forced to Nvidia only. This in itself brought new issues around the
Nvidia driver specfic format modifers.

As of Mesa 21.2 the Intel vulkan issue is fixed. This commit enables
presentation on versions 21.2 and above for Intel.

References:
- NVIDIA/egl-wayland#72

Closes [gfx-rs#4101](gfx-rs#4101)
For each backend `blah`, log::debug/trace whether we were able to
populate `Instance::blah`.
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.105 to 1.0.106.
- [Release notes](https://github.com/serde-rs/json/releases)
- [Commits](serde-rs/json@v1.0.105...v1.0.106)

---
updated-dependencies:
- dependency-name: serde_json
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nicolas Silva <nical@fastnmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Implements timer queries via write_timestamp on Metal for encoders (whenever timer queries are available) and passes (for Intel/AMD GPUs, where we should advertise TIMESTAMP_QUERY_INSIDE_PASSES now).

Due to some bugs in Metal this was a lot harder than expected. I believe the solution is close to optimal with the current restrictions in place. For details see code comments.
Co-authored-by: Connor Fitzgerald <connorwadefitzgerald@gmail.com>
Bumps [profiling](https://github.com/aclysma/profiling) from 1.0.10 to 1.0.11.
- [Changelog](https://github.com/aclysma/profiling/blob/master/CHANGELOG.md)
- [Commits](https://github.com/aclysma/profiling/commits)

---
updated-dependencies:
- dependency-name: profiling
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [termcolor](https://github.com/BurntSushi/termcolor) from 1.2.0 to 1.3.0.
- [Commits](BurntSushi/termcolor@1.2.0...1.3.0)

---
updated-dependencies:
- dependency-name: termcolor
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Copy link
Member

@Wumpf Wumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution! But I fear this might be the wrong general direction for implementing device loss 😟

According to spec, entering the device-lost state is not directly exposed other than via a destroy call, but otherwise is an error state the device can enter.
https://www.w3.org/TR/webgpu/#lose-the-device

There is however a lost promise that we need to implement here, MDN has some example code on this what it should look like from JS https://developer.mozilla.org/en-US/docs/Web/API/GPUDevice/lost

I think we should try to handle device lost states through the HAL layer - that's where they typically occur. Device loss via destroy is harder to map to that ofc, but we should investigate if we can "poison" the hal device so that this actually has the desired effect through the entire stack.

@nical
Copy link
Contributor

nical commented Sep 22, 2023

I think we should try to handle device lost states through the HAL layer - that's where they typically occur. Device loss via destroy is harder to map to that ofc, but we should investigate if we can "poison" the hal device so that this actually has the desired effect through the entire stack.

I think that we should be able to implement most of device.destroy and device loss via the same infrastructure in wgpu-core, it would be simple to keep the associated validation there. Also, all of the asynchronous deallocation logic is in core so I don't think we can escape having the destroyed state handle there.

What advantages do you expect from moving it in hal?

@Wumpf
Copy link
Member

Wumpf commented Sep 22, 2023

Generally, a device loss event is an actual thing that happens to the hal device; different implementations have different error states. Relatedly, I think destroying/loosing a device should actually have an actual effect that is communicated with the driver.

@Wumpf
Copy link
Member

Wumpf commented Sep 22, 2023

I fear if we put too much logic in wgpu-core, we have two sources of device-lost state: The hal device that is actually in a bad state / deallocated / etc. and some flag on wgpu-core. So to keep this in sync we could instead solely rely on hal owning this state.

@nical
Copy link
Contributor

nical commented Sep 22, 2023

It looks like you are looking at this primarily through the lense of device loss, while Brad's priority here is to get device.destroy working (we direly need it for CTS runs to not randomly OOM). The effect of that will be to clean up all of the device's allocated resources (while the handles are still alive, hence the need for validation everywhere which is in core) but that's not reflected in the PR yet.

I haven't looked at the PR in details yet but perhaps some of the "lose" terminology in there could be replaced by destroy to make that clearer.

We could store the device's state in the hal structure, but it would still be mostly accessed by the validation in wgpu-core.

@bradwerth
Copy link
Contributor Author

This is definitely a partial implementation of "lose the device" which is itself a partial implementation of device.destroy. From my read of the spec, we lose the device in 3 ways:

  1. It's the final steps of device.destroy.
  2. As needed from each backend.
  3. As part of the error response in adapter.request_device. This case seems least important since it already returns an error and not a device, so there's nothing for the caller to check a lost promise against. But if there's internal state cleanup associated with "lose the device" it happens in this case, too.

In this patch, my goal was to support these cases at least in theory, but since the integration test is checking the outcome of lose the device, it makes it look like a new callable function on the device object itself. It also doesn't attempt to handle the lost promise resolution for two reasons: because of my lack of familiarity with the direct and web protocols, and the implementation in Firefox will be external to wgpu. Regarding the direct and web implementations of the lost promises, I should either add much better TODOs to document the gaps, or fill the gaps with an implementation.

@jimblandy
Copy link
Member

@Wumpf I think you're right to ask how this code is going to interact with device loss reported from the underlying platform APIs. This PR needs to include a clear plan for how that is going to be incorporated.

As @nical said, our immediate goal is to implement something corresponding to the spec's GPUDevice.destroy.

However, in doing so we've run into an unfortunate part of wgpu's current architecture.

When the hal reports a lost device, WebGPU says we should execute the "lose the device" steps you mentioned above. The first step there is "Make device invalid."

Everywhere else in wgpu-core treats WebGPU invalidity by placing an Element::Error in the appropriate Registry field of the Hub. However, that was designed for objects that are invalid from the point of creation, not objects that become invalid after having previously been valid. If we were to implement GPUDevice.destroy by removing the Device from the Hub, it would be impossible for wgpu-core to continue to process the queue, as required by steps 4 and 5 of "lose the device":

  1. Complete any outstanding mapAsync() steps.

  2. Complete any outstanding onSubmittedWorkDone() steps.

In gpuweb/gpuweb#4294 I confirmed with @Kangz that we're not supposed to be able to make further requests of the device after losing it, so it does seem important that the device be marked invalid at the indicated point in execution.

@bradwerth explained this to me, and said that adding the valid flag to the wgpu_core Device seemed like it would be required in order to prevent the Device from handling new user requests, while still being able to process the queue. We agreed it was ugly and unfortunate, but we couldn't see a better way to do it.

I believe #3626 will change wpgu-core to use Arc for all internal references to devices, so when that lands we may be able to remove the valid flag, and implement "lose the device" simply by removing the Device's entry from its Registry. Then all API requests will automatically be unable to access the Device, and we can remove the valid flag.

@bradwerth
Copy link
Contributor Author

Presuming that the wgpu-core implementation of "lose the device" can be correctly signaled by the driver, then I don't think there is any danger of inconsistent state. Since wgpu-core device has the valid/invalid state, and prevents further requests from reaching the hal device, then I don't believe there is double-state that can become inconsistent. The hal device is never again accessed.

@jimblandy
Copy link
Member

jimblandy commented Sep 22, 2023

Presuming that the wgpu-core implementation of "lose the device" can be correctly signaled by the driver, then I don't think there is any danger of inconsistent state. Since wgpu-core device has the valid/invalid state, and prevents further requests from reaching the hal device, then I don't believe there is double-state that can become inconsistent. The hal device is never again accessed.

This isn't quite right. For example, Vulkan requires that all the resources created from a device be freed before the device itself is freed, and this requirement holds even if vkQueueSubmit returns VK_ERROR_DEVICE_LOST. The architecture of wgpu is that wgpu-hal is not responsible for tracking a device's resources and cleaning them up; enforcing anything but the simplest invariants is wgpu-core's role. So when wgpu_hal::vulkan::Queue::submit calls ash::device::Device::queue_submit (the Rust binding for vkQueueSubmit) and that returns vk::Result::ERROR_DEVICE_LOST, it just returns it directly. Then it's up to wgpu_core to go about cleaning up the resources - and that will definitely entail continuing to use the hal device.

It looks to me like wgpu_core doesn't implement this cleanup, however. EDIT: Operations will continue to fail after device loss, and resources get cleaned up when the user drops the device.

Copy link
Member

@Wumpf Wumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All very good arguments why we should get this started with a validity flag. Alright, let's do it then and evolve it as we add lost device handling on HAL!

However, (digging a bit more into the code now) we should try a bit harder not to expose device_lose on the user facing wgpu layer as the spec doesn't include it. I'm also very skeptical of having it as a non public function on wgpu-core (wgpu-core is usually just an implementation of the public interface), but I can see how that might be useful in some places.
It seems pretty clear that destroy is what serves the actual purpose for the user. If the desire is to move fast on giving the user the ability to cause a device loss, we should instead expose destroy already and note down that its implementation is still work in progress - causing the device invalidation but no other sideeffects like resource destruction yet :)
Hope that makes sense!

CHANGELOG.md Outdated Show resolved Hide resolved
.cargo/config.toml Outdated Show resolved Hide resolved
tests/tests/device.rs Outdated Show resolved Hide resolved
tests/tests/device.rs Show resolved Hide resolved
tests/tests/device.rs Show resolved Hide resolved
wgpu/src/lib.rs Outdated Show resolved Hide resolved
wgpu/src/backend/web.rs Outdated Show resolved Hide resolved
wgpu-core/src/device/global.rs Show resolved Hide resolved
Copy link
Member

@Wumpf Wumpf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! Thanks for adding all those verbose & helpful comments in there!

@Wumpf Wumpf merged commit 57f8757 into gfx-rs:trunk Sep 27, 2023
20 checks passed
@bradwerth bradwerth deleted the deviceValid branch September 27, 2023 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.