Tracing Infrastructure #619

kvark · 2020-04-28T01:27:04Z

Closes #289
Downstream changes: gfx-rs/wgpu-rs#291 gfx-rs/wgpu-native#17
Note: close to half of the new LOCs belong to Cargo.lock, which we are now including since we have a binary target crate player.

Why do we need that?

Have you ever seen a case that doesn't work for somebody? Often, reproducing the complete setup requires elaborate toolchains, external resources, and closed-source code. Even with all of that, the difference between platforms may hide the issue. RenderDoc or Metal captures do not help, since they are extremely driver/hardware-dependent.

This problem is amplified for the WebGPU implementation in Gecko. Some workloads are rendered incorrectly, and debugging them within the browser becomes too much of a challenge.

Temporary limitations (can be fixed later):

Replaying has to happen on the same backend. This is easy to work around by mass-replacing the backend in the trace (e.g. s/Vulkan/Metal/g).
Buffer readbacks are not recorded.
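
The backend workaround above can be sketched in a few lines of Rust. This is only an illustration: `retarget_backend` and `retarget_trace_file` are hypothetical helper names, not part of this PR, and the substitution is purely textual, so it would also rewrite a label that happens to contain the backend name.

```rust
use std::{fs, io, path::Path};

/// Rewrites every occurrence of one backend name in the trace text.
/// Equivalent to `s/Vulkan/Metal/g`: a plain textual substitution.
fn retarget_backend(trace: &str, from: &str, to: &str) -> String {
    trace.replace(from, to)
}

/// Applies the substitution in place to `<trace_dir>/trace.ron`.
/// (Hypothetical helper; the file name comes from the PR description.)
fn retarget_trace_file(trace_dir: &Path, from: &str, to: &str) -> io::Result<()> {
    let path = trace_dir.join("trace.ron");
    let text = fs::read_to_string(&path)?;
    fs::write(&path, retarget_backend(&text, from, to))
}
```

Calling `retarget_trace_file(dir, "Vulkan", "Metal")` before launching the player would let a trace captured on Vulkan replay on Metal, assuming nothing else in the trace is backend-specific.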

Tracing allows us to swiftly isolate and reproduce rendering issues. We can now consider a Warden-like testing and benchmarking framework built around the traces.

What the hell is it?

This is a state-of-the-art tracing infrastructure built into wgpu-core. It traces all the commands and resource changes, very similarly to apitrace. The tracing code is compiled in by the "trace" feature, and actual recording is enabled per device at instantiation, so it is controlled at run-time. There is no run-time overhead when the tracing feature is compiled in but left disabled.
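
A minimal sketch of the run-time side of that design, assuming a recorder stored as an `Option` on the device. All names here (`Action`, `Trace`, `Device::new(enable_trace)`) are hypothetical stand-ins rather than the actual wgpu-core types, and the compile-time gate is only indicated in a comment so the snippet builds on its own. In this sketch the disabled path reduces to a single `None` check.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for one recorded action.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    CreateBuffer { id: u32, size: u64 },
    Submit(u32),
}

/// Hypothetical in-memory recorder; a real one would write to disk.
#[derive(Default)]
struct Trace {
    actions: Vec<Action>,
}

struct Device {
    // In a real crate this field would sit behind
    // `#[cfg(feature = "trace")]`, so builds without the feature carry nothing.
    trace: Option<Mutex<Trace>>,
}

impl Device {
    /// Tracing is decided when the device is created.
    fn new(enable_trace: bool) -> Self {
        let trace = if enable_trace {
            Some(Mutex::new(Trace::default()))
        } else {
            None
        };
        Device { trace }
    }

    /// Records the action only if a recorder was installed.
    fn record(&self, action: Action) {
        if let Some(trace) = &self.trace {
            trace.lock().unwrap().actions.push(action);
        }
    }

    fn create_buffer(&self, id: u32, size: u64) {
        self.record(Action::CreateBuffer { id, size });
        // ... the actual buffer creation would go here ...
    }

    fn submit(&self, index: u32) {
        self.record(Action::Submit(index));
        // ... the actual queue submission would go here ...
    }

    /// Returns a copy of everything recorded so far (empty when disabled).
    fn recorded(&self) -> Vec<Action> {
        match &self.trace {
            Some(trace) => trace.lock().unwrap().actions.clone(),
            None => Vec::new(),
        }
    }
}
```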

The result of tracing is a directory, containing the following files:

"trace.ron" is basically a list of all the actions taken by the app, in RON format.
The contents of each buffer update and each created shader module are placed into a separate binary file referenced by the trace.
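
The directory layout above could be produced roughly like this. It is a sketch under stated assumptions: `TraceWriter`, the `WriteBuffer`/`CreateShaderModule` entry shapes, and the `dataN.bin`/`dataN.spv` file names are all made up for illustration, and the RON-like text is formatted by hand with the standard library instead of going through Serde as the real code does.

```rust
use std::{
    fs, io,
    path::{Path, PathBuf},
};

/// Hypothetical writer for a trace directory.
struct TraceWriter {
    dir: PathBuf,
    entries: Vec<String>,
    next_blob: usize,
}

impl TraceWriter {
    fn new(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(TraceWriter {
            dir: dir.to_path_buf(),
            entries: Vec::new(),
            next_blob: 0,
        })
    }

    /// Stores a payload in its own binary file and returns the file name.
    fn write_blob(&mut self, extension: &str, data: &[u8]) -> io::Result<String> {
        let name = format!("data{}.{}", self.next_blob, extension);
        self.next_blob += 1;
        fs::write(self.dir.join(&name), data)?;
        Ok(name)
    }

    /// Records a buffer update; the bytes go to a side file.
    fn write_buffer(&mut self, id: u32, data: &[u8]) -> io::Result<()> {
        let file = self.write_blob("bin", data)?;
        self.entries
            .push(format!("WriteBuffer(id: {}, data: {:?})", id, file));
        Ok(())
    }

    /// Records a shader module creation; the code goes to a side file.
    fn create_shader_module(&mut self, id: u32, code: &[u8]) -> io::Result<()> {
        let file = self.write_blob("spv", code)?;
        self.entries
            .push(format!("CreateShaderModule(id: {}, data: {:?})", id, file));
        Ok(())
    }

    /// Writes the accumulated action list as `trace.ron`.
    fn finish(self) -> io::Result<()> {
        let mut text = String::from("[\n");
        for entry in &self.entries {
            text.push_str("    ");
            text.push_str(entry);
            text.push_str(",\n");
        }
        text.push_str("]\n");
        fs::write(self.dir.join("trace.ron"), text)
    }
}
```

Keeping the payloads out of "trace.ron" is what keeps the action list short enough to read and edit by hand.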

The new "player" application can be used to replay the traces. It's launched as:

target/debug/player <trace-dir>

When built with the "winit" feature, it's able to replay the workloads that operate on a swapchain. It renders each frame in sequence, then waits for the user to close the window. When built without "winit", it launches in console mode and can replay any trace that doesn't use swapchains.
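
The two player modes can be pictured as one replay loop with a capability flag. This is a rough sketch, not the player's actual code: `ReplayAction`, `ReplayStats`, `ReplayError`, and the `has_window` parameter are hypothetical names, and the real player dispatches to wgpu-core instead of counting.

```rust
/// Hypothetical subset of trace actions relevant to the replay loop.
#[derive(Debug, PartialEq)]
enum ReplayAction {
    CreateSwapChain,
    Submit(u32),
    Present,
}

#[derive(Debug, Default, PartialEq)]
struct ReplayStats {
    submissions: usize,
    frames: usize,
}

#[derive(Debug, PartialEq)]
enum ReplayError {
    /// The trace uses a swapchain but the player has no window.
    SwapChainUnsupported,
}

/// Walks the actions in order. `has_window` models a build with the
/// "winit" feature; without it, swapchain actions are rejected.
fn replay(actions: &[ReplayAction], has_window: bool) -> Result<ReplayStats, ReplayError> {
    let mut stats = ReplayStats::default();
    for action in actions {
        match action {
            ReplayAction::CreateSwapChain | ReplayAction::Present if !has_window => {
                return Err(ReplayError::SwapChainUnsupported);
            }
            ReplayAction::CreateSwapChain => {}
            ReplayAction::Submit(_) => stats.submissions += 1,
            ReplayAction::Present => stats.frames += 1,
        }
    }
    Ok(stats)
}
```

A compute-only trace replays fine with `has_window` set to `false`, which corresponds to the console mode described above.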

How does it look?

Having the trace in RON makes it easy to inspect visually (by humans!). There is no separate "dump" phase. Each submission is labelled, and the trace is easy to navigate. Moreover, everything is human-editable: you can tweak it to see what changes would be needed in an application in order to fix a rendering issue. For example, you can change the blend modes and re-play.

CreateBindGroupLayout(
    id: Id(0, 1, Metal),
    label: "",
    entries: [
        (
            binding: 0,
            visibility: (
                bits: 4,
            ),
            ty: StorageBuffer,
            multisampled: false,
            has_dynamic_offset: false,
            view_dimension: D2,
            texture_component_type: Float,
            storage_texture_format: Rgb10a2Unorm,
        ),
    ],
),
CreateBindGroup(
    id: Id(0, 1, Metal),
    label: "",
    layout_id: Id(0, 1, Metal),
    entries: {
        0: Buffer(
            id: Id(1, 1, Metal),
            offset: 0,
            size: 16,
        ),
    },
),
CreatePipelineLayout(
    id: Id(0, 1, Metal),
    bind_group_layouts: [
        Id(0, 1, Metal),
    ],
),
CreateComputePipeline(
    id: Id(0, 1, Metal),
    desc: (
        layout: Id(0, 1, Metal),
        compute_stage: (
            module: Id(0, 1, Metal),
            entry_point: "main",
        ),
    ),
),
Submit(2, [
    CopyBufferToBuffer(
        src: Id(0, 1, Metal),
        src_offset: 0,
        dst: Id(1, 1, Metal),
        dst_offset: 0,
        size: 16,
    ),
    RunComputePass(
        commands: [
            SetPipeline(Id(0, 1, Metal)),
            SetBindGroup(
                index: 0,
                num_dynamic_offsets: 0,
                bind_group_id: Id(0, 1, Metal),
            ),
            Dispatch((4, 1, 1)),
        ],
        dynamic_offsets: [],
    ),
    CopyBufferToBuffer(
        src: Id(1, 1, Metal),
        src_offset: 0,
        dst: Id(0, 1, Metal),
        dst_offset: 0,
        size: 16,
    ),
]),

Alternatives considered

Q: Why not just add support to apitrace (or renderdoc) instead?
A: There is a benefit to having this Rusty: we can use heavy lifters like Serde and RON, and we use pattern matching extensively. We can still consider something more independent of the language in the future, something with a C API that would work for both Dawn and wgpu-native. I just don't want to invest in that now, and going for the local solution was quite straightforward.

Q: Why not slice the execution at a particular time, instead of recording everything from the beginning?
A: I think both approaches are needed, but for different purposes. Tracing allows us to share and reproduce rendering issues. It can also aid in building a test corpus. Slicing vertically would help to investigate internal state problems. We can still get that in the future.

Q: Why RON and not JSON/YAML/etc?
A: Experience with WebRender proved RON to be ideal for this kind of use. Technically, nothing blocks us from writing to another format: it's a matter of a single line change, plus some extra dependencies. After all, this is all going through Serde :)

aloucks · 2020-04-30T00:30:44Z

Have you considered using json or yaml for serialization instead of ron? It might be easier in the long run to incorporate other tooling if you're using a well-supported standard serialization format.

kvark · 2020-04-30T00:34:13Z

@aloucks did you read to the end? This is exactly the last item of "Alternatives considered" section, at the bottom of the description.
TL;DR: Switching this to any other Serde based format is a matter of a handful of lines, this code doesn't really depend on it in any way. If there is interest in larger interop, e.g. with Dawn using this very format, we can add YAML/JSON/whatever in a matter of minutes.

aloucks · 2020-04-30T00:40:05Z

@kvark Ah, I see that now 👍

grovesNL

Looks great overall! Thanks for doing this, I think it will greatly help debugging. There is a bit of complexity added by the #[cfg(feature = "trace")] in the middle of function bodies, generics over label, map_label, etc. but it seems worthwhile here.

player/src/main.rs

wgpu-core/src/command/allocator.rs

wgpu-core/src/device/mod.rs

wgpu-types/src/lib.rs

wgpu-core/src/swap_chain.rs

grovesNL · 2020-04-30T03:03:23Z

wgpu-types/src/lib.rs

@@ -4,13 +4,16 @@

 #[cfg(feature = "peek-poke")]
 use peek_poke::PeekPoke;
-#[cfg(feature = "serde")]
-use serde::{Deserialize, Serialize};
+#[cfg(feature = "replay")]


I think it might be confusing for people looking to enable serialization to have to enable trace or replay. Most people would probably look for a serde or serde1 feature because that's the common place to look.

For example, users may just want to serialize their types, but shouldn't necessarily need to understand the association between serialization and trace/replay (even if trace/replay happen to also require serialization).

Right, I agree. Also, I think there is value to separating those even outside of the trace/replay scope: quite often the user knows exactly which side is needed, and deriving both affects compile times quite a bit.
Perhaps, this is something to address by documentation?

grovesNL · 2020-04-30T03:12:55Z

player/src/main.rs

+    ptr,
+};
+
+macro_rules! gfx_select {


Should we expose gfx_select from wgpu-core and use that?

Possibly, yes. Today, the one used in wgpu-rs is slightly different though. We may end up with small differences across the users, and the macro itself is pretty small. Let's keep an eye on it for follow-ups?

wgpu-core/src/command/transfer.rs

wgpu-core/src/device/mod.rs

kvark · 2020-04-30T13:30:38Z

bors r=grovesNL

17: Update for wgpu-core r=kvark a=kvark Depends on gfx-rs/wgpu#619 Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>

291: Update for wgpu-core r=grovesNL a=kvark Depends on gfx-rs/wgpu#619 Closes #285 Co-authored-by: Dzmitry Malyshau <kvarkus@gmail.com>
