Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Fix staging buffer required size calculation (fixes #1056) #1509

Closed
wants to merge 1 commit into from

Conversation

rmsc
Copy link
Contributor

@rmsc rmsc commented Feb 23, 2021

Fix staging buffer required size calculation (fixes #1056)

The required_staging_buffer_size is currently calculated differently in two places, each will be correct in different situations:

  • prepare_staging_buffers() based on actual buffer_byte_len()
  • set_required_staging_buffer_size_to_max() based on item_size

In the case of render assets, prepare_staging_buffers() would only operate over changed assets. If some of the assets didn't change, their size wouldn't be taken into account for the required_staging_buffer_size. In some cases, this meant the buffers wouldn't be resized when they should. Now prepare_staging_buffers() is called over all assets, which may hit performance but at least gets the size right.

Shortly after prepare_staging_buffers(), set_required_staging_buffer_size_to_max() would unconditionally overwrite the previously computed value, even if using item_size made no sense. Now it only overwrites the value if bigger.

This can be considered a short term hack, but should prevent a few hard to debug panics.

@rmsc
Copy link
Contributor Author

rmsc commented Feb 23, 2021

This could also fix #1138.

@Ratysz Ratysz added P-Crash A sudden unexpected crash A-Rendering Drawing game state to the screen labels Feb 23, 2021
@rparrett
Copy link
Contributor

I'm pretty sure I also tried this approach at some point.

For what it's worth, I'm super excited to have someone else experiencing this crash and motivated to dig into the engine to try to fix it.

As far as testing this I have

https://github.com/rparrett/bevy-test/tree/render-panic

Which is a minimal test case which this seems to fix

https://github.com/rparrett/taipo/tree/spritesheet-panic-two

Which is an old snapshot of my game, modified to self-play until it crashes. This changeset seems to crash the game even sooner with a very similar panic. But it's a nightmare of forked dependencies and I'm not positive that I ended up testing this well.

https://github.com/rparrett/taipo which is currently crashing somewhat consistently, but only after over 5 minutes of play which requires being able to read some Japanese.

I'm out of time to test this one, but I'll check it out later tonight.

@rmsc
Copy link
Contributor Author

rmsc commented Feb 23, 2021

https://github.com/rparrett/taipo/tree/spritesheet-panic-two

Which is an old snapshot of my game, modified to self-play until it crashes. This changeset seems to crash the game even sooner with a very similar panic. But it's a nightmare of forked dependencies and I'm not positive that I ended up testing this well.

I've tested this a couple of times as-is (without any change), and it never crashed. The crabs just went on to attack the rock and stayed there forever. So perhaps these we are dealing with different bugs?

I've tried applying the proposed changes on top of the used version of bevy, but as soon as Cargo.lock was touched bevy_tiled stopped compiling. I've exhausted my cargo foo trying to get this to work..

https://github.com/rparrett/taipo which is currently crashing somewhat consistently, but only after over 5 minutes of play which requires being able to read some Japanese.

Got this to compile and load in the browser, but the game didn't start after hitting "English". Not sure what to try out next.

I'm out of time to test this one, but I'll check it out later tonight.

I would really appreciate it, thanks!

@rparrett
Copy link
Contributor

rparrett commented Feb 23, 2021

spritesheet-panic-two

I should mention that I believe that a high-dpi display is required to reproduce this. I've been meaning to try just doubling the font sizes on my low-dpi machine, but haven't gotten around to it.

game didn't start after hitting "English"

Not sure what could be going on there. Not something I've experienced on either machine I have access to. No errors in the javascript console? I wouldn't be surprised if there's some new-bevy-scheduler-related bug though.

But also I haven't actually tried to reproduce the panic in english-mode, because I'm assuming that large/multiple font atlases are playing a role.

@rparrett
Copy link
Contributor

Really appreciate you digging into this so much.

So I suspect that the "game doesn't start after pressing a button in the main menu" issue you're seeing could actually be the game panicking.

I got the auto-typing / auto-crashing working on the latest version of the game which works with the latest version of bevy.

https://github.com/rparrett/taipo/tree/spritesheet-panic-three (click the third menu item, "Kana + N5 + Yamanote")

And what I'm seeing is:

  • with bevy main (bc4fe9b): panic when the timer hits 9.1 seconds.
  • with this PR: panic immediately after clicking the menu item.

@rmsc rmsc marked this pull request as draft February 25, 2021 07:42
@rmsc
Copy link
Contributor Author

rmsc commented Feb 25, 2021

You're absolutely right, this is indeed a panic. I'll try to figure out what's happening this time, I probably missed something. At least now it's really quick to reproduce :)

@rmsc
Copy link
Contributor Author

rmsc commented Feb 26, 2021

I just had fairly large dose of good old humble pie. Turns out the buffer allocation logic in RenderResourceNode is a bit more complicated than I first realized.

I've revised my PR, and it now fixes both test-cases. I went over spritesheet-panic-three and now it doesn't seem to crash.

@rmsc
Copy link
Contributor Author

rmsc commented Feb 26, 2021

I've updated the PR description, it should explain what was done and why it was needed.

I must add that this is merely a workaround, and as @cart mentioned in #1208 the overall logic probably needs some overhauling.

@rmsc rmsc marked this pull request as ready for review February 26, 2021 15:20
@rparrett
Copy link
Contributor

I've updated spritesheet_panic_three to play further into the game.

Sadly, with this commit, it seems that we're just delaying the panic for about 9 seconds.

@rparrett
Copy link
Contributor

I wouldn't be surprised if there were multiple bugs producing this same panic, but the code is a bit too arcane for me to determine whether some strange thing I see in there is actually a bug.

@rmsc
Copy link
Contributor Author

rmsc commented Feb 26, 2021

I've played spritesheet_panic_three a few times, both with and without this patch, and in all cases the game went all the way to the end. So it may be that these are indeed different bugs:

  • My initial attempt fixed an existing bug A triggered by render-panic, but broke spritesheet-panic-three
  • This second attempt fixed bug A without breaking spritesheet-panic-three any further (please confirm?)
  • Neither attempts fixed bug B in spritesheet-panic-three
  • So far I wasn't able to reproduce bug B

It would really be helpful if someone could figure out how to reproduce bug B in my system. Maybe tomorrow I can try this out in another box I have.

@rparrett
Copy link
Contributor

rparrett commented Feb 26, 2021

Bizarre that you're no longer seeing panics in spritesheet_panic_three. Here's what I witnessed:

  • Tried spritesheet_panic_three with c25ee55 patched in Cargo.toml and saw no panic initially. Continued playing the game manually after autotyping ended and saw the same panic later on
  • Added rparrett/taipo@bda5dc3 which just adds some more auto-typing to reproduce that new later-panic automatically
  • With new auto-typing in place, checked with/without c25ee55 and saw both the original ~9 second timer panic and the ~0 second timer panics

I suppose I could try building on my other machine, but I'd expect these wasm builds to be fairly universal.

What's your set up over there? I'm mainly building on macos (4k display) and running the game in Chrome.

Another thing to mention is that this combination of chrome + microserver really seems to like holding onto cached wasm. And it seems common for cargo to become slightly confused and need to be cargo cleaned while testing.

There may be multiple bugs, but I think it's also super easy to make this panic temporarily go away by changing some code that results in the staging buffer just being oversized instead of undersized.

@rmsc
Copy link
Contributor Author

rmsc commented Feb 27, 2021

I was building in a Linux laptop (HD display) and using firefox. The panics I saw in spritesheet_panic_three were caused by my initial fix attempt.

I've just tried this in another Linux box with a 4K display, and I can now consistently reproduce the problem. Perhaps this is a buffer alignment bug?

I've made a few local changes to spritesheet-panic-three to enable it to compile natively. It's a lot less painful to debug that way. The only problems I'm having is matching wasm-bindgen versions. Have you considered using wasm-pack rather than manually installing wasm-bindgen-cli? Just realized that maybe bevy doesn't yet work with wasm-pack.

@rparrett
Copy link
Contributor

Okay, cool. I updated spritesheet-panic-three so that we can run it natively and we're looking at the same thing.

Improving the wasm build situation and allowing optional native builds is on my todo-list, but I've got sort of a delicate balance going on with the third party bevy plugins and I don't want to fall back into a cargo-feature-dependecy-heck again.

I did suspect some sort of buffer alignment thing early on and spent a day working that angle but didn't get anywhere. Possibly worth another look now that I'm a bit more familiar with bevy's innards.

@rparrett
Copy link
Contributor

rparrett commented Feb 28, 2021

I've updated my minimal test case to be slightly more crashy based on a previous observation in Taipo that adding another font seems to make things worse.

I added some static text that that just says "hi" in comic neue and now it's easier to produce the panic. It now crashes both of the commits in this PR.

https://github.com/rparrett/bevy-test/render-panic

I've finished up the Taipo gameplay stuff I wanted to and will hopefully be digging into this again soon.

@rmsc rmsc marked this pull request as draft February 28, 2021 16:55
@rmsc
Copy link
Contributor Author

rmsc commented Feb 28, 2021

Great news! Those changes made render-panic crash again in my laptop, so the 4k thing turned out to be a red-herring.

I think I finally figured it out. Not only were the size calculations mixed up, but there is a data race mismatch between when the buffers are resized and when they are written to!

Here's a sorted side by side comparison of buffer items at prepare_uniform_buffers() and write_uniform_buffers():

len: write 22 prepare 20
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 8
write 8, prepare 16
write 16, prepare 960
write 32, prepare 1024
write 960, prepare 1024
write 1024, prepare 1024
write 1024, prepare 1024
write 1024, prepare 1024
write 1024, prepare 1040
write 1024, prepare 1184
write 1040, prepare 1232
write 1184, None None
write 1232, None None
thread 'Compute Task Pool (0)' panicked at 'range end index 9672 out of range for slice of length 9608', /home/renato/dev/rust/bevy/crates/bevy_render/src/render_graph/nodes/render_resources_node.rs:384:26

Some items were removed and other added in between those calls (one of size 8 and another of size 32). Oops.

@rmsc rmsc marked this pull request as ready for review February 28, 2021 18:42
@rmsc
Copy link
Contributor Author

rmsc commented Feb 28, 2021

Turns out data race isn't exactly the right description. It was more of a size computation over an incomplete set.

New attempt, this time successfully fixing it in all my machines (HD and 4K). @rparrett could you please try this one?

EDIT: also updated the PR description/explanation above.

@rparrett
Copy link
Contributor

rparrett commented Mar 1, 2021

So this changeset now prevents crashes in

  • rparrett/bevy-test/render-panic The minimal repro that may or may not be completely representative of the behavior in my game
  • rparrett/taipo/spritesheet-panic-three The autoplaying version of my game
  • rparrett/taipo The "full game" that was previously crashing mid-boss-fight which was a bummer.

I've also done some very quick/non-comprehensive testing of bevy examples and I don't think anything's broken.

But I think the bar may still be somewhat high for proving that these changes are good while Cart is still occupied with bevy_ecs.

The prepare_uniform_buffers / changed_assets seems to make some sense to me, but probably results in the staging buffer being oversized when the buffer arrays are not resized? And maybe that's acceptable. I'm not familiar enough with the code to comment on the other change.

I wonder if a similar change to changed_assets further up around the code introduced by #1208 makes sense?

Maybe we can convince @mockersf to come around for third-opinion.

@rmsc
Copy link
Contributor Author

rmsc commented Mar 1, 2021

But I think the bar may still be somewhat high for proving that these changes are good while Cart is still occupied with bevy_ecs.

From what I understood from #1208 that code is bound for a rewrite sooner rather than later. So the bar may actually be pretty low right now.

The prepare_uniform_buffers / changed_assets seems to make some sense to me, but probably results in the staging buffer being oversized when the buffer arrays are not resized? And maybe that's acceptable. I'm not familiar enough with the code to comment on the other change.

This PR should change nothing in terms of buffer size, the added penalty is in terms of a few unneeded get_or_assign_index() calls.

I wonder if a similar change to changed_assets further up around the code introduced by #1208 makes sense?

Maybe we can convince @mockersf to come around for third-opinion.

I don't think it matters, because initialize() is a no-op unless self.buffer_arrays.len() != render_resources.render_resources_len(). That said I would definitely appreciate his opinion, as I'm fairly new to this code myself.

I don't know what a normal use-case looks like, but for instance in render-panic self.buffer_arrays.len() grows to 3 in the first few frames and stays there. The last resource keeps growing in size (changing), but its item_size is stuck at the initial value (256). I don't know if this is intended or not (maybe not).

@mockersf
Copy link
Member

mockersf commented Mar 1, 2021

I'm not that confident in what this code is doing exactly, but this PR change is very small, and only potential issue I see is that we are preparing buffers for assets that didn't change. This would probably send empty (but ignored) buffers to the gpu?

ideally, we should only do that for all assets if the buffer arrays have been resized, here something like

    if resized {
       // keep track if we added something new to the buffer arrays
        let mut unchanged_added = false;
        for (asset_handle, asset) in assets.iter() {
            if !changed_assets.contains_key(&asset_handle) {
                uniform_buffer_arrays.prepare_uniform_buffers(asset_handle, asset);
                unchanged_added = true;
            }
        }
        if unchanged_added {
            // Something was added, 
            uniform_buffer_arrays.resize_buffer_arrays(render_resource_context);
        }
        uniform_buffer_arrays.set_required_staging_buffer_size_to_max();
    }

after testing, calling resize_buffer_arrays again doesn't seem to do anything, so

    if resized {
        for (asset_handle, asset) in assets.iter() {
            if !changed_assets.contains_key(&asset_handle) {
                uniform_buffer_arrays.prepare_uniform_buffers(asset_handle, asset);
            }
        }

        uniform_buffer_arrays.set_required_staging_buffer_size_to_max();
    }

This create smaller buffers while still removing all crashes for me

It depends on what is costlier:

  • iterate through changed assets, and in case of resize iterate though all assets and check if it has been changed
  • iterate through all assets, and have larger buffers than needed. or at least more calls to get_or_assign_index, not sure if buffers are actually created at some point with the actual size reserved

that... I don't know. first probably costs more cpu when resizing is needed, second maybe more memory and more cpu (but less than the resize case for other possibility).

The other change in this PR (https://github.com/bevyengine/bevy/pull/1509/files#diff-454d965b87cfdc4fb81e3da21b073fe7545caa90469aa65fab5ab221db064025L219) make complete sense given the name of the method

@rparrett
Copy link
Contributor

rparrett commented Mar 2, 2021

This seems pretty difficult to benchmark, but just from looking at framerates in my own game after disabling vsync, I can't really see any difference through the noise between the current state of this PR and mockersf's suggestion.

I think this is probably ready for cart to weigh in when he has a moment.

@rmsc
Copy link
Contributor Author

rmsc commented Mar 2, 2021

I'm not that confident in what this code is doing exactly, but this PR change is very small, and only potential issue I see is that we are preparing buffers for assets that didn't change. This would probably send empty (but ignored) buffers to the gpu?

Each time asset_render_resources_node_system() runs, begin_update() sets required_stagin_buffer_size to zero.

Shortly after that, prepare_uniform_buffers() does only two things:

  1. Gets or assigns assets to the buffer array. If the asset is already there, it does nothing.
  2. Increases required_staging_buffer_size by the total size of that asset

The first part of this PR just does (1) and (2) for all assets, rather than for changed ones. (1) should normally be a no-op, so (2) keeps the staging buffer large enough to fit all assets. Otherwise the staging buffer may be too small for a full transfer, which will happen if resized.

There should be no effect on what is sent to the GPU, only in the sizing of the staging buffer. The bindings and writes are done later.

ideally, we should only do that for all assets if the buffer arrays have been resized, here something like

The end result should be about the same, but I agree. I'll update the PR.

It depends on what is costlier:

I don't think there will be any measurable performance difference here.

What may end up mattering more is how often the staging buffer is resized (and every single asset copied over to the GPU). Allocating a larger buffer means less allocations and less full copies. But resizes should be rather rare anyway.

@rmsc rmsc force-pushed the fix_1056 branch 2 times, most recently from 52cdba3 to 69c1bb1 Compare March 3, 2021 10:13
* when doing a full asset copy (resize), prepare_uniform_buffers() is
  now called on all assets rather than just changed ones. This makes
  sure the staging buffer is large enough.

* set_required_staging_buffer_size_to_max() now doesn't overwrite the
  value computed by prepare_uniform_buffers() if the resulting size
  would be smaller.

Co-authored-by: Renato Caldas <renato@calgera.com>
Co-authored-by: François <mockersf@gmail.com>
@cart
Copy link
Member

cart commented Mar 6, 2021

Nice work here. This makes sense to me as a workable stop-gap solution.

@cart
Copy link
Member

cart commented Mar 6, 2021

bors r+

bors bot pushed a commit that referenced this pull request Mar 6, 2021
Fix staging buffer required size calculation (fixes #1056)

The `required_staging_buffer_size` is currently calculated differently in two places, each will be correct in different situations:

* `prepare_staging_buffers()` based on actual `buffer_byte_len()`
* `set_required_staging_buffer_size_to_max()` based on item_size

In the case of render assets, `prepare_staging_buffers()` would only operate over changed assets. If some of the assets didn't change, their size wouldn't be taken into account for the `required_staging_buffer_size`. In some cases, this meant the buffers wouldn't be resized when they should. Now `prepare_staging_buffers()` is called over all assets, which may hit performance but at least gets the size right.

Shortly after `prepare_staging_buffers()`,  `set_required_staging_buffer_size_to_max()` would unconditionally overwrite the previously computed value, even if using `item_size` made no sense. Now it only overwrites the value if bigger.

This can be considered a short term hack, but should prevent a few hard to debug panics.
@rparrett
Copy link
Contributor

rparrett commented Mar 6, 2021

Thanks everyone! I am so very excited to un-fork bevy!

@bors
Copy link
Contributor

bors bot commented Mar 6, 2021

@bors bors bot changed the title Fix staging buffer required size calculation (fixes #1056) [Merged by Bors] - Fix staging buffer required size calculation (fixes #1056) Mar 6, 2021
@bors bors bot closed this Mar 6, 2021
@rmsc rmsc deleted the fix_1056 branch March 6, 2021 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen P-Crash A sudden unexpected crash
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error loading two different texture atlases back to back
5 participants