
Add particles stress test #78

Merged
merged 11 commits into gschup:main on Nov 1, 2023
Conversation

johanhelsing
Collaborator

@johanhelsing johanhelsing commented Oct 25, 2023

Objective

We need some easy way of checking for performance regressions.

Solution

  • Add an example that spawns particles when either of the players holds down a key.
  • Pass most parameters through command-line args

Depends on #76 and #77

todo:

  • checksumming
  • figure out why it desyncs
  • command-line options for using reflect-based snapshotting instead?

@johanhelsing johanhelsing mentioned this pull request Oct 25, 2023
@bushrat011899
Contributor

This looks great! Once #76 is merged I'll have a more detailed look into how this benchmark works.

@bushrat011899
Contributor

I ran some profiling and I can see where the slowdown is coming from: the AddRollbackCommand needs exclusive world access, and each call had to sort the whole list of Rollback components, as you pointed out. I updated this PR with the changes I have since made upstream, and that lag spike now appears to be non-existent.
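
The fix described above can be sketched in plain Rust (simplified, hypothetical types, not the actual bevy_ggrs implementation): instead of re-sorting the whole list on every insert, a monotonically increasing index makes registering a rollback entity O(1) while still giving every entity a stable, deterministic order.

```rust
use std::collections::HashMap;

// Illustrative sketch only: a stable order index handed out at registration
// time, so no sorting pass is ever needed.
#[derive(Default)]
struct OrderIndex {
    order: HashMap<u64, u64>, // entity id -> stable order index
    next: u64,
}

impl OrderIndex {
    // O(1): assign the next index to a newly registered rollback entity.
    fn push(&mut self, entity: u64) -> u64 {
        let index = self.next;
        self.next += 1;
        self.order.insert(entity, index);
        index
    }

    // Look up the stable order index previously assigned to an entity.
    fn order(&self, entity: u64) -> u64 {
        self.order[&entity]
    }
}

fn main() {
    let mut index = OrderIndex::default();
    assert_eq!(index.push(42), 0);
    assert_eq!(index.push(7), 1);
    assert_eq!(index.order(42), 0);
}
```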

@johanhelsing
Collaborator Author

johanhelsing commented Oct 26, 2023

I added checksumming to Velocity, and unfortunately this example desyncs if I spawn quite a lot of particles.

I've also been experimenting a bit with bevy_gaff on this version. I fixed a bunch of non-determinism issues in bevy_xpbd (locally), but I'm still left with some strange rare desyncs there that I don't understand. So I'm thinking perhaps we still have a bug in bevy_ggrs?

EDIT:

Run like this to easily trigger a few rollbacks

cargo watch -cx "run --release --example particles -- --local-port 7000 --players localhost 127.0.0.1:7001 --input-delay 0 --desync-detection-interval 1 --rate 1"
cargo watch -cx "run --release --example particles -- --local-port 7001 --players 127.0.0.1:7000 localhost --input-delay 0 --desync-detection-interval 1 --rate 1"

Seems to happen on rollbacks (push or release button)

@johanhelsing johanhelsing marked this pull request as ready for review October 26, 2023 08:20
@johanhelsing
Collaborator Author

Another interesting thing is that it only seems to happen if rollbacks actually spawn new particles. I added an N key for a no-op, it doesn't actually change anything, but still triggers a rollback because the input is different, and remapped spawning to the space key.

  • Releasing space doesn't trigger desyncs
  • Neither does pressing or releasing N
  • Pushing down space usually triggers desyncs, but not always

@bushrat011899 (Contributor) left a comment


Really like this stress test so far! It's highlighted some very interesting behaviour regarding desync detection and has already helped me solve at least one performance regression (the sort_unchecked in the previous PR).

@@ -43,3 +44,7 @@ path = "examples/box_game/box_game_spectator.rs"
[[example]]
name = "box_game_synctest"
path = "examples/box_game/box_game_synctest.rs"

[[example]]
Contributor


It would be nice to add a short blurb in the README.md explaining how to run this stress test:

cargo run --release --example particles -- --local-port 7000 --players localhost 127.0.0.1:7001 --input-delay 0 --desync-detection-interval 1 --rate 1
cargo run --release --example particles -- --local-port 7001 --players 127.0.0.1:7000 localhost --input-delay 0 --desync-detection-interval 1 --rate 1

Collaborator Author

I put in a clap help section instead, which I think makes it easy to discover (and keep in sync, since it's in the source file).

dev/bevy_ggrs particles-stress-test↑1 ~1 →1 …1 cargo run --release --example particles
    Finished release [optimized] target(s) in 0.31s
     Running `target\release\examples\particles.exe`
error: the following required arguments were not provided:
  --local-port <LOCAL_PORT>

Usage: particles.exe --local-port <LOCAL_PORT>

For more information, try '--help'.
error: process didn't exit successfully: `target\release\examples\particles.exe` (exit code: 2)
dev/bevy_ggrs particles-stress-test↑1 ~1 →1 …1 cargo run --release --example particles -- --help
    Finished release [optimized] target(s) in 0.30s
     Running `target\release\examples\particles.exe --help`
Stress test for bevy_ggrs

## Basic usage:

Player 1:

cargo run --release --example particles -- --local-port 7000 --players localhost 127.0.0.1:7001

Player 2:

cargo run --release --example particles -- --local-port 7001 --players 127.0.0.1:7000 localhost

Usage: particles.exe [OPTIONS] --local-port <LOCAL_PORT>

Options:
  -l, --local-port <LOCAL_PORT>
          The udp port to bind to for this peer

  -p, --players <PLAYERS>...
          Address and port for the players. Order is significant. Put yourself as "localhost".

          e.g. `--players localhost 127.0.0.1:7001`

  -s, --spectators <SPECTATORS>...
          Address and port for any spectators

  -i, --input-delay <INPUT_DELAY>
          How long inputs should be kept before they are deployed. A low value, such as 0, will result in low latency, but plenty of rollbacks

          [default: 2]

  -d, --desync-detection-interval <DESYNC_DETECTION_INTERVAL>
          How often the clients should exchange and compare checksums of state

          [default: 10]

      --continue-after-desync
          Whether to continue after a detected desync; the default is to panic

  -n, --rate <RATE>
          How many particles to spawn per frame

          [default: 100]

  -f, --fps <FPS>
          Simulation frame rate

          [default: 60]

      --max-prediction <MAX_PREDICTION>
          How far ahead we should simulate when we don't get any input from a player

          [default: 8]

      --reflect
          Whether to use reflect-based rollback. This is much slower than the default clone/copy-based rollback

  -h, --help
          Print help (see a summary with '-h')

examples/stress_tests/particles.rs (resolved)
.rollback_component_with_copy::<Ttl>()
.rollback_resource_with_clone::<ParticleRng>()
.checksum_component_with_hash::<Velocity>()
// todo: ideally we'd also be doing checksums for Transforms, but that's
Contributor

Agreed. Perhaps a ComponentChecksumCustomHashPlugin which would accept a function fn<H: Hasher>(&C, &mut H) could be used to retroactively add hashing to types which don't support it?
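
A minimal std-only sketch of that idea (all names here are hypothetical, nothing from the actual bevy_ggrs API): storing a user-supplied hashing function lets checksums be added retroactively to components that don't implement `Hash`. Taking `&mut dyn Hasher` sidesteps the fact that fn pointers can't be generic over `H: Hasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hypothetical plugin shape: holds the custom per-component hash function.
struct ComponentChecksumCustomHashPlugin<C> {
    hash_fn: fn(&C, &mut dyn Hasher),
}

// Example component without a Hash impl (f32 fields can't derive Hash).
struct Transform2 {
    x: f32,
    y: f32,
}

// User-supplied function: hash the raw bits of each float.
fn hash_transform(t: &Transform2, hasher: &mut dyn Hasher) {
    hasher.write_u32(t.x.to_bits());
    hasher.write_u32(t.y.to_bits());
}

fn main() {
    let plugin = ComponentChecksumCustomHashPlugin {
        hash_fn: hash_transform,
    };
    let mut hasher = DefaultHasher::new();
    (plugin.hash_fn)(&Transform2 { x: 1.0, y: -2.5 }, &mut hasher);
    println!("checksum part: {:x}", hasher.finish());
}
```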

}

fn update_particles(mut particles: Query<(&mut Transform, &mut Velocity)>, args: Res<Args>) {
let time_step = 1.0 / args.fps as f32; // todo: replace with bevy_ggrs resource?
Contributor

On this theme, I think it would be good to replace the Time resource during advance-frame with one whose time bevy_ggrs controls manually. Then, systems already using the canonical Bevy Time resource would automatically be rollback-aware.
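
A rough sketch of what such a manually driven clock might look like (hypothetical type, not an actual bevy_ggrs or Bevy API): the framework advances it once per simulated frame, including resimulated frames after a rollback, so delta-time-based systems stay deterministic.

```rust
// Sketch only: a clock the rollback framework controls explicitly.
#[derive(Default)]
struct RollbackTime {
    elapsed: f64,
    delta: f64,
}

impl RollbackTime {
    // Called once per simulated frame, including rollback resimulation,
    // so every frame sees the same fixed delta regardless of wall time.
    fn advance_frame(&mut self, fps: u32) {
        self.delta = 1.0 / fps as f64;
        self.elapsed += self.delta;
    }
}

fn main() {
    let mut time = RollbackTime::default();
    for _ in 0..60 {
        time.advance_frame(60);
    }
    // After 60 frames at 60 fps, roughly one simulated second has elapsed.
    assert!((time.elapsed - 1.0).abs() < 1e-9);
}
```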

examples/stress_tests/particles.rs (outdated, resolved)
@bushrat011899
Contributor

* Releasing space doesn't trigger desyncs

I'm still testing with this, trying to work out what's going wrong, and unfortunately I've seen it trigger desyncs purely on the release of the spacebar. I've also tested with a much coarser hashing function (hashing only the integer component of velocity) with no difference in desync behaviour. I suspect the issue is related to the creation or removal of rollback entities on predicted frames. I want to resolve this, since I see it as a real blocker to more advanced future functionality (e.g. desync resolution through resynchronisation).

@bushrat011899
Contributor

OK, it absolutely has to do with entities only and nothing to do with the velocity component. I created an EntityChecksumPlugin:

use std::hash::{BuildHasher, Hash, Hasher};

use bevy::prelude::*;

use crate::{ChecksumFlag, ChecksumPart, Rollback, RollbackOrdered, SaveWorld, SaveWorldSet};

pub struct EntityChecksumPlugin;

impl EntityChecksumPlugin {
    #[allow(clippy::type_complexity)]
    pub fn update(
        mut commands: Commands,
        rollback_ordered: Res<RollbackOrdered>,
        components: Query<&Rollback, (With<Rollback>, Without<ChecksumFlag<Entity>>)>,
        mut checksum: Query<&mut ChecksumPart, (Without<Rollback>, With<ChecksumFlag<Entity>>)>,
    ) {
        let mut hasher = bevy::utils::FixedState.build_hasher();

        let mut result = 0;

        for &rollback in components.iter() {
            let mut hasher = hasher.clone();

            // Hashing the rollback index ensures this hash is unique and stable
            rollback_ordered.order(rollback).hash(&mut hasher);

            // XOR chosen over addition or multiplication as it is closed on u64 and commutative
            result ^= hasher.finish();
        }

        // Hash the XOR'ed result to break commutativity with other types
        result.hash(&mut hasher);

        let result = ChecksumPart(hasher.finish() as u128);

        trace!(
            "Rollback Entities have checksum {:X}",
            result.0
        );

        if let Ok(mut checksum) = checksum.get_single_mut() {
            *checksum = result;
        } else {
            commands.spawn((result, ChecksumFlag::<Entity>::default()));
        }
    }
}

impl Plugin for EntityChecksumPlugin {
    fn build(&self, app: &mut App) {
        app.add_systems(SaveWorld, Self::update.in_set(SaveWorldSet::Checksum));
    }
}

I added it as the only system contributing to the checksum, and the same desync behaviour was observed. I'm about to go to bed, but that leaves three possibilities:

  1. Particles are being spawned when they shouldn't be.
  2. Particles aren't being despawned when they should be.
  3. The value in RollbackOrdered is not synchronised.

Writing this out, I now know it has to be 3. I think if a rollback entity is added on a predicted frame and is then rolled back, it'll stay in the order resource and break sync with all clients. In the morning I'll make a PR that ensures rollback entities created pre-emptively are removed from RollbackOrdered.

@bushrat011899
Contributor

I believe I have the desync issues with this branch solved with #82! It was a combination of what I suspected (rollback entities being de/spawned during a rollback messing up the RollbackOrdered ordering) and a completely different issue where Bevy would choose a different generation value for entities spawned during a rollback, destroying Rollback ordering for all clients that had to roll back.

I tested with a rebase and couldn't induce a desync. I'm honestly shocked that hashing a f32 velocity value by its bits wasn't the cause of the desync issues!

@johanhelsing
Collaborator Author

johanhelsing commented Oct 31, 2023

I'm honestly shocked that hashing a f32 velocity value by its bits wasn't the cause of the desync issues!

As far as I've understood, it's fine as long as we don't get any NaN values; in fact, we rely on it for determinism? I guess the hasher could check for NaN and just produce a deterministic bogus checksum (or warn)?

@gschup
Owner

gschup commented Oct 31, 2023

We should definitely avoid hashing the bits of a float just like that. Special cases like NaN (both quiet and signalling) and Inf can become troublesome.

https://reference.opcfoundation.org/Core/Part6/v104/docs/5.2.2.3#:~:text=The%20floating%2Dpoint%20type%20supports,000000000000F8FF)%20or%20(0000C0FF).

@gschup
Owner

gschup commented Oct 31, 2023

On the other hand, working with floats in Rust has proven to be surprisingly deterministic in my experience.

@johanhelsing
Collaborator Author

johanhelsing commented Oct 31, 2023

We should definitely avoid hashing the bits of a float just like that. Special cases like NaN (both quiet and signalling) and Inf can become troublesome.

https://reference.opcfoundation.org/Core/Part6/v104/docs/5.2.2.3#:~:text=The%20floating%2Dpoint%20type%20supports,000000000000F8FF)%20or%20(0000C0FF).

I'm of two minds... aren't you in a bit of a pickle if your floats have reached Inf or NaN anyway? Wouldn't it be good to know?

@gschup
Owner

gschup commented Oct 31, 2023

I'm of two minds... aren't you in a bit of a pickle if your floats have reached inf or nan anyway? Wouldn't it be good to know?

For our purposes, definitely. I think adding a NaN/Inf check just before hashing the bits should be sufficient for this example.

@johanhelsing
Collaborator Author

For our purposes, definitely. I think adding a NaN/Inf check just before hashing the bits should be sufficient for this example.

I added an assert and a comment.
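
For reference, such an assert-before-hashing helper might look roughly like this (a sketch, not the exact code added in the PR):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Refuse to checksum non-finite floats, then hash the raw bits.
// Bit-identical floats are exactly what the determinism contract requires,
// but NaN payloads and infinities signal a simulation bug, so we panic
// loudly instead of hashing them silently.
fn checksum_f32<H: Hasher>(value: f32, hasher: &mut H) {
    assert!(value.is_finite(), "cannot checksum a NaN or infinite float");
    value.to_bits().hash(hasher);
}

fn main() {
    let mut hasher = DefaultHasher::new();
    checksum_f32(1.25, &mut hasher);
    println!("checksum part: {:x}", hasher.finish());
}
```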

@gschup
Owner

gschup commented Oct 31, 2023

So this stress test is ready to be merged, then?

@gschup gschup merged commit bf239c5 into gschup:main Nov 1, 2023
1 check passed