[Merged by Bors] - Reliable change detection #1471

alice-i-cecile · 2021-02-18T20:28:16Z

Problem Definition

The current change tracking (via flags for both components and resources) fails to detect changes made by systems that are scheduled to run earlier in the frame than they are.

This issue is discussed at length in #68 and #54.

This is very much a draft PR, and contributions are welcome and needed.

Criteria

Each change is detected at least once, no matter the ordering.
Each change is detected at most once, no matter the ordering.
Changes should be detected the same frame that they are made.
Competitive ergonomics. Ideally does not require opting-in.
Low CPU overhead of computation.
Memory efficient. This must not increase over time, except where the number of entities / resources does.
Changes should not be lost for systems that don't run.
A frame needs to act as a pure function. Given the same set of entities / components it needs to produce the same end state without side-effects.

Exact change-tracking proposals satisfy criteria 1 and 2.
Conservative change-tracking proposals satisfy criteria 1 but not 2.
Flaky change tracking proposals satisfy criteria 2 but not 1.

Code Base Navigation

There are three types of flags:

Added: A piece of data was added to an entity / Resources.
Mutated: A piece of data was able to be modified, because its DerefMut was accessed
Changed: The bitwise OR of Added and Changed

The special behavior of ChangedRes, with respect to the scheduler is being removed in #1313 and does not need to be reproduced.

ChangedRes and friends can be found in "bevy_ecs/core/resources/resource_query.rs".

The Flags trait for Components can be found in "bevy_ecs/core/query.rs".

ComponentFlags are stored in "bevy_ecs/core/archetypes.rs", defined on line 446.

Proposals

Proposal 5 was selected for implementation.

Proposal 0: No Change Detection

The baseline, where computations are performed on everything regardless of whether it changed.

Type: Conservative

Pros:

already implemented
will never miss events
no overhead

Cons:

tons of repeated work
doesn't allow users to avoid repeating work (or monitoring for other changes)

Proposal 1: Earlier-This-Tick Change Detection

The current approach as of Bevy 0.4. Flags are set, and then flushed at the end of each frame.

Type: Flaky

Pros:

already implemented
simple to understand
low memory overhead (2 bits per component)
low time overhead (clear every flag once per frame)

Cons:

misses systems based on ordering
systems that don't run every frame miss changes
duplicates detection when looping
can lead to unresolvable circular dependencies

Proposal 2: Two-Tick Change Detection

Flags persist for two frames, using a double-buffer system identical to that used in events.

A change is observed if it is found in either the current frame's list of changes or the previous frame's.

Type: Conservative

Pros:

easy to understand
easy to implement
low memory overhead (4 bits per component)
low time overhead (bit mask and shift every flag once per frame)

Cons:

can result in a great deal of duplicated work
systems that don't run every frame miss changes
duplicates detection when looping

Proposal 3: Last-Tick Change Detection

Flags persist for two frames, using a double-buffer system identical to that used in events.

A change is observed if it is found in the previous frame's list of changes.

Type: Exact

Pros:

exact
easy to understand
easy to implement
low memory overhead (4 bits per component)
low time overhead (bit mask and shift every flag once per frame)

Cons:

change detection is always delayed, possibly causing painful chained delays
systems that don't run every frame miss changes
duplicates detection when looping

Proposal 4: Flag-Doubling Change Detection

Combine Proposal 2 and Proposal 3. Differentiate between JustChanged (current behavior) and Changed (Proposal 3).

Pack this data into the flags according to this implementation proposal.

Type: Flaky + Exact

Pros:

allows users to acc
easy to implement
low memory overhead (4 bits per component)
low time overhead (bit mask and shift every flag once per frame)

Cons:

users must specify the type of change detection required
still quite fragile to system ordering effects when using the flaky JustChanged form
cannot get immediate + exact results
systems that don't run every frame miss changes
duplicates detection when looping

[SELECTED] Proposal 5: Generation-Counter Change Detection

A global counter is increased after each system is run. Each component saves the time of last mutation, and each system saves the time of last execution. Mutation is detected when the component's counter is greater than the system's counter. Discussed here. How to handle addition detection is unsolved; the current proposal is to use the highest bit of the counter as in proposal 1.

Type: Exact (for mutations), flaky (for additions)

Pros:

low time overhead (set component counter on access, set system counter after execution)
robust to systems that don't run every frame
robust to systems that loop

Cons:

moderately complex implementation
must be modified as systems are inserted dynamically
medium memory overhead (4 bytes per component + system)
unsolved addition detection

Proposal 6: System-Data Change Detection

For each system, track which system's changes it has seen. This approach is only worth fully designing and implementing if Proposal 5 fails in some way.

Type: Exact

Pros:

exact
conceptually simple

Cons:

requires storing data on each system
implementation is complex
must be modified as systems are inserted dynamically

Proposal 7: Total-Order Change Detection

Discussed here. This proposal is somewhat complicated by the new scheduler, but I believe it should still be conceptually feasible. This approach is only worth fully designing and implementing if Proposal 5 fails in some way.

Type: Exact

Pros:

exact
efficient data storage relative to other exact proposals

Cons:

requires access to the scheduler
complex implementation and difficulty grokking
must be modified as systems are inserted dynamically

Tests

We will need to verify properties 1, 2, 3, 7 and 8. Priority: 1 > 2 = 3 > 8 > 7
Ideally we can use identical user-facing syntax for all proposals, allowing us to re-use the same syntax for each.
When writing tests, we need to carefully specify order using explicit dependencies.
These tests will need to be duplicated for both components and resources.
We need to be sure to handle cases where ambiguous system orders exist.

changing_system is always the system that makes the changes, and detecting_system always detects the changes.

The component / resource changed will be simple boolean wrapper structs.

Basic Added / Mutated / Changed

2 x 3 design:

Resources vs. Components
Added vs. Changed vs. Mutated
changing_system runs before detecting_system
verify at the end of tick 2

At Least Once

2 x 3 design:

Resources vs. Components
Added vs. Changed vs. Mutated
changing_system runs after detecting_system
verify at the end of tick 2

At Most Once

2 x 3 design:

Resources vs. Components
Added vs. Changed vs. Mutated
changing_system runs once before detecting_system
increment a counter based on the number of changes detected
verify at the end of tick 2

Fast Detection

2 x 3 design:

Resources vs. Components
Added vs. Changed vs. Mutated
changing_system runs before detecting_system
verify at the end of tick 1

Ambiguous System Ordering Robustness

2 x 3 x 2 design:

Resources vs. Components
Added vs. Changed vs. Mutated
changing_system runs [before/after] detecting_system in tick 1
changing_system runs [after/before] detecting_system in tick 2

System Pausing

2 x 3 design:

Resources vs. Components
Added vs. Changed vs. Mutated
changing_system runs in tick 1, then is disabled by run criteria
detecting_system is disabled by run criteria until it is run once during tick 3
verify at the end of tick 3

Addition Causes Mutation

2 design:

Resources vs. Components
adding_system_1 adds a component / resource
adding system_2 adds the same component / resource
verify the Mutated flag at the end of the tick
verify the Added flag at the end of the tick

First check tests for: #333
Second check tests for: #1443

Changes Made By Commands

adding_system runs in Update in tick 1, and sends a command to add a component
detecting_system runs in Update in tick 1 and 2, after adding_system
We can't detect the changes in tick 1, since they haven't been processed yet
If we were to track these changes as being emitted by adding_system, we can't detect the changes in tick 2 either, since detecting_system has already run once after adding_system :(

Benchmarks

See: general advice, Criterion crate

There are several critical parameters to vary:

entity count (1 to 10^9)
fraction of entities that are changed (0% to 100%)
cost to perform work on changed entities, i.e. workload (1 ns to 1s)

1 and 2 should be varied between benchmark runs. 3 can be added on computationally.

We want to measure:

memory cost
run time

We should collect these measurements across several frames (100?) to reduce bootup effects and accurately measure the mean, variance and drift.

Entity-component change detection is much more important to benchmark than resource change detection, due to the orders of magnitude higher number of pieces of data.

No change detection at all should be included in benchmarks as a second control for cases where missing changes is unacceptable.

Graphs

y: performance, x: log_10(entity count), color: proposal, facet: performance metric. Set cost to perform work to 0.
y: run time, x: cost to perform work, color: proposal, facet: fraction changed. Set number of entities to 10^6
y: memory, x: frames, color: proposal

Conclusions

Is the theoretical categorization of the proposals correct according to our tests?
How does the performance of the proposals compare without any load?
How does the performance of the proposals compare with realistic loads?
At what workload does more exact change tracking become worth the (presumably) higher overhead?
When does adding change-detection to save on work become worthwhile?
Is there enough divergence in performance between the best solutions in each class to ship more than one change-tracking solution?

Implementation Plan

Write a test suite.
Verify that tests fail for existing approach.
Write a benchmark suite.
Get performance numbers for existing approach.
Implement, test and benchmark various solutions using a Git branch per proposal.
Create a draft PR with all solutions and present results to team.
Select a solution and replace existing change detection.

alice-i-cecile · 2021-02-18T20:28:55Z

Results

Placeholder post so future readers can quickly read what the conclusion was.

Davier · 2021-02-27T05:33:47Z

There seems to be an issue with the overflow handling that was proposed for the counters in #68.
The idea is to detect and correct the global counter overflow when the difference between a system counter and a component counter is larger than u32::MAX / 2, but that obviously fails when these counters are genuinely that far apart.
As previously mentioned, this can be ignored for the case of old system counters, by specifying that any system that does change detection must run at least once for every u32::MAX / 2 systems run, otherwise false negatives may happen.
However, this is problematic for old component counters, because they create false positives instead. A lot of components are never mutated. More importantly, added counters are never changed by design, so the problem will happen after u32::MAX / 2 systems are run, and will create false positives every time until u32::MAX / 2 more systems are run.

I have a solution in mind but it seems bad: after every u32::MAX / 4 global counts, iterate all the component and system counters and increase them by u32::MAX / 4 if they are older than u32::MAX * 3 / 4. That way, there can be no false positive, only false negatives when systems are not run for u32::MAX / 2 counts.

Davier · 2021-02-27T05:37:55Z

A lot of false positives can be avoided by checking that the component counter is older than the global counter, but that still doesn't prevent all of them.

nebbishhacker · 2021-02-28T18:36:22Z

I have a solution in mind but it seems bad: after every u32::MAX / 4 global counts, iterate all the component and system counters and increase them by u32::MAX / 4 if they are older than u32::MAX * 3 / 4. That way, there can be no false positive, only false negatives when systems are not run for u32::MAX / 2 counts.

There's an even more thorough solution: when the global count in in danger of overflowing u32::MAX, sort all the counters in ascending order and renumber them from zero.

The logic would look something like this:

// Only trigger if the global counter is going to overflow this frame
if global_counter >= u32::MAX - (systems.len() as u32) {
    let mut counters: Vec<&mut u32> = iter::once(&mut global_counter)
        .chain(components.iter_mut().map(|c| &mut c.counter))
        .chain(systems.iter_mut().map(|s| &mut s.counter))
        .collect();
    counters.sort();

    let mut current_num = 0;
    let mut last_counter = *counters[0];
    for counter in counters {
        if *counter != last_counter {
            current_num += 1;
            last_counter = *counter;
        }

        *counter = current_num;
    }
}

This should prevent the global counter from overflowing in the first place whilst preserving the ordering of the counters, thus preventing both false positives and false negatives. It's a little bit more expensive than your solution, but the cost is probably acceptable given how rarely it would trigger.

Davier · 2021-03-01T06:59:59Z

There's an even more thorough solution: when the global count in in danger of overflowing u32::MAX, sort all the counters in ascending order and renumber them from zero.

It would remove the need to do any bookkeeping at the end of each frame so it's worth considering, but I'm not sold on it since it would probably consistently produce a lag spike. In large applications, especially if some part of the schedule is not frame limited, that could happen every couple of hours.

alice-i-cecile · 2021-03-01T08:35:26Z

There's an even more thorough solution: when the global count in in danger of overflowing u32::MAX, sort all the counters in ascending order and renumber them from zero.

It would remove the need to do any bookkeeping at the end of each frame so it's worth considering, but I'm not sold on it since it would probably consistently produce a lag spike. In large applications, especially if some part of the schedule is not frame limited, that could happen every couple of hours.

Could we precompute much of its work and spread it out across a few frames in advance? It's not like this overflow is likely to come as a surprise.

YohDeadfall · 2021-03-01T12:45:46Z

Not a game developer, so has not so much knowledge in that area and might be wrong, but if correctly understand systems run each frame and are triggered from a single thread. Therefore, each frame the global counter's value might be saved and used to determine the starting point and mitigate overflows.

Unfortunately, if some component won't be modified for a very long time it might appear as mutated, but I don't think that it's a case since if there are 1000 systems at 60 fps rate then it means the component will be untouched about 19 hours.

static GLOBAL_COUNTER_PREV: u32 = 1;
static GLOBAL_COUNTER: AtomicU32 = AtomicU32::new(GLOBAL_COUNTER_PREV);

fn on_component_mutation() {
    component.counter = GLOBAL_COUNTER;
}

fn is_component_mutated() -> bool {
    let d = system.counter.wrapping_sub(GLOBAL_COUNTER_PREV);
    let i = component.counter.wrapping_sub(GLOBAL_COUNTER_PREV);

    i > d
}

fn after_run_system() {
    GLOBAL_COUNTER += 1;
    system.counter = GLOBAL_COUNTER;
}

fn after_run() {
    GLOBAL_COUNTER_PREV = GLOBAL_COUNTER;
}

nebbishhacker · 2021-03-01T16:17:48Z

Could we precompute much of its work and spread it out across a few frames in advance? It's not like this overflow is likely to come as a surprise.

One approach I can think of would be to make a copy of all the counters, sort and renumber them on a worker thread, and build a mapping from the old numbering to the new numbering which can then be applied later on the main thread. Any counters that are updated while this is happening would be adjusted via a simple subtraction.

// on the main thread
let mut counters: Vec<u32> = iter::once(global_counter)
    .chain(components.iter().map(|c| c.counter))
    .chain(systems.iter().map(|s| s.counter))
    .collect();
let highest_counter = global_counter;

// on a background thread
counters.sort();
counters.dedup();
let renumbered_counter_map: HashMap<u32, u32> = counters.into_iter().zip(0..).collect();
let adjustment = highest_counter - renumbered_counter_map[&highest_counter];

// back on the main thread in a later frame
for counter in iter::once(&mut global_counter)
    .chain(components.iter_mut().map(|c| &mut c.counter))
    .chain(systems.iter_mut().map(|s| &mut s.counter))
{
    match renumbered_counter_map.get(counter) {
        Some(&c) => *counter = c,
        None => *counter -= adjustment,
    }
}

Unfortunately, if some component won't be modified for a very long time it might appear as mutated, but I don't think that it's a case since if there are 1000 systems at 60 fps rate then it means the component will be untouched about 19 hours.

I believe that's the precise issue @Davier was concerned about. It's also worth keeping in mind that not every game is capped to 60fps - a well optimized game running on a powerful machine with no frame caps can easily end up pushing 1000+fps, which multiplied by 1000 systems would lead to an overflow in a bit over an hour.

YohDeadfall · 2021-03-01T17:03:40Z

@nebbishhacker, when does checking of an component for changes happen? If I would know more details about it, I might help since what you're doing right now is a casual row visibility control which is done by database engines. For example, PostgreSQL uses unsigned 32-bit integer for transactions and future overflows by executing VACUUM command.

Mine idea on it is to reset component counters while checking them since you're aware of an overflow and already iterating over components to detect changes, so you even can avoid writing back to RAM if the component's counter was overflew too.

Otherwise, you can mark each component type to be updated in the future when values are requested. In that case if a component is so rarely updated that the global counter might overflow at least once, then triggering an update of its counters doesn't worth the time and can be postponed. Or you can do it by fixing them in background by testing values for being greater than the global counter and using compare exchange to write them back to RAM. Compare exchange isn't a cheap operation, but since it's on a background thread, performance won't degrade for other threads.

bjorn3 · 2021-03-01T17:23:31Z

Mine idea on it is to reset component counters while checking them

If you reset the counters, you have to reset them for all components and systems at the same time for change detection to work reliably.

Or you can do it by fixing them in background by testing values for being greater than the global counter and using compare exchange to write them back to RAM.

Doing it in the background won't work as that would cause a race condition between updating the counter for a system and for all components it accesses, causing changes to be spuriously reported or not reported at all.

YohDeadfall · 2021-03-01T18:59:07Z

If you reset the counters, you have to reset them for all components and systems at the same time for change detection to work reliably.

Not following. If some component wasn't modified by the system its value is guarantied to be equal or less than the system's one. And to prevent false positives there's a second hidden condition to test the counter against GLOBAL_COUNTER, it works implicitly thankfully to wrapping_sub:

let global = 10u32;
let global_prev = u32::MAX - 10;

let component_a = global_prev.wrapping_add(5);
let component_b = global_prev.wrapping_add(10);
let component_c = global_prev.wrapping_add(15);

let system_x = global_prev.wrapping_add(6);
let system_y = global_prev.wrapping_add(12);

println!("A: {}", component_a);
println!("B: {}", component_b);
println!("C: {}", component_c);
println!();

println!("X: {}", system_x);
println!("Y: {}", system_y);
println!();

println!("Was A changed for X? {}", changed(global_prev, system_x, component_a));
println!("Was B changed for X? {}", changed(global_prev, system_x, component_b));
println!("Was C changed for X? {}", changed(global_prev, system_x, component_c));
println!("Was A changed for Y? {}", changed(global_prev, system_y, component_a));
println!("Was B changed for Y? {}", changed(global_prev, system_y, component_b));
println!("Was C changed for Y? {}", changed(global_prev, system_y, component_c));

fn changed(g: u32, s: u32, c: u32) -> bool {
    let d = s.wrapping_sub(g);
    let i = c.wrapping_sub(g);

    i > d
}

A: 4294967290
B: 4294967295
C: 4

X: 4294967291
Y: 1

Was A changed for X? false
Was B changed for X? true
Was C changed for X? true
Was A changed for Y? false
Was B changed for Y? false
Was C changed for Y? true

As you can find nor A neither B are modified for Y. It proves that even if some counter has a greater value than the system, it still can be unchanged for it because of the adjusted value comparison.

Doing it in the background won't work as that would cause a race condition between updating the counter for a system and for all components it accesses, causing changes to be spuriously reported or not reported at all.

Something like this should work since it involves a double check:

// don't care about value the global counter stores
// overflow already happened and our goal to update
// counters to the last know good value after the event
// so `new` will store value after the overflow, but it will
// be less than the current global counter's value and
// all updated components won't be marked as changed
let new = GLOBAL_COUNTER_PREV;
let current = component.counter.load(Ordering::Acquire);

if current > new {
    // if someone changed the value in parallel exchange will fail
    // in that case ignore the component and move on to the next
    component.counter.compare_exchange(current, new, Ordering::Acquire, Ordering::Relaxed)
}

And even if that code will update counters in background and there will be other running systems, there's a plenty of time to update them concurrently.

nebbishhacker · 2021-03-01T19:28:53Z

// don't care about value the global counter stores
// overflow already happened and our goal to update
// counters to the last know good value after the event
// so `new` will store value after the overflow, but it will
// be less than the current global counter's value and
// all updated components won't be marked as changed
let new = GLOBAL_COUNTER_PREV;
let current = component.counter.load(Ordering::Acquire);

if current > new {
    // if someone changed the value in parallel exchange will fail
    // in that case ignore the component and move on to the next
    component.counter.compare_exchange(current, new, Ordering::Acquire, Ordering::Relaxed)
}

So your background process would search for components that were modified more than one frame ago, and update their counters so that they appear to have been modified exactly one frame ago? That would cause false positives for any system that didn't run in the last frame (see criteria 7 in #1471 (comment)).

YohDeadfall · 2021-03-01T19:57:53Z

Forgot to mention that system counters should be updated too before component counters, so any system which was disabled for the last N frames won't see false positives:

let new = GLOBAL_COUNTER_PREV;
for system in systems.iter_mut() {
    let current = system.counter.load(Ordering::Acquire);
    if current > new {
        // if someone changed the value in parallel exchange will fail
        // in that case ignore the system and move on to the next
        system.counter.compare_exchange(current, new, Ordering::Acquire, Ordering::Relaxed)
    }
}

for component in components.iter_mut() {
    let current = component.counter.load(Ordering::Acquire);
    if current > new {
        // if someone changed the value in parallel exchange will fail
        // in that case ignore the component and move on to the next
        component.counter.compare_exchange(current, new, Ordering::Acquire, Ordering::Relaxed)
    }
}

nebbishhacker · 2021-03-01T20:08:21Z

Forgot to mention that system counters should be updated too before component counters, so any system which was disabled for the last N frames won't see false positives:

let new = GLOBAL_COUNTER_PREV;
for system in systems.iter_mut() {
    let current = system.counter.load(Ordering::Acquire);
    if current > new {
        // if someone changed the value in parallel exchange will fail
        // in that case ignore the system and move on to the next
        system.counter.compare_exchange(current, new, Ordering::Acquire, Ordering::Relaxed)
    }
}

for component in components.iter_mut() {
    let current = component.counter.load(Ordering::Acquire);
    if current > new {
        // if someone changed the value in parallel exchange will fail
        // in that case ignore the component and move on to the next
        component.counter.compare_exchange(current, new, Ordering::Acquire, Ordering::Relaxed)
    }
}

This would make any system which was disabled for the last N frames miss out on changes completely, replacing the false positives with false negatives.

YohDeadfall · 2021-03-01T21:34:31Z

Ah, got it. Now I see why your code does that difference computation. In that case we should know a distance between the maximum and minimum values of system counters. Since the number of systems is not so large to affect single frame performance by computation of the required adjustment:

fn after_run() {
    if GLOBAL_COUNTER_PREV > GLOBAL_COUNTER {
        let mut min: u32 = 0;
        let mut max: u32 = 0;

        for counter in systems.iter().map(|x| x.counter) {
            min = cmp::min(min, counter);
            max = cmp::max(max, counter);
        }

        // possible optimization
        // if `min > 0` decrease it and `max` by `min`

        // now you know a gap which should be kept
        // adjust counters by incrementing them by an
        // absolute distance between the global and `min`
        let distance = GLOBAL_COUNTER.wrapping_sub(min);

        // update systems and components
        // if some component has value less
        // than `min` set it to `min + distance`
        // note: by "less" I mean wrapping less
        // ...

        // it gives a gap needed for update
        GLOBAL_COUNTER += max.wrapping_sub(min);
    }

    GLOBAL_COUNTER_PREV = GLOBAL_COUNTER;
}

It's a non-allocating counter update which should be done synchronously to avoid false negatives. If it's correct now, we can think a bit about parallelization. You know what resources uses each system, so can have a map of locks to prevent an updating system and its components from being executed or updated. To do it in more efficient way just group them by used components.

For example, you have system X depending on A and B; system Y depending on A and C; and system Z depending on D. The first bucket contains A and B components and X fails into it. Later we see that Y has A which is in the first bucket, so we add C to it and have A, B and C as components in the first bucket while X and Y as depending system. Later we verify Z which shares no components with other systems and it has its own personal bucket with D inside. Now you can bundle updates and put an exclusive lock on the current bucket, so if an system has started execution it will prevent its update in background and vice versa.

By the way, thank you for helping me to understand how it works or should work (:

alice-i-cecile · 2021-03-02T00:29:19Z

The relevant draft of #5 just got merged in can be found here for review.

crates/bevy_ecs/src/lib.rs

crates/bevy_ecs/src/system/into_system.rs

crates/bevy_ecs/src/world/mod.rs

Davier · 2021-03-02T06:51:37Z

I think it's fine to require that systems detecting change are run at some minimal rate (e.g. at least every u32::MAX / 2 global counts), if that brings better performance to the main cases and if the only consequence of failing that requirement is false negatives.
One of the requirements for the change detection is to be light enough to be enabled everywhere.

alice-i-cecile · 2021-03-02T06:55:45Z

I think it's fine to require that systems detecting change are run at some minimal rate (e.g. at least every u32::MAX / 2 global counts), if that brings better performance to the main cases and if the only consequence of failing that requirement is false negatives.
One of the requirements for the change detection is to be light enough to be enabled everywhere.

I agree with this: for very infrequent systems, manual event handling, resources or some form of change detection caching (see #1353) are the way to go. That said, this should be clearly noted in the docs, and we would ideally suggest work-arounds and tests for this there. Pausing is definitely the most serious issue here, especially when taken to more critical domains than games, and should eventually have a dedicated caching solution to avoid dangerous false negative bugs.

Davier · 2021-03-02T08:02:37Z

One of the unsolved question is how to properly handle direct queries (world.query()). It should work transparently in exclusive system (cart suggested to save the current system counter in World), and be possible outside of systems (like all the tests). I think I need to add a new struct that wraps QueryState.

YohDeadfall · 2021-03-02T14:25:47Z

A small implementation note consider @nebbishhacker's and mine proposals. None of them provide guaranties that a new overflow won't happen too soon after updating counters, but it's worth to note that without taking the smallest counter's value for both systems and components is a less reliable solution since one component may be untouched for such long time that in a worst case it even won't solve the overflow problem at all. The same applies to systems as well, but it should be more rare case.

rmsc · 2021-03-02T23:25:41Z

May I suggest one minor tweak?

Don't use a global system counter, instead keep one counter per component type (or archetype)
Increment that counter only when a component of that type (or archetype) may be mutated (iter_mut())

This should slow down the counters to the point where overflows are far in between. That makes it very unlikely that a system will ever see more than half a u32 of difference.

YohDeadfall · 2021-03-03T00:41:43Z

Sounds like a good idea, but is it possible for a system to query components of different types?

Davier · 2021-03-03T05:07:06Z

May I suggest one minor tweak?

Don't use a global system counter, instead keep one counter per component type (or archetype)

Increment that counter only when a component of that type (or archetype) may be mutated (iter_mut())

This should slow down the counters to the point where overflows are far in between. That makes it very unlikely that a system will ever see more than half a u32 of difference.

Unfortunately, that's not a minor tweak implementation-wise. It seems quite involved.

crates/bevy_ecs/src/schedule/stage.rs

cart · 2021-03-19T17:53:08Z

bors r+

# Problem Definition The current change tracking (via flags for both components and resources) fails to detect changes made by systems that are scheduled to run earlier in the frame than they are. This issue is discussed at length in [#68](#68) and [#54](#54). This is very much a draft PR, and contributions are welcome and needed. # Criteria 1. Each change is detected at least once, no matter the ordering. 2. Each change is detected at most once, no matter the ordering. 3. Changes should be detected the same frame that they are made. 4. Competitive ergonomics. Ideally does not require opting-in. 5. Low CPU overhead of computation. 6. Memory efficient. This must not increase over time, except where the number of entities / resources does. 7. Changes should not be lost for systems that don't run. 8. A frame needs to act as a pure function. Given the same set of entities / components it needs to produce the same end state without side-effects. **Exact** change-tracking proposals satisfy criteria 1 and 2. **Conservative** change-tracking proposals satisfy criteria 1 but not 2. **Flaky** change tracking proposals satisfy criteria 2 but not 1. # Code Base Navigation There are three types of flags: - `Added`: A piece of data was added to an entity / `Resources`. - `Mutated`: A piece of data was able to be modified, because its `DerefMut` was accessed - `Changed`: The bitwise OR of `Added` and `Changed` The special behavior of `ChangedRes`, with respect to the scheduler is being removed in [#1313](#1313) and does not need to be reproduced. `ChangedRes` and friends can be found in "bevy_ecs/core/resources/resource_query.rs". The `Flags` trait for Components can be found in "bevy_ecs/core/query.rs". `ComponentFlags` are stored in "bevy_ecs/core/archetypes.rs", defined on line 446. # Proposals **Proposal 5 was selected for implementation.** ## Proposal 0: No Change Detection The baseline, where computations are performed on everything regardless of whether it changed. **Type:** Conservative **Pros:** - already implemented - will never miss events - no overhead **Cons:** - tons of repeated work - doesn't allow users to avoid repeating work (or monitoring for other changes) ## Proposal 1: Earlier-This-Tick Change Detection The current approach as of Bevy 0.4. Flags are set, and then flushed at the end of each frame. **Type:** Flaky **Pros:** - already implemented - simple to understand - low memory overhead (2 bits per component) - low time overhead (clear every flag once per frame) **Cons:** - misses systems based on ordering - systems that don't run every frame miss changes - duplicates detection when looping - can lead to unresolvable circular dependencies ## Proposal 2: Two-Tick Change Detection Flags persist for two frames, using a double-buffer system identical to that used in events. A change is observed if it is found in either the current frame's list of changes or the previous frame's. **Type:** Conservative **Pros:** - easy to understand - easy to implement - low memory overhead (4 bits per component) - low time overhead (bit mask and shift every flag once per frame) **Cons:** - can result in a great deal of duplicated work - systems that don't run every frame miss changes - duplicates detection when looping ## Proposal 3: Last-Tick Change Detection Flags persist for two frames, using a double-buffer system identical to that used in events. A change is observed if it is found in the previous frame's list of changes. **Type:** Exact **Pros:** - exact - easy to understand - easy to implement - low memory overhead (4 bits per component) - low time overhead (bit mask and shift every flag once per frame) **Cons:** - change detection is always delayed, possibly causing painful chained delays - systems that don't run every frame miss changes - duplicates detection when looping ## Proposal 4: Flag-Doubling Change Detection Combine Proposal 2 and Proposal 3. Differentiate between `JustChanged` (current behavior) and `Changed` (Proposal 3). Pack this data into the flags according to [this implementation proposal](#68 (comment)). **Type:** Flaky + Exact **Pros:** - allows users to acc - easy to implement - low memory overhead (4 bits per component) - low time overhead (bit mask and shift every flag once per frame) **Cons:** - users must specify the type of change detection required - still quite fragile to system ordering effects when using the flaky `JustChanged` form - cannot get immediate + exact results - systems that don't run every frame miss changes - duplicates detection when looping ## [SELECTED] Proposal 5: Generation-Counter Change Detection A global counter is increased after each system is run. Each component saves the time of last mutation, and each system saves the time of last execution. Mutation is detected when the component's counter is greater than the system's counter. Discussed [here](#68 (comment)). How to handle addition detection is unsolved; the current proposal is to use the highest bit of the counter as in proposal 1. **Type:** Exact (for mutations), flaky (for additions) **Pros:** - low time overhead (set component counter on access, set system counter after execution) - robust to systems that don't run every frame - robust to systems that loop **Cons:** - moderately complex implementation - must be modified as systems are inserted dynamically - medium memory overhead (4 bytes per component + system) - unsolved addition detection ## Proposal 6: System-Data Change Detection For each system, track which system's changes it has seen. This approach is only worth fully designing and implementing if Proposal 5 fails in some way. **Type:** Exact **Pros:** - exact - conceptually simple **Cons:** - requires storing data on each system - implementation is complex - must be modified as systems are inserted dynamically ## Proposal 7: Total-Order Change Detection Discussed [here](#68 (comment)). This proposal is somewhat complicated by the new scheduler, but I believe it should still be conceptually feasible. This approach is only worth fully designing and implementing if Proposal 5 fails in some way. **Type:** Exact **Pros:** - exact - efficient data storage relative to other exact proposals **Cons:** - requires access to the scheduler - complex implementation and difficulty grokking - must be modified as systems are inserted dynamically # Tests - We will need to verify properties 1, 2, 3, 7 and 8. Priority: 1 > 2 = 3 > 8 > 7 - Ideally we can use identical user-facing syntax for all proposals, allowing us to re-use the same syntax for each. - When writing tests, we need to carefully specify order using explicit dependencies. - These tests will need to be duplicated for both components and resources. - We need to be sure to handle cases where ambiguous system orders exist. `changing_system` is always the system that makes the changes, and `detecting_system` always detects the changes. The component / resource changed will be simple boolean wrapper structs. ## Basic Added / Mutated / Changed 2 x 3 design: - Resources vs. Components - Added vs. Changed vs. Mutated - `changing_system` runs before `detecting_system` - verify at the end of tick 2 ## At Least Once 2 x 3 design: - Resources vs. Components - Added vs. Changed vs. Mutated - `changing_system` runs after `detecting_system` - verify at the end of tick 2 ## At Most Once 2 x 3 design: - Resources vs. Components - Added vs. Changed vs. Mutated - `changing_system` runs once before `detecting_system` - increment a counter based on the number of changes detected - verify at the end of tick 2 ## Fast Detection 2 x 3 design: - Resources vs. Components - Added vs. Changed vs. Mutated - `changing_system` runs before `detecting_system` - verify at the end of tick 1 ## Ambiguous System Ordering Robustness 2 x 3 x 2 design: - Resources vs. Components - Added vs. Changed vs. Mutated - `changing_system` runs [before/after] `detecting_system` in tick 1 - `changing_system` runs [after/before] `detecting_system` in tick 2 ## System Pausing 2 x 3 design: - Resources vs. Components - Added vs. Changed vs. Mutated - `changing_system` runs in tick 1, then is disabled by run criteria - `detecting_system` is disabled by run criteria until it is run once during tick 3 - verify at the end of tick 3 ## Addition Causes Mutation 2 design: - Resources vs. Components - `adding_system_1` adds a component / resource - `adding system_2` adds the same component / resource - verify the `Mutated` flag at the end of the tick - verify the `Added` flag at the end of the tick First check tests for: #333 Second check tests for: #1443 ## Changes Made By Commands - `adding_system` runs in Update in tick 1, and sends a command to add a component - `detecting_system` runs in Update in tick 1 and 2, after `adding_system` - We can't detect the changes in tick 1, since they haven't been processed yet - If we were to track these changes as being emitted by `adding_system`, we can't detect the changes in tick 2 either, since `detecting_system` has already run once after `adding_system` :( # Benchmarks See: [general advice](https://github.com/bevyengine/bevy/blob/master/docs/profiling.md), [Criterion crate](https://github.com/bheisler/criterion.rs) There are several critical parameters to vary: 1. entity count (1 to 10^9) 2. fraction of entities that are changed (0% to 100%) 3. cost to perform work on changed entities, i.e. workload (1 ns to 1s) 1 and 2 should be varied between benchmark runs. 3 can be added on computationally. We want to measure: - memory cost - run time We should collect these measurements across several frames (100?) to reduce bootup effects and accurately measure the mean, variance and drift. Entity-component change detection is much more important to benchmark than resource change detection, due to the orders of magnitude higher number of pieces of data. No change detection at all should be included in benchmarks as a second control for cases where missing changes is unacceptable. ## Graphs 1. y: performance, x: log_10(entity count), color: proposal, facet: performance metric. Set cost to perform work to 0. 2. y: run time, x: cost to perform work, color: proposal, facet: fraction changed. Set number of entities to 10^6 3. y: memory, x: frames, color: proposal # Conclusions 1. Is the theoretical categorization of the proposals correct according to our tests? 2. How does the performance of the proposals compare without any load? 3. How does the performance of the proposals compare with realistic loads? 4. At what workload does more exact change tracking become worth the (presumably) higher overhead? 5. When does adding change-detection to save on work become worthwhile? 6. Is there enough divergence in performance between the best solutions in each class to ship more than one change-tracking solution? # Implementation Plan 1. Write a test suite. 2. Verify that tests fail for existing approach. 3. Write a benchmark suite. 4. Get performance numbers for existing approach. 5. Implement, test and benchmark various solutions using a Git branch per proposal. 6. Create a draft PR with all solutions and present results to team. 7. Select a solution and replace existing change detection. Co-authored-by: Brice DAVIER <bricedavier@gmail.com> Co-authored-by: Carter Anderson <mcanders1@gmail.com>

bors · 2021-03-19T18:08:15Z

Pull request successfully merged into main.

Build succeeded:

Removing the checks on this line https://github.com/bevyengine/bevy/blob/main/crates/bevy_sprite/src/frustum_culling.rs#L64 and running the "many_sprites" example revealed two corner case bugs in bevy_ecs. The first, a simple and honest missed line introduced in #1471. The other, an insidious monster that has been there since the ECS v2 rewrite, just waiting for the time to strike: 1. #1471 accidentally removed the "insert" line for sparse set components with the "mutated" bundle state. Re-adding it fixes the problem. I did a slight refactor here to make the implementation simpler and remove a branch. 2. The other issue is nastier. ECS v2 added an "archetype graph". When determining what components were added/mutated during an archetype change, we read the FromBundle edge (which encodes this state) on the "new" archetype. The problem is that unlike "add edges" which are guaranteed to be unique for a given ("graph node", "bundle id") pair, FromBundle edges are not necessarily unique: ```rust // OLD_ARCHETYPE -> NEW_ARCHETYPE // [] -> [usize] e.insert(2usize); // [usize] -> [usize, i32] e.insert(1i32); // [usize, i32] -> [usize, i32] e.insert(1i32); // [usize, i32] -> [usize] e.remove::<i32>(); // [usize] -> [usize, i32] e.insert(1i32); ``` Note that the second `e.insert(1i32)` command has a different "archetype graph edge" than the first, but they both lead to the same "new archetype". The fix here is simple: just remove FromBundle edges because they are broken and store the information in the "add edges", which are guaranteed to be unique. FromBundle edges were added to cut down on the number of archetype accesses / make the archetype access patterns nicer. But benching this change resulted in no significant perf changes and the addition of get_2_mut() for archetypes resolves the access pattern issue.

## Objective - ~~Make absurdly long-lived changes stay detectable for even longer (without leveling up to `u64`).~~ - Give all changes a consistent maximum lifespan. - Improve code clarity. ## Solution - ~~Increase the frequency of `check_tick` scans to increase the oldest reliably-detectable change.~~ (Deferred until we can benchmark the cost of a scan.) - Ignore changes older than the maximum reliably-detectable age. - General refactoring—name the constants, use them everywhere, and update the docs. - Update test cases to check for the specified behavior. ## Related This PR addresses (at least partially) the concerns raised in: - #3071 - #3082 (and associated PR #3084) ## Background - #1471 Given the minimum interval between `check_ticks` scans, `N`, the oldest reliably-detectable change is `u32::MAX - (2 * N - 1)` (or `MAX_CHANGE_AGE`). Reducing `N` from ~530 million (current value) to something like ~2 million would extend the lifetime of changes by a billion. | minimum `check_ticks` interval | oldest reliably-detectable change | usable % of `u32::MAX` | | --- | --- | --- | | `u32::MAX / 8` (536,870,911) | `(u32::MAX / 4) * 3` | 75.0% | | `2_000_000` | `u32::MAX - 3_999_999` | 99.9% | Similarly, changes are still allowed to be between `MAX_CHANGE_AGE`-old and `u32::MAX`-old in the interim between `check_tick` scans. While we prevent their age from overflowing, the test to detect changes still compares raw values. This makes failure ultimately unreliable, since when ancient changes stop being detected varies depending on when the next scan occurs. ## Open Question Currently, systems and system states are incorrectly initialized with their `last_change_tick` set to `0`, which doesn't handle wraparound correctly. For consistent behavior, they should either be initialized to the world's `last_change_tick` (and detect no changes) or to `MAX_CHANGE_AGE` behind the world's current `change_tick` (and detect everything as a change). I've currently gone with the latter since that was closer to the existing behavior. ## Follow-up Work (Edited: entire section) We haven't actually profiled how long a `check_ticks` scan takes on a "large" `World` , so we don't know if it's safe to increase their frequency. However, we are currently relying on play sessions not lasting long enough to trigger a scan and apps not having enough entities/archetypes for it to be "expensive" (our assumption). That isn't a real solution. (Either scanning never costs enough to impact frame times or we provide an option to use `u64` change ticks. Nobody will accept random hiccups.) To further extend the lifetime of changes, we actually only need to increment the world tick if a system has `Fetch: !ReadOnlySystemParamFetch`. The behavior will be identical because all writes are sequenced, but I'm not sure how to implement that in a way that the compiler can optimize the branch out. Also, since having no false positives depends on a `check_ticks` scan running at least every `2 * N - 1` ticks, a `last_check_tick` should also be stored in the `World` so that any lull in system execution (like a command flush) could trigger a scan if needed. To be completely robust, all the systems initialized on the world should be scanned, not just those in the current stage.

## Objective - ~~Make absurdly long-lived changes stay detectable for even longer (without leveling up to `u64`).~~ - Give all changes a consistent maximum lifespan. - Improve code clarity. ## Solution - ~~Increase the frequency of `check_tick` scans to increase the oldest reliably-detectable change.~~ (Deferred until we can benchmark the cost of a scan.) - Ignore changes older than the maximum reliably-detectable age. - General refactoring—name the constants, use them everywhere, and update the docs. - Update test cases to check for the specified behavior. ## Related This PR addresses (at least partially) the concerns raised in: - bevyengine#3071 - bevyengine#3082 (and associated PR bevyengine#3084) ## Background - bevyengine#1471 Given the minimum interval between `check_ticks` scans, `N`, the oldest reliably-detectable change is `u32::MAX - (2 * N - 1)` (or `MAX_CHANGE_AGE`). Reducing `N` from ~530 million (current value) to something like ~2 million would extend the lifetime of changes by a billion. | minimum `check_ticks` interval | oldest reliably-detectable change | usable % of `u32::MAX` | | --- | --- | --- | | `u32::MAX / 8` (536,870,911) | `(u32::MAX / 4) * 3` | 75.0% | | `2_000_000` | `u32::MAX - 3_999_999` | 99.9% | Similarly, changes are still allowed to be between `MAX_CHANGE_AGE`-old and `u32::MAX`-old in the interim between `check_tick` scans. While we prevent their age from overflowing, the test to detect changes still compares raw values. This makes failure ultimately unreliable, since when ancient changes stop being detected varies depending on when the next scan occurs. ## Open Question Currently, systems and system states are incorrectly initialized with their `last_change_tick` set to `0`, which doesn't handle wraparound correctly. For consistent behavior, they should either be initialized to the world's `last_change_tick` (and detect no changes) or to `MAX_CHANGE_AGE` behind the world's current `change_tick` (and detect everything as a change). I've currently gone with the latter since that was closer to the existing behavior. ## Follow-up Work (Edited: entire section) We haven't actually profiled how long a `check_ticks` scan takes on a "large" `World` , so we don't know if it's safe to increase their frequency. However, we are currently relying on play sessions not lasting long enough to trigger a scan and apps not having enough entities/archetypes for it to be "expensive" (our assumption). That isn't a real solution. (Either scanning never costs enough to impact frame times or we provide an option to use `u64` change ticks. Nobody will accept random hiccups.) To further extend the lifetime of changes, we actually only need to increment the world tick if a system has `Fetch: !ReadOnlySystemParamFetch`. The behavior will be identical because all writes are sequenced, but I'm not sure how to implement that in a way that the compiler can optimize the branch out. Also, since having no false positives depends on a `check_ticks` scan running at least every `2 * N - 1` ticks, a `last_check_tick` should also be stored in the `World` so that any lull in system execution (like a command flush) could trigger a scan if needed. To be completely robust, all the systems initialized on the world should be scanned, not just those in the current stage.

alice-i-cecile added A-Core Common functionality for all bevy apps A-ECS Entities, components, systems, and events help wanted P-High This is particularly urgent, and deserves immediate attention labels Feb 18, 2021

Davier mentioned this pull request Feb 19, 2021

System-order-independent ECS change tracking #68

Closed

Base automatically changed from master to main February 19, 2021 20:44

Davier reviewed Mar 2, 2021

View reviewed changes

crates/bevy_ecs/src/lib.rs Outdated Show resolved Hide resolved

crates/bevy_ecs/src/system/into_system.rs Outdated Show resolved Hide resolved

crates/bevy_ecs/src/world/mod.rs Outdated Show resolved Hide resolved

Davier added 7 commits March 18, 2021 13:07

even more renames

ae3f3aa

comment change detection

27a7292

cargo fmt

5c4927c

Change default tick values

a8075a2

Re-add and adapt removed test

7713983

Simplify clear_trackers

3922267

Fix merge issue

59f8c85

cart reviewed Mar 18, 2021

View reviewed changes

crates/bevy_ecs/src/schedule/stage.rs Show resolved Hide resolved

Davier added 2 commits March 19, 2021 21:12

Test Changed queries

344ae7b

cargo fmt

b03e07e

bors bot changed the title ~~Reliable change detection~~ [Merged by Bors] - Reliable change detection Mar 19, 2021

bors bot closed this Mar 19, 2021

This was referenced Mar 20, 2021

[Merged by Bors] - make ComponentTicks::set_changed public #1711

Closed

Create manual set_changed and set_added methods #1718

Closed

thread 'main' panicked at 'Entity does not exist' #1743

Closed

cart mentioned this pull request Mar 25, 2021

[Merged by Bors] - Fix sparse insert #1748

Closed

This was referenced Mar 26, 2021

RFC: Create an RFC Process #1662

Closed

Automatic event tracking #1776

Closed

This was referenced Apr 16, 2021

Reliable change detection Leafwing-Studios/leafwing-website#52

Open

Duplicated components are not detected using Added query #1443

Closed

maniwani mentioned this pull request Feb 15, 2022

[Merged by Bors] - Make change lifespan deterministic and update docs #3956

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Merged by Bors] - Reliable change detection #1471

[Merged by Bors] - Reliable change detection #1471

alice-i-cecile commented Feb 18, 2021 •

edited

Loading

alice-i-cecile commented Feb 18, 2021

Davier commented Feb 27, 2021

Davier commented Feb 27, 2021

nebbishhacker commented Feb 28, 2021

Davier commented Mar 1, 2021

alice-i-cecile commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021 •

edited

Loading

nebbishhacker commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021

bjorn3 commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021

nebbishhacker commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021 •

edited

Loading

nebbishhacker commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021 •

edited

Loading

alice-i-cecile commented Mar 2, 2021 •

edited

Loading

Davier commented Mar 2, 2021

alice-i-cecile commented Mar 2, 2021 •

edited

Loading

Davier commented Mar 2, 2021

YohDeadfall commented Mar 2, 2021

rmsc commented Mar 2, 2021

YohDeadfall commented Mar 3, 2021

Davier commented Mar 3, 2021 •

edited

Loading

cart commented Mar 19, 2021

bors bot commented Mar 19, 2021

[Merged by Bors] - Reliable change detection #1471

[Merged by Bors] - Reliable change detection #1471

Conversation

alice-i-cecile commented Feb 18, 2021 • edited Loading

Problem Definition

Criteria

Code Base Navigation

Proposals

Proposal 0: No Change Detection

Proposal 1: Earlier-This-Tick Change Detection

Proposal 2: Two-Tick Change Detection

Proposal 3: Last-Tick Change Detection

Proposal 4: Flag-Doubling Change Detection

[SELECTED] Proposal 5: Generation-Counter Change Detection

Proposal 6: System-Data Change Detection

Proposal 7: Total-Order Change Detection

Tests

Basic Added / Mutated / Changed

At Least Once

At Most Once

Fast Detection

Ambiguous System Ordering Robustness

System Pausing

Addition Causes Mutation

Changes Made By Commands

Benchmarks

Graphs

Conclusions

Implementation Plan

alice-i-cecile commented Feb 18, 2021

Results

Davier commented Feb 27, 2021

Davier commented Feb 27, 2021

nebbishhacker commented Feb 28, 2021

Davier commented Mar 1, 2021

alice-i-cecile commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021 • edited Loading

nebbishhacker commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021

bjorn3 commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021

nebbishhacker commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021 • edited Loading

nebbishhacker commented Mar 1, 2021

YohDeadfall commented Mar 1, 2021 • edited Loading

alice-i-cecile commented Mar 2, 2021 • edited Loading

Davier commented Mar 2, 2021

alice-i-cecile commented Mar 2, 2021 • edited Loading

Davier commented Mar 2, 2021

YohDeadfall commented Mar 2, 2021

rmsc commented Mar 2, 2021

YohDeadfall commented Mar 3, 2021

Davier commented Mar 3, 2021 • edited Loading

cart commented Mar 19, 2021

bors bot commented Mar 19, 2021

alice-i-cecile commented Feb 18, 2021 •

edited

Loading

YohDeadfall commented Mar 1, 2021 •

edited

Loading

YohDeadfall commented Mar 1, 2021 •

edited

Loading

YohDeadfall commented Mar 1, 2021 •

edited

Loading

alice-i-cecile commented Mar 2, 2021 •

edited

Loading

alice-i-cecile commented Mar 2, 2021 •

edited

Loading

Davier commented Mar 3, 2021 •

edited

Loading