WIP: failing proptest for willow sync #2695

matheus23 · 2024-09-04T16:46:17Z

Description

This is a proptest that runs several "rounds" of a protocol. Each "round", one side does some writes. In between rounds and at the end, the side that wrote calls sync_once with the other side.

I've seen the test fail in four ways:

Most commonly, "closed by peer: 0". This could be a timing issue and may or may not be an actual bug (perhaps it can simply be ignored).
An error from RecvStream, though this is rare.
Exhausted thread limit (4096 threads). I think that's just due to creating a lot of nodes with spawn_node: each node creates lots of threads for the blob, docs and willow stores.
Finally, the one that actually worries me: "states out of sync" (the test's assertion fails).

Here's some output of a run that fails with the last kind of error (output cleaned up somewhat):

Failing test output

thread 'test_get_many_weird_result' panicked at iroh/tests/spaces.rs:69:1:
Test failed: states out of sync:
{
    (Alfie, "alpha"): "gamma",
    (Alfie, "beta"): "alpha",
    (Alfie, "gamma"): "alpha",
    (Betty, "alpha"): "alpha",
    (Betty, "beta"): "beta",
    (Betty, "gamma"): "alpha",
}
 !=
{
    (Alfie, "alpha"): "alpha",
    (Alfie, "beta"): "alpha",
    (Alfie, "gamma"): "alpha",
    (Betty, "alpha"): "alpha",
    (Betty, "beta"): "beta",
    (Betty, "gamma"): "alpha",
}.
minimal failing input: input = _TestGetManyWeirdResultArgs {
    rounds: [
        (
            Alfie,
            [
                Write("beta", "beta"),
                Write("beta", "beta"),
                Write("alpha", "alpha"),
                Write("alpha", "alpha"),
                Write("beta", "gamma"),
                Write("beta", "beta"),
                Write("gamma", "gamma"),
                Write("beta", "beta"),
                Write("beta", "beta"),
                Write("gamma", "beta"),
                Write("gamma", "beta"),
                Write("gamma", "gamma"),
                Write("gamma", "beta"),
                Write("beta", "beta"),
            ],
        ),
        (
            Alfie,
            [
                Write("alpha", "beta"),
                Write("alpha", "beta"),
                Write("gamma", "alpha"),
                Write("gamma", "alpha"),
                Write("alpha", "beta"),
                Write("alpha", "beta"),
                Write("alpha", "alpha"),
                Write("beta", "gamma"),
                Write("alpha", "gamma"),
                Write("beta", "gamma"),
                Write("alpha", "alpha"),
                Write("gamma", "alpha"),
                Write("gamma", "alpha"),
                Write("beta", "beta"),
                Write("beta", "alpha"),
                Write("gamma", "gamma"),
                Write("alpha", "alpha"),
                Write("alpha", "alpha"),
                Write("alpha", "gamma"),
            ],
        ),
        (
            Betty,
            [
                Write("gamma", "beta"),
                Write("gamma", "beta"),
                Write("alpha", "gamma"),
                Write("alpha", "gamma"),
                Write("beta", "gamma"),
                Write("alpha", "beta"),
                Write("gamma", "gamma"),
                Write("beta", "beta"),
                Write("beta", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "beta"),
                Write("alpha", "beta"),
                Write("gamma", "alpha"),
                Write("gamma", "beta"),
                Write("gamma", "alpha"),
                Write("gamma", "beta"),
            ],
        ),
        (
            Alfie,
            [
                Write("alpha", "beta"),
            ],
        ),
        (
            Alfie,
            [
                Write("gamma", "alpha"),
                Write("gamma", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "alpha"),
                Write("alpha", "gamma"),
                Write("beta", "alpha"),
                Write("gamma", "beta"),
                Write("alpha", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "gamma"),
            ],
        ),
        (
            Betty,
            [
                Write("alpha", "beta"),
                Write("alpha", "gamma"),
                Write("alpha", "beta"),
                Write("alpha", "beta"),
                Write("alpha", "alpha"),
                Write("alpha", "alpha"),
                Write("beta", "beta"),
                Write("alpha", "gamma"),
                Write("alpha", "alpha"),
                Write("beta", "alpha"),
                Write("alpha", "beta"),
            ],
        ),
        (
            Betty,
            [],
        ),
        (
            Alfie,
            [
                Write("gamma", "alpha"),
            ],
        ),
        (
            Betty,
            [
                Write("alpha", "alpha"),
                Write("alpha", "beta"),
                Write("beta", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "beta"),
                Write("gamma", "gamma"),
                Write("gamma", "gamma"),
                Write("alpha", "beta"),
                Write("gamma", "gamma"),
                Write("beta", "beta"),
            ],
        ),
        (
            Betty,
            [
                Write("alpha", "alpha"),
                Write("gamma", "beta"),
                Write("gamma", "alpha"),
            ],
        ),
        (
            Alfie,
            [
                Write("alpha", "gamma"),
            ],
        ),
    ],
}
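
As a reading aid for the output above, here is a minimal sketch of the model the assertion appears to compare against, assuming each peer writes only under its own name and the last write to a key wins. The names `Peer`, `Operation`, `State`, `expected_state` and `differing_keys` are illustrative assumptions and are not taken from iroh/tests/spaces.rs. Replaying the rounds above under that assumption yields the first map; the second map differs only in (Alfie, "alpha"), the key written in Alfie's final round.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the test's types; the real test may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Peer {
    Alfie,
    Betty,
}

#[derive(Debug, Clone)]
enum Operation {
    Write(&'static str, &'static str),
}

// Maps (author, key) to the latest value written by that author.
type State = BTreeMap<(Peer, &'static str), &'static str>;

/// Replays all rounds in order, assuming last-write-wins per (peer, key).
fn expected_state(rounds: &[(Peer, Vec<Operation>)]) -> State {
    let mut state = State::new();
    for (peer, ops) in rounds {
        for Operation::Write(key, value) in ops {
            state.insert((*peer, *key), *value);
        }
    }
    state
}

/// Returns every (peer, key) whose value differs between the two states,
/// including keys that are present on one side only.
fn differing_keys(a: &State, b: &State) -> Vec<(Peer, &'static str)> {
    a.keys()
        .chain(b.keys())
        .filter(|k| a.get(*k) != b.get(*k))
        .copied()
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}
```

Feeding each node's observed state into `differing_keys` narrows a "states out of sync" failure down to the entries that were not transferred, which is easier to scan than the two full maps.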

Unfortunately, there's a fair amount of randomness involved in willow, so the failure is hard to reproduce consistently and, as a result, hard for proptest to shrink.
Dialing down the complexity of the tests (fewer rounds, smaller rounds) also seems to make this issue much harder to reproduce.

I started testing this because I was often seeing weird results for get_many in my port of tauri-todos for willow, where suddenly some entries were duplicated and some were missing entirely. It actually happens much more often in the tauri version, but the setup is also slightly different (real-time, continuous sync vs. clear rounds that wait for syncs to finish).

Breaking Changes

Notes & open questions

TODO:

See if having "concurrent" writes helps reproduce more often (i.e. have both sides make changes between syncs)
Find out if there's a way I can make sync/the PAI stuff more deterministic?

Change checklist

Self-review.
Documentation updates following the style guide, if relevant.
Tests if relevant.
All breaking changes documented.

github-actions · 2024-09-04T16:49:26Z

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2695/docs/iroh/

Last updated: 2024-09-04T16:46:17Z

Frando · 2024-09-16T14:06:24Z

Some updates:

The "states out of sync" failure should be fixed by a) refactor: remove count field from ReconciliationAnnounceEntries #2724 and b) changing the test so that both nodes add an intent. Otherwise we currently miss an event to wait for a finished sync on the Betty side. Alternatively, we could expose an event from the peer manager whenever a sync session terminates and use that in the proptest (we should likely do that in any case).
The peer manager has an issue where connections closing at the same time as new incoming connections arrive leads to connection closes, which is the "closed by peer: 0" error. This is fixed by refactor(iroh-willow): refactor peer manager so that proptest passes reliably #2727 (which also changes the proptest to work in both directions simultaneously).
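
The idea of waiting on a "session terminated" event before comparing states could look roughly like the sketch below. It is not iroh-willow API: `SessionEvent`, `Peer` and `wait_for_both` are hypothetical names, and a standard-library channel stands in for whatever event stream the peer manager might expose.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Peer {
    Alfie,
    Betty,
}

// Hypothetical event emitted once per terminated sync session.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SessionEvent {
    Finished(Peer),
}

/// Blocks until both peers have reported a finished session, or until the
/// timeout elapses. Duplicate events from the same peer are ignored.
fn wait_for_both(events: &Receiver<SessionEvent>, timeout: Duration) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    let mut finished = HashSet::new();
    while finished.len() < 2 {
        // Time left until the overall deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match events.recv_timeout(remaining) {
            Ok(SessionEvent::Finished(peer)) => {
                finished.insert(peer);
            }
            Err(RecvTimeoutError::Timeout) => {
                return Err(format!("timed out; finished so far: {finished:?}"));
            }
            Err(RecvTimeoutError::Disconnected) => {
                return Err("event channel closed".to_string());
            }
        }
    }
    Ok(())
}
```

In a test, the state comparison would only run after `wait_for_both` returns `Ok(())`, so that neither side is read while its session is still in flight.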

matheus23 · 2024-09-23T15:28:25Z

Closing this in favor of #2727

matheus23 · 2024-09-23T15:32:45Z

The "states out of sync" failure should be fixed by a) #2724 and b) changing the test so that both nodes add an intent. otherwise we currently miss an event to wait for a finished sync on the betty side. alternatively expose some event from the peer manager whenever a sync session terminates, and use that in the proptest (we should do that in any case likely)

Hmm. #2724 is still a mystery to me. On the surface it seems very unrelated. Why should some additional info during sync affect the final state of the store?
Well, I'll dig into that. Also curious to see if that helps with the problems I was seeing with the tauri-todos example.

…reliably (#2727)

## Description

Fixes #2695

* Refactor peer manager to really keep track of all connections, the previous logic of a single peer state was flawed for simultaneous accepts while closing previous connections.
* Better debuggability of reconciler
* Add proptest from #2695 and refactor to run in both directions simultaneously.

## Breaking Changes

## Notes & open questions

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.

---------

Co-authored-by: Philipp Krüger <philipp.krueger1@gmail.com>

Frando and others added 6 commits August 30, 2024 00:54

feat: event subscriptions

f8bd54a

feat: RPC for subscriptions

9274fd1

fix: fixes to subscriptions and add test

d4e99c3

fix test

68f494b

chore: clippy

a2d8316

failing proptest for willow sync

bd3971b

matheus23 self-assigned this Sep 4, 2024

Frando force-pushed the Frando/willow-event-subscriptions branch from a2d8316 to 46c386c Compare September 9, 2024 10:07

Frando mentioned this pull request Sep 10, 2024

refactor(iroh-willow): refactor peer manager so that proptest passes reliably #2727

Merged

4 tasks

matheus23 closed this Sep 23, 2024
