Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: failing proptest for willow sync #2695

Closed

Conversation

matheus23
Copy link
Contributor

@matheus23 matheus23 commented Sep 4, 2024

Description

This is a proptest that runs several "rounds" of a protocol. Each "round", one side does some writes. In between rounds and at the end, the side that wrote calls sync_once with the other side.

I've seen the test fail in four ways:

  • Most commonly "closed by peer: 0". This could be a timing thing and may or may not be an actual issue (perhaps just something to ignore).
  • Some error from RecvStream, but rarely that
  • Exhausted thread limit (4096 threads). I think that's just due to creating a lot of nodes with spawn_node. Lots of threads created with the blob, docs and willow store.
  • Finally the one that actually worries me: "states out of sync" (the test's assertion fails)

Here's some output of a run that fails with the last kind of error (output cleaned up somewhat):

Failing test output
thread 'test_get_many_weird_result' panicked at iroh/tests/spaces.rs:69:1:
Test failed: states out of sync:
{
    (Alfie, "alpha"): "gamma",
    (Alfie, "beta"): "alpha",
    (Alfie, "gamma"): "alpha",
    (Betty, "alpha"): "alpha",
    (Betty, "beta"): "beta",
    (Betty, "gamma"): "alpha",
}
 !=
{
    (Alfie, "alpha"): "alpha",
    (Alfie, "beta"): "alpha",
    (Alfie, "gamma"): "alpha",
    (Betty, "alpha"): "alpha",
    (Betty, "beta"): "beta",
    (Betty, "gamma"): "alpha",
}.
minimal failing input: input = _TestGetManyWeirdResultArgs {
    rounds: [
        (
            Alfie,
            [
                Write("beta", "beta"),
                Write("beta", "beta"),
                Write("alpha", "alpha"),
                Write("alpha", "alpha"),
                Write("beta", "gamma"),
                Write("beta", "beta"),
                Write("gamma", "gamma"),
                Write("beta", "beta"),
                Write("beta", "beta"),
                Write("gamma", "beta"),
                Write("gamma", "beta"),
                Write("gamma", "gamma"),
                Write("gamma", "beta"),
                Write("beta", "beta"),
            ],
        ),
        (
            Alfie,
            [
                Write("alpha", "beta"),
                Write("alpha", "beta"),
                Write("gamma", "alpha"),
                Write("gamma", "alpha"),
                Write("alpha", "beta"),
                Write("alpha", "beta"),
                Write("alpha", "alpha"),
                Write("beta", "gamma"),
                Write("alpha", "gamma"),
                Write("beta", "gamma"),
                Write("alpha", "alpha"),
                Write("gamma", "alpha"),
                Write("gamma", "alpha"),
                Write("beta", "beta"),
                Write("beta", "alpha"),
                Write("gamma", "gamma"),
                Write("alpha", "alpha"),
                Write("alpha", "alpha"),
                Write("alpha", "gamma"),
            ],
        ),
        (
            Betty,
            [
                Write("gamma", "beta"),
                Write("gamma", "beta"),
                Write("alpha", "gamma"),
                Write("alpha", "gamma"),
                Write("beta", "gamma"),
                Write("alpha", "beta"),
                Write("gamma", "gamma"),
                Write("beta", "beta"),
                Write("beta", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "beta"),
                Write("alpha", "beta"),
                Write("gamma", "alpha"),
                Write("gamma", "beta"),
                Write("gamma", "alpha"),
                Write("gamma", "beta"),
            ],
        ),
        (
            Alfie,
            [
                Write("alpha", "beta"),
            ],
        ),
        (
            Alfie,
            [
                Write("gamma", "alpha"),
                Write("gamma", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "alpha"),
                Write("alpha", "gamma"),
                Write("beta", "alpha"),
                Write("gamma", "beta"),
                Write("alpha", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "gamma"),
            ],
        ),
        (
            Betty,
            [
                Write("alpha", "beta"),
                Write("alpha", "gamma"),
                Write("alpha", "beta"),
                Write("alpha", "beta"),
                Write("alpha", "alpha"),
                Write("alpha", "alpha"),
                Write("beta", "beta"),
                Write("alpha", "gamma"),
                Write("alpha", "alpha"),
                Write("beta", "alpha"),
                Write("alpha", "beta"),
            ],
        ),
        (
            Betty,
            [],
        ),
        (
            Alfie,
            [
                Write("gamma", "alpha"),
            ],
        ),
        (
            Betty,
            [
                Write("alpha", "alpha"),
                Write("alpha", "beta"),
                Write("beta", "alpha"),
                Write("beta", "alpha"),
                Write("gamma", "beta"),
                Write("gamma", "gamma"),
                Write("gamma", "gamma"),
                Write("alpha", "beta"),
                Write("gamma", "gamma"),
                Write("beta", "beta"),
            ],
        ),
        (
            Betty,
            [
                Write("alpha", "alpha"),
                Write("gamma", "beta"),
                Write("gamma", "alpha"),
            ],
        ),
        (
            Alfie,
            [
                Write("alpha", "gamma"),
            ],
        ),
    ],
}

Unfortunately, there's a bunch of randomness involved in willow, so it's hard to reproduce consistently, and also hard to shrink for proptest as a result of that.
Dialing down the complexity of the tests (fewer rounds, smaller rounds), also seems to make it much harder to reproduce this issue.


I started testing this, because I was often seeing weird results for get_many in my port of tauri-todos for willow, where suddenly some entries were duplicated and some entries missing entirely. It actually happens much more often in the tauri version, but the setup is also slightly different (real time & continuous sync vs. clear rounds with waiting for finished syncs).

Breaking Changes

Notes & open questions

TODO:

  • See if having "concurrent" writes helps reproduce more often (i.e. have both side make changes between syncs)
  • Find out if there's a way I can make sync/the PAI stuff more deterministic?

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@matheus23 matheus23 self-assigned this Sep 4, 2024
Copy link

github-actions bot commented Sep 4, 2024

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2695/docs/iroh/

Last updated: 2024-09-04T16:46:17Z

@Frando
Copy link
Member

Frando commented Sep 16, 2024

Some updates:

@matheus23
Copy link
Contributor Author

Closing this in favor of #2727

@matheus23 matheus23 closed this Sep 23, 2024
@matheus23
Copy link
Contributor Author

The "states out of sync" failure should be fixed by a) #2724 and b) changing the test so that both nodes add an intent. otherwise we currently miss an event to wait for a finished sync on the betty side. alternatively expose some event from the peer manager whenever a sync session terminates, and use that in the proptest (we should do that in any case likely)

Hmm. #2724 is still a mystery to me. On the surface it seems very unrelated. Why should some additional info during sync affect the final state of the store?
Well, I'll dig into that. Also curious to see if that helps with the problems I was seeing with the tauri-todos example.

matheus23 added a commit that referenced this pull request Sep 24, 2024
…reliably (#2727)

## Description

Fixes #2695 

* Refactor peer manager to really keep track of all connections, the
previous logic of a single peer state was flawed for simultaneous
accepts while closing previous connections.
* Better debuggability of reconciler
* Add proptest from #2695 and refactor to run in both directions
simultaneously.

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.

---------

Co-authored-by: Philipp Krüger <philipp.krueger1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants