Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deflake leave-via-kick/ban faster joins tests #628

Merged
merged 1 commit into from
Mar 16, 2023

Conversation

squahtx
Copy link
Contributor

@squahtx squahtx commented Mar 14, 2023

We need to allow the homeserver under test to finish making requests to
the Complement homeserver before tearing it down, otherwise the
homeserver under test may mark the hostname:port combination as down and
refuse to contact it in subsequent tests.

Signed-off-by: Sean Quah seanq@matrix.org


NB: you can tease out the flakes by forcing Complement to reuse ports whenever possible:

diff --git a/internal/federation/server.go b/internal/federation/server.go
index ed7974d..4c4ccbe 100644
--- a/internal/federation/server.go
+++ b/internal/federation/server.go
@@ -455,7 +455,10 @@ func (s *Server) Listen() (cancel func()) {
        var wg sync.WaitGroup
        wg.Add(1)

-       ln, err := net.Listen("tcp", ":0") //nolint
+       ln, err := net.Listen("tcp", ":63333") //nolint
+       if err != nil {
+               ln, err = net.Listen("tcp", ":63334") //nolint
+       }
        if err != nil {
                s.t.Fatalf("ListenFederationServer: net.Listen failed: %s", err)
        }

We need to allow the homeserver under test to finish making requests to
the Complement homeserver before tearing it down, otherwise the
homeserver under test may mark the hostname:port combination as down and
refuse to contact it in subsequent tests.

Signed-off-by: Sean Quah <seanq@matrix.org>
@squahtx squahtx requested review from a team as code owners March 14, 2023 17:22
Comment on lines +3721 to +3724
// Dirty hack to allow the homeserver under test to finish making requests to the
// Complement homeserver as part of syncing the full state.
psjResult.AwaitStateIdsRequest(t)
time.Sleep(time.Second / 2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incredibly hacky. Alternative suggestions welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top of my head: we could make the homeserver under test take an action in the room that requires knowledge of the full state, and wait for complement's HS to see it. (Is there an example of a federation API that blocks on full state?)

(Personally I can live with the hack though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the difficulty is that the homeserver under test leaves the room independently of, and before the sync process finishes. Any request we might try would be met with a not-in-the-room response even though the sync process may still be running.

@DMRobertson
Copy link
Contributor

otherwise the homeserver under test may mark the hostname:port combination as down and refuse to contact it in subsequent tests.

Should we add some kind of escape hatch for this? E.g. an admin API which could be used to mark a destination as up, that complement could call whenever a test begins? (Not ideal, but we seem to hit this class of flake a lot...)

Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arg, this is one of my tests. I think we did talk about it at the time but clearly I missed this class of flakes/interaction with other tests---sorry.

Comment on lines +3721 to +3724
// Dirty hack to allow the homeserver under test to finish making requests to the
// Complement homeserver as part of syncing the full state.
psjResult.AwaitStateIdsRequest(t)
time.Sleep(time.Second / 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top of my head: we could make the homeserver under test take an action in the room that requires knowledge of the full state, and wait for complement's HS to see it. (Is there an example of a federation API that blocks on full state?)

(Personally I can live with the hack though)

@squahtx
Copy link
Contributor Author

squahtx commented Mar 16, 2023

otherwise the homeserver under test may mark the hostname:port combination as down and refuse to contact it in subsequent tests.

Should we add some kind of escape hatch for this? E.g. an admin API which could be used to mark a destination as up, that complement could call whenever a test begins? (Not ideal, but we seem to hit this class of flake a lot...)

We could do. It's very Synapse and faster joins test-specific though. We only hit this class of flake because we reuse the same Synapse with lots of different Complement instances in the faster joins test suite.

@squahtx squahtx merged commit 6f41799 into main Mar 16, 2023
@squahtx squahtx deleted the squah/faster_room_joins_fix_test_flakes_3 branch March 16, 2023 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants