Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Test that /messages works on remote homeserver and can be backfilled properly after many batches (MSC2716) #214

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Oct 19, 2021

Test that /messages works on remote homeserver and can be backfilled properly after many batches

Part of MSC2716: matrix-org/matrix-spec-proposals#2716

Related to #212

Used while developing matrix-org/synapse#11114

Dev notes

	logrus.WithFields(logrus.Fields{
		"stack": string(debug.Stack()),
	}).Error("destroy stack")
var firstEvent gjson.Result
var lastEvent gjson.Result
if len(eventsJson) > 0 {
	firstEvent = eventsJson[0]
	lastEvent = eventsJson[len(eventsJson)-1]
}
logrus.WithFields(logrus.Fields{
	"events":     len(eventsJson),
	"firstEvent": fmt.Sprintf("%s (%s)", firstEvent.Get("event_id"), firstEvent.Get("type")),
	"lastEvent":  fmt.Sprintf("%s (%s)", lastEvent.Get("event_id"), lastEvent.Get("type")),
	"start":      gjson.GetBytes(messsageResBody, "start").Str,
	"end":        gjson.GetBytes(messsageResBody, "end").Str,
}).Error("Number of events from `/messages`")

@MadLittleMods MadLittleMods marked this pull request as draft October 19, 2021 07:00
@MadLittleMods MadLittleMods changed the title Draft: Test that /messages works on remote homeserver and can be backfilled properly after many batches Draft: Test that /messages works on remote homeserver and can be backfilled properly after many batches (MSC2716) Oct 20, 2021
tests/msc2716_test.go Outdated Show resolved Hide resolved
tests/msc2716_test.go Outdated Show resolved Hide resolved
tests/msc2716_test.go Outdated Show resolved Hide resolved
}

for {
if time.Since(start) > 200*c.SyncUntilTimeout {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout is currently very large to accommodate the long ~20s /messages durations. We also have to make that request 11 times during the test which adds up very fast.

Synapse really has to chug for those requests 👹 and ideally wouldn't have to modify this at all.

I would need to look into optimizing Synapse to make this fast which we should probably do anyway as this is painfully slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently set to a whopping 1000s by default

SyncUntilTimeout: 5 * time.Second,
which seems excessive even given this.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kegsay Sorry this wasn't clear as undrafting indicates "marked this pull request as ready for review" but I didn't assign you this one yet specifically because of this problem.

The test itself is good to go (timeout can be switched to normal and numBatches could be switched back to 11) and shipped but want to make it actually acceptable time-wise to run against Synapse before merging.

I used numBatches=2 during testing because it's much faster to see results while developing.

Thanks for the review pass though and I'll fix up these other spots ⏩

Copy link
Contributor Author

@MadLittleMods MadLittleMods May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MadLittleMods what is the status of this PR? It's still marked as a Draft.

#214 (comment)

Status is still the same as the last update in this thread. It's too slow on Synapse for me to be comfortable merging it yet.

Copy link
Contributor Author

@MadLittleMods MadLittleMods May 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of optimizing Synapse to make this test viable to run time-wise, I'm a bit blocked on a race condition in some recent code, matrix-org/synapse#12394 (comment) -> matrix-org/synapse#12646

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of optimizing Synapse to make this test viable to run time-wise, I'm a bit blocked on a race condition in some recent code, matrix-org/synapse#12394 (comment) -> matrix-org/synapse#12646

Now matrix-org/synapse#12988 I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MadLittleMods MadLittleMods changed the title Draft: Test that /messages works on remote homeserver and can be backfilled properly after many batches (MSC2716) Test that /messages works on remote homeserver and can be backfilled properly after many batches (MSC2716) Nov 3, 2021
@MadLittleMods MadLittleMods marked this pull request as ready for review November 3, 2021 07:53
internal/client/client.go Outdated Show resolved Hide resolved
// could pick for other homeserver implementations so I just did 10+1 to
// make sure it also worked in that case.
//numBatches := 11
numBatches := 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely comment but then you seem to use 2. Why?

Copy link
Contributor Author

@MadLittleMods MadLittleMods Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See:

I used numBatches=2 during testing because it's much faster to see results while developing.

#214 (comment)

Running this test against Synapse is extremely slow atm, #214 (comment)


TODO: Reminder to revert this to 11 before merge

tests/msc2716_test.go Outdated Show resolved Hide resolved
tests/msc2716_test.go Outdated Show resolved Hide resolved
}

// Make sure we see the events at the very start of the message history
expectedEventIDs = append(expectedEventIDs, eventIDsBefore...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is new -> old?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order doesn't matter because it's just a list of expected ID's, expectedEventIDs

We also just use it in check off function (paginateUntilMessageCheckOff) which doesn't care about the order.


There is a similar variable in another test where order does matter but we call it expectedEventIDOrder

}

for {
if time.Since(start) > 200*c.SyncUntilTimeout {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently set to a whopping 1000s by default

SyncUntilTimeout: 5 * time.Second,
which seems excessive even given this.

tests/msc2716_test.go Outdated Show resolved Hide resolved
actualEventIDList = append(actualEventIDList, eventID)

if _, keyExists := workingExpectedEventIDMap[eventID]; keyExists {
delete(workingExpectedEventIDMap, eventID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this function doesn't actually check the events are in the correct order, just that they exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, just a checkoff function

if !keyRes.Exists() {
t.Fatalf("missing key '%s'", wantKey)
}
if !keyRes.IsArray() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use JSONArrayEach?

func JSONArrayEach(wantKey string, fn func(gjson.Result) error) JSON {
It does these checks for you.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I think it was because we can't break early with JSONArrayEach (can only break early if you throw an error which fails the test). JSONArrayEach could be refactored to have the item function return keepGoing, err but rather do that in another PR since it touches so many other tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the early return really a deal breaker? I don't think it'll affect runtime performance that much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to re-use JSONArrayEach. It's not so bad ⏩

@MadLittleMods MadLittleMods changed the title Test that /messages works on remote homeserver and can be backfilled properly after many batches (MSC2716) Draft: Test that /messages works on remote homeserver and can be backfilled properly after many batches (MSC2716) Nov 10, 2021
@MadLittleMods MadLittleMods marked this pull request as draft November 11, 2021 01:12
tests/msc2716_test.go Outdated Show resolved Hide resolved
…till-works-with-many-batches

Conflicts:
	tests/msc2716_test.go
MadLittleMods added a commit that referenced this pull request Jan 13, 2022
@kegsay
Copy link
Member

kegsay commented May 11, 2022

@MadLittleMods what is the status of this PR? It's still marked as a Draft.

tests/msc2716_test.go Outdated Show resolved Hide resolved
MadLittleMods added a commit that referenced this pull request May 12, 2022
Also includes `paginateUntilMessageCheckOff` from
#214
MadLittleMods added a commit that referenced this pull request May 24, 2022
Synapse changes: matrix-org/synapse#12718

Part of MSC2716: matrix-org/matrix-spec-proposals#2716

We're testing to make sure historical messages show up for a remote federated homeserver even when the homeserver is missing the part of the timeline where the marker events were sent and it paginates before they occurred to see if the history is available. Making sure the homeserver processes all of the markers from the current state instead of just when it sees them in the timeline.

Sending marker events as state now so they are always able to be seen by homeservers (not lost in some timeline gap).

Marker events should be sent with a unique `state_key` so that they can all resolve in the current state to easily be discovered.

 - If we re-use the same `state_key` (like `""`), then we would have to fetch previous snapshots of state up through time to find all of the marker events. This way we can avoid all of that.
 - Also avoids state resolution conflicts where only one of the marker events win

As a homeserver, when we see new marker state, we know there is new history imported somewhere back in time and should process it to fetch the insertion event where the historical messages are and set it as an insertion extremity. This way we know where to backfill more messages when someone asks for scrollback.

Also includes `paginateUntilMessageCheckOff` from
#214
…ll-works-with-many-batches

Conflicts:
	tests/msc2716_test.go
@kegsay kegsay added the stale This issue or PR is old and may be closed soon label Sep 20, 2022
@kegsay
Copy link
Member

kegsay commented Sep 20, 2022

Marked as stale; this PR will be closed in 7 days if there isn't any further progress. Please re-open when you get time. Long-lived branches are fine, but long-lived PRs like this always being in Draft is not fine.

MadLittleMods added a commit to matrix-org/synapse that referenced this pull request Sep 27, 2022
Fix #13856
Fix #13865

> Discovered while trying to make Synapse fast enough for [this MSC2716 test for importing many batches](matrix-org/complement#214 (comment)). As an example, disabling the `have_seen_event` cache saves 10 seconds for each `/messages` request in that MSC2716 Complement test because we're not making as many federation requests for `/state` (speeding up `have_seen_event` itself is related to #13625) 
> 
> But this will also make `/messages` faster in general so we can include it in the [faster `/messages` milestone](https://github.com/matrix-org/synapse/milestone/11).
> 
> *-- #13856


### The problem

`_invalidate_caches_for_event` doesn't run in monolith mode which means we never even tried to clear the `have_seen_event` and other caches. And even in worker mode, it only runs on the workers, not the master (AFAICT).

Additionally there was bug with the key being wrong so `_invalidate_caches_for_event` never invalidates the `have_seen_event` cache even when it does run.

Because we were using the `@cachedList` wrong, it was putting items in the cache under keys like `((room_id, event_id),)` with a `set` in a `set` (ex. `(('!TnCIJPKzdQdUlIyXdQ:test', '$Iu0eqEBN7qcyF1S9B3oNB3I91v2o5YOgRNPwi_78s-k'),)`) and we we're trying to invalidate with just `(room_id, event_id)` which did nothing.
Comment on lines +1370 to +1374
// Since this will throw an error if they key does not exist,
// do this at the end of the loop. It's a valid scenario to be at the end
// of the room and have no more to paginate so we want the `return` when
// we've found all of the expected events to have a chance to run first.
messageResEnd = client.GetJSONFieldStr(t, messsageResBody, "end")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out to #489

MadLittleMods added a commit that referenced this pull request Oct 4, 2022
…#489)

As spec'd:

> If no further events are available (either because we have reached the start of the timeline, or because the user does not have permission to see any more events), this property is omitted from the response.
>
> *-- https://spec.matrix.org/v1.1/client-server-api/#get_matrixclientv3roomsroomidmessages*

Split out from #214

Previously, it would throw the following error and potentially we found all of the events within that last response to satisfy everything,
```
JSONFieldStr: key 'end' missing
```
@kegsay kegsay closed this Nov 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale This issue or PR is old and may be closed soon
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants