Modify which events get returned when requesting initial timeline events #366

kegsay · 2023-11-06T12:04:06Z

As outlined in #365, we were trying to be too smart before by calculating all the ranges for the room and returning them as a slice of from/to positions. This neglected to consider when history visibility was set to world_readable/shared/invited. In those scenarios, if the invite and join were in close proximity in the timeline (which is fairly typical), such that the timeline_limit would encompass both the invite and the join, the calculated timeline would incorrectly omit the events between the invite and the join. This is very bad, because it means the client has no way to fetch those messages using /messages, as the prev_batch token will relate to the invite (timeline[0]).

This PR fixes this by only returning a single range, the last joined range for that user. This ensure the prev_batch token is always correct, so the client can hit /messages with it and the upstream HS can calculate history visibility.

It's worth noting that there isn't much of a tradeoff here. The old behaviour was worse in almost all circumstances. It was only "better" in the rare case of:

the room has joined history visibility
the timeline limit encompasses multiple join/leave memberships for the connecting user.

In that one case, the proxy would be able to return more timeline events correctly, thus potentially cutting the need for a /messages hit.

It looks like this bug has sat in the codebase since the beginning, and whilst we do have membership transition tests, and we did know we were conservative on history visibility checks, we didn't combine the two in any tests to check that clients can get the data they need. This PR also adds a regression test which explicitly hits /messages with the provided prev_batch token to ensure that clients see the right data.

If we returned multiple distinct ranges, we always assumed that the history visibility was "joined", so we would never return events in the invite/shared state. This would be fine if the client had a way to fetch those events sent before they were joined, but they did not have a way as the prev_batch token would not be set correctly. We now only return a single range of events and the prev batch for /that/ range only, and defer to the upstream HS for history visibility calculations. Add end-to-end test to assert this new behaviour works.

state/storage.go

tests-e2e/timeline_test.go

DMRobertson · 2023-11-06T12:29:05Z

state/storage.go

+		// we only care about the LAST nid range, otherwise we can end up with gaps being returned in the
+		// timeline. See https://github.com/matrix-org/sliding-sync/issues/365


I think this wording is going to confuse my future self. (Ditto for the other comment citing this issue). It makes it sound like "gaps in the timeline" is fundamentally bad---but we do want there to be events not sent to users in certain situations (e.g. joined history vis).

"Gaps in the timeline" means presumably means(?): the proxy returns a timeline [A, ...., B, C, ... D] such that

B is not a direct ancestor of C,

there is some event between B and C that the user has permission to see, and

there is nothing in the final sync response which tells clients about points (1) and (2)?

Maybe we can say "otherwise we can end up with incorrect gaps in the timeline"?

I dunno, for me this seems clear. A "gap" fundamentally is in between two things. In this case, there is a gap between two timeline events e.g:

[A,B,C,D,E] true timeline [A,B,D,E] calculated timeline with a gap

Hence, "gaps in the timeline" are fundamentally bad, as it means we have omitted an event.

Technically, we could be pedantic and say that a "gap" is also present when you have:

[A,B,Leave, .... , Join, D, E]

and that this is a "correct" gap (assuming joined history visibility), but this has a ton of caveats and ultimately isn't something the proxy should ever do anymore as of this PR. The moment we start differentiating between correct (which won't be producible by the proxy) and incorrect gaps, it just adds more ultimately unnecessary lexicon imo.

This is further supported by how end-users describe it in bug reports:

"I have gap in my timeline"

"There are gaps in my timeline"

"Sent a message from my matrix.org account. Got notification on EIX, opened it from there. When the timeline loaded, there is a gap in the timeline"

Let's use the same term to mean the same thing, a bad thing.

it also doesn't help that we speak of "gappy" v2 syncs etc.

kegsay added 2 commits November 6, 2023 10:44

Add regression test for #365

89c4198

kegsay added the bug Something isn't working label Nov 6, 2023

kegsay requested a review from DMRobertson November 6, 2023 12:04

DMRobertson reviewed Nov 6, 2023

View reviewed changes

state/storage.go Outdated Show resolved Hide resolved

DMRobertson reviewed Nov 6, 2023

View reviewed changes

state/storage.go Outdated Show resolved Hide resolved

DMRobertson reviewed Nov 6, 2023

View reviewed changes

tests-e2e/timeline_test.go Outdated Show resolved Hide resolved

DMRobertson reviewed Nov 6, 2023

View reviewed changes

Review comments

6f2a00a

DMRobertson approved these changes Nov 6, 2023

View reviewed changes

kegsay merged commit 421a572 into main Nov 6, 2023
7 checks passed

kegsay mentioned this pull request Jan 5, 2024

EW: scrollback from before you joined a room isn't visible on SS #8

Closed

This was referenced Mar 5, 2024

Initial sync uses a ton of memory on Synapse's synchrotron #258

Open

Synapse doesn't backfill missing events when rejoining a room later element-hq/synapse#15717

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify which events get returned when requesting initial timeline events #366

Modify which events get returned when requesting initial timeline events #366

kegsay commented Nov 6, 2023

DMRobertson Nov 6, 2023

kegsay Nov 6, 2023

DMRobertson Nov 6, 2023

		// we only care about the LAST nid range, otherwise we can end up with gaps being returned in the
		// timeline. See https://github.com/matrix-org/sliding-sync/issues/365

Modify which events get returned when requesting initial timeline events #366

Modify which events get returned when requesting initial timeline events #366

Conversation

kegsay commented Nov 6, 2023

DMRobertson Nov 6, 2023

Choose a reason for hiding this comment

kegsay Nov 6, 2023

Choose a reason for hiding this comment

DMRobertson Nov 6, 2023

Choose a reason for hiding this comment