Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify which events get returned when requesting initial timeline events #366

Merged
merged 3 commits into from
Nov 6, 2023

Conversation

kegsay
Copy link
Member

@kegsay kegsay commented Nov 6, 2023

As outlined in #365, we were trying to be too smart before by calculating all the ranges for the room and returning them as a slice of from/to positions. This neglected to consider when history visibility was set to world_readable/shared/invited. In those scenarios, if the invite and join were in close proximity in the timeline (which is fairly typical), such that the timeline_limit would encompass both the invite and the join, the calculated timeline would incorrectly omit the events between the invite and the join. This is very bad, because it means the client has no way to fetch those messages using /messages, as the prev_batch token will relate to the invite (timeline[0]).

This PR fixes this by only returning a single range, the last joined range for that user. This ensure the prev_batch token is always correct, so the client can hit /messages with it and the upstream HS can calculate history visibility.

It's worth noting that there isn't much of a tradeoff here. The old behaviour was worse in almost all circumstances. It was only "better" in the rare case of:

  • the room has joined history visibility
  • the timeline limit encompasses multiple join/leave memberships for the connecting user.

In that one case, the proxy would be able to return more timeline events correctly, thus potentially cutting the need for a /messages hit.

It looks like this bug has sat in the codebase since the beginning, and whilst we do have membership transition tests, and we did know we were conservative on history visibility checks, we didn't combine the two in any tests to check that clients can get the data they need. This PR also adds a regression test which explicitly hits /messages with the provided prev_batch token to ensure that clients see the right data.

If we returned multiple distinct ranges, we always assumed that
the history visibility was "joined", so we would never return
events in the invite/shared state. This would be fine if the client
had a way to fetch those events sent before they were joined, but
they did not have a way as the prev_batch token would not be set
correctly. We now only return a single range of events and the prev
batch for /that/ range only, and defer to the upstream HS for
history visibility calculations.

Add end-to-end test to assert this new behaviour works.
@kegsay kegsay added the bug Something isn't working label Nov 6, 2023
state/storage.go Outdated Show resolved Hide resolved
state/storage.go Outdated Show resolved Hide resolved
Comment on lines +879 to +880
// we only care about the LAST nid range, otherwise we can end up with gaps being returned in the
// timeline. See https://github.com/matrix-org/sliding-sync/issues/365
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this wording is going to confuse my future self. (Ditto for the other comment citing this issue). It makes it sound like "gaps in the timeline" is fundamentally bad---but we do want there to be events not sent to users in certain situations (e.g. joined history vis).

"Gaps in the timeline" means presumably means(?): the proxy returns a timeline [A, ...., B, C, ... D] such that

  1. B is not a direct ancestor of C,
  2. there is some event between B and C that the user has permission to see, and
  3. there is nothing in the final sync response which tells clients about points (1) and (2)?

Maybe we can say "otherwise we can end up with incorrect gaps in the timeline"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dunno, for me this seems clear. A "gap" fundamentally is in between two things. In this case, there is a gap between two timeline events e.g:

[A,B,C,D,E] true timeline
[A,B,D,E]   calculated timeline with a gap

Hence, "gaps in the timeline" are fundamentally bad, as it means we have omitted an event.

Technically, we could be pedantic and say that a "gap" is also present when you have:

[A,B,Leave, .... , Join, D, E]

and that this is a "correct" gap (assuming joined history visibility), but this has a ton of caveats and ultimately isn't something the proxy should ever do anymore as of this PR. The moment we start differentiating between correct (which won't be producible by the proxy) and incorrect gaps, it just adds more ultimately unnecessary lexicon imo.

This is further supported by how end-users describe it in bug reports:

  • "I have gap in my timeline"
  • "There are gaps in my timeline"
  • "Sent a message from my matrix.org account. Got notification on EIX, opened it from there. When the timeline loaded, there is a gap in the timeline"

Let's use the same term to mean the same thing, a bad thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also doesn't help that we speak of "gappy" v2 syncs etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants