This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster joins: handle total failure to sync state #13000

Open
richvdh opened this issue Jun 9, 2022 · 5 comments
Labels
A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Member

richvdh commented Jun 9, 2022

Currently, if we try every server in the room and are unable to sync state from any of them, we give up, leaving us with a room stuck in "partial state" state, and any C-S requests for state in that room timing out indefinitely.

It's not entirely clear what we should do in this case:

  • Giving up isn't the right thing to do if there's a temporary network outage
  • Retrying indefinitely is also not the right thing to do if we can reach all homeservers and they all claim they don't have the state we want.

    if attempt == len(destinations) - 1:
        # We have tried every remote server for this event. Give up.
        # TODO(faster_joins) giving up isn't the right thing to do
        #   if there's a temporary network outage. retrying
        #   indefinitely is also not the right thing to do if we can
        #   reach all homeservers and they all claim they don't have
        #   the state we want.
        #   https://github.com/matrix-org/synapse/issues/13000
        logger.error(
            "Failed to get state for %s at %s from %s because %s, "
            "giving up!",
            room_id,
            event,
            destination,
            e,
        )
        raise
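
To make the two cases above concrete, here is a rough, standalone sketch of one way to tell them apart. It is not Synapse code: `Outcome`, `Decision`, `try_all_destinations`, `sync_state` and the `fetch` callback are all hypothetical names, and the assumption is that each attempt can be classified as either a transient failure (for example a timeout) or a definitive "I don't have that state" answer.

```python
import enum
import time
from typing import Callable, Sequence


class Outcome(enum.Enum):
    SUCCESS = "success"
    TRANSIENT = "transient"    # hypothetical: connection error, timeout, etc.
    DEFINITIVE = "definitive"  # hypothetical: server replied but lacks the state


class Decision(enum.Enum):
    DONE = "done"
    RETRY_LATER = "retry_later"
    GIVE_UP = "give_up"


def try_all_destinations(
    destinations: Sequence[str], fetch: Callable[[str], Outcome]
) -> Decision:
    """Try each server once and decide what to do with the combined result."""
    saw_transient = False
    for destination in destinations:
        outcome = fetch(destination)
        if outcome is Outcome.SUCCESS:
            return Decision.DONE
        if outcome is Outcome.TRANSIENT:
            saw_transient = True
    # Only give up when every server answered and none had the state.
    return Decision.RETRY_LATER if saw_transient else Decision.GIVE_UP


def sync_state(
    destinations: Sequence[str],
    fetch: Callable[[str], Outcome],
    max_rounds: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Decision:
    """Repeat full rounds with capped exponential backoff between them."""
    for round_number in range(max_rounds):
        decision = try_all_destinations(destinations, fetch)
        if decision is not Decision.RETRY_LATER:
            return decision
        if round_number < max_rounds - 1:
            sleep(min(max_delay, base_delay * 2 ** round_number))
    # Still only transient failures: leave it to the caller to reschedule.
    return Decision.RETRY_LATER
```

The cap on rounds and the backoff values are arbitrary placeholders. What should happen on `GIVE_UP` is a separate question that this sketch does not answer.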

@richvdh richvdh added this to the Faster joins (further work) milestone Jun 9, 2022
@richvdh richvdh added A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jun 9, 2022
@reivilibre
Contributor

As you say, it's not clear what we want in this case.

Should we eventually boot the user(s) out of the room and shut it down, pretending it never happened? That sounds pretty janky, but perhaps defensible if the UI makes it clear that you're not 'properly joined' whilst the partial join is going on. Is that something we'd want to do?

@richvdh
Member Author

richvdh commented Jul 28, 2022

I kinda think that's what we'll have to do, ultimately, though we'd probably have to figure out a way to get the memo to the clients about the reason we're giving up on the room. To be honest that sounds like a general problem - "we've given up on this room" can happen for other reasons (notably: it getting shut down by an admin) - so this might need spec changes.

@erikjohnston
Member

Can we do an out-of-band leave like we do for rejecting invites? I think that would end up doing roughly the right thing? I'm kinda assuming this situation would be rare enough that we don't need to worry too much about making the UX slick, so long as we end up in a sane state.

@squahtx
Contributor

squahtx commented Sep 30, 2022

@H-Shay
Contributor

H-Shay commented Sep 8, 2023

So I took a stab at this and have a branch where I did an out-of-band leave when the state sync hit the total failure case (and a test for this). However, I then realized that the code I called to process the leave was only defined on the master, so this solution would not work for worker instances. This is as far as I got with it. I've pushed the branch here if that's helpful for anyone.
