Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Strip optional keys from /sync, take 3. #10479

Closed
5 tasks
erikjohnston opened this issue Jul 26, 2021 · 14 comments
Closed
5 tasks

Strip optional keys from /sync, take 3. #10479

erikjohnston opened this issue Jul 26, 2021 · 14 comments
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@erikjohnston
Copy link
Member

We want to strip out optional keys by default for /sync, see #9919. This caused issues, so got reverted due to breaking clients #9940, and then got re-introduced again in v1.38.0 #9941 / #10214. It was discovered that this broke Element Android, and so has been disabled again in v1.38.1 #10456 / #10457.

We should reintroduce this change once the Element Android release has propagated sufficiently.

To try and de-risk re-enabling the change we should come up with a plan for how to roll the change out more slowly. The current suggestion is:

  • Add a config flag that defaults to the current behaviour (including the key in responses)
  • Enable it on matrix.org for a while.
  • Flip the flag's default to instead omit the key
  • Let that simmer on develop, then RC, then in a release.
  • Remove the flag.
@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Jul 26, 2021
@erikjohnston
Copy link
Member Author

Personally, I think if we release Synapse with the config option enabled by default and then discover an issue, we should not tell people to change their config but instead roll out another hotfix, for two reasons:

  1. We should release a hotfix anyway, as that (IMO) is going to reach more people than us trying to tell people to flip some config; and
  2. It means if we try and re-enable the flag by default a lot of servers will have explicitly overridden the default behaviour and so will not pick up the change in default, leading to fewer servers with the change enabled, making it more likely that we'll miss any bugs for longer.

i.e. the config flag is really just to make it easier for us (Synapse devs) to test it on our own deployments without risking enabling it by default, rather than an escape hatch for Synapse server admins.

@richvdh
Copy link
Member

richvdh commented Jul 26, 2021

It was discovered that this broke Element Android, and so has been disabled again in v1.38.1 #10456 / #10457.

to be clear: #10457 is a partial reversion: only device_one_time_keys_count has been reintroduced.

@deepbluev7
Copy link
Contributor

As @richvdh says, only the device_one_time_keys_count reverted. Furthermore, I don't think Element Android works with that field left out even after the fix. It only updates OTKs, if it finds that field in /sync. If it is not present, but all the OTKs got exhausted in the mean time, with my changes synapse would not send that field down /sync (since there are no OTK counts inside). That would still break Element Android, if OTKs get exhausted after initial sync. See the discussion on element-hq/element-android#3724

I might be misunderstanding things. I am not familiar with the Element Android codebase nor am I familiar with kotlin, but that is what I think I am understanding, when reading that code. So I don't think device_one_time_keys_count can be removed when empty without breaking Element Android in some edge cases.

So imo the following things need to be investigated still:

  • If all the OTKs get exhausted server side, does synapse send "signed_curve25519": 0", or does it currenlty just send an empty OTK object?
  • If it sends an empty OTK object, does Element Android still upload OTKs, if it is not an initial sync?

@bmarty
Copy link

bmarty commented Jul 26, 2021

It only updates OTKs, if it finds that field in /sync

This is not true, if Element Android does not know how many OTK the server has, it explicitly asks it.

For your first question @deepbluev7 the answer MUST be yes, if the info has not been sent to the client yet.

For the second question, the answer is no, if the known count of OTK does not require to send new OTKs (So I think this is fine). If the count change, the sync response will contain the new count.

So to fix the issue properly, the server MUST store what was the last count sent to the client, and send the new count each time the count change.

@bmarty
Copy link

bmarty commented Jul 26, 2021

So to be clear, as a client POV:

  • if OTK object is empty, the client understand that there is no change. This rule is true for every other fields of a sync response. Client should explicitly ask for the number of OTK if they do not have the info (for instance the info was kept only in memory and the app was killed, or after an init sync)
  • if OTK object is not empty, clients can read the number of remaining OTKs and decide to send more OTKs if necessary.

@deepbluev7
Copy link
Contributor

I am pretty sure I have seen synapse not send the count, if the count is 0, even if you had OTKs before and they just got exhausted. This messed up our E2EE a few times. I'm not sure if Element Android handles that specific edge case correctly, but it sounds from your explanation, that it at least handles the other edge cases correctly now and leaving out the device_one_time_keys_count should be safe. Thanks for the explanation @bmarty! I think if synapse does not send an update once the OTKs drop to 0, that is a synapse bug then?

@richvdh
Copy link
Member

richvdh commented Jul 26, 2021

I am pretty sure I have seen synapse not send the count, if the count is 0, even if you had OTKs before and they just got exhausted.

looking at the code that got reverted in #10457, I think that is the behaviour I would expect. AFAICT Synapse does not keep a record of what was the last count sent to the client, and rather (pre-#10457) it simply omitted the field if there are no one time keys. That does indeed seem like a synapse bug.

In the light of that, it seems pretty clear to me that we wouldn't want to reintroduce this code in its previous form.

So the question is, would we want to rework it so that it can be landed. My impression is that it wouldn't be worth it for the sake of about 50 bytes per /sync response. In any case I don't think it needs to be a priority.

@deepbluev7
Copy link
Contributor

No, I don't think this needs to be introduced that badly. I think focusing on removing the empty keys from inside join and such would gain us more savings.

@bmarty
Copy link

bmarty commented Jul 27, 2021

I've created #10484 to track the issue reported by @deepbluev7 .

Also @deepbluev7, instead of stripping the sync response after it is computed by the server, isn't it possible to fix the problem of empty Json fields when the sync response is built?

@deepbluev7
Copy link
Contributor

@bmarty I am not a synapse developer, I just fixed a small issue that annoyed me. But what my changes do is not add specific fields to the response, if their value would be empty. So I think we are already doing the "when the sync response is built", just that at the top level of the response the difference is not that big. Unless I am misunderstanding something?

@bmarty
Copy link

bmarty commented Jul 27, 2021

OK. To clarify:

  • Synapse builds a sync response, with some empty fields by default, intended to be populated during this process. I do not know the details here.
  • After the sync response is built, your PR strip the empty fields before it's sent to the client

My point is: at first step do not provide empty Json, but add the fields only if necessary

@deepbluev7
Copy link
Contributor

No, my PR doesn't add toplevel empty fields at all. Look for example at the to_device messages or the joined rooms. First a function is called, that calculates if there are to_device messages or updates in joined rooms. Those return either the update in that section or an empty list. You need to call that anyway to know, if you have an update. Then my code checks if the list is not empty and adds it to the json in that case. So my PR changes the place where the json is built to not add those fields, not process the json afterwards to remove those fields. The only limitation is that it currently only does this for the top level. At the lower levels of the /sync response there are still some empty lists added to the response, so they get serialized as such. You also need to be careful to not add timeline: { events: [] }, where you don't add the timeline, if the events list is empty. But in general I think my PR is already doing what you want? But if you know how to improve it, I would be happy to see that!

@bmarty
Copy link

bmarty commented Jul 27, 2021

Oh I see, my bad, I've read again your PR, and I understand what it does. I was thinking about this other PR: #8336, which probably fixed completely #6579, but was rejected, apparently because it broke the unit test.

@richvdh
Copy link
Member

richvdh commented Jul 27, 2021

ok so it seems like whatever we do we don't want to reintroduce this change as it stands. In any case we had better resolve matrix-org/matrix-spec-proposals#3298 before doing so.

Accordingly, I'm going to close this issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

4 participants