get_auth_chain_difference is super expensive #8622

richvdh · 2020-10-21T21:50:06Z

The get_auth_chain_difference transaction can often take hundreds of seconds of database time (during which all other traffic to the room is queued). This is a large component of database load from state resolution.

Caching the results of this transaction wouldn't be useful as each result is typically used only once (to populate the cache of state-resolution results), but could we cache some of the inputs which could get reused across multiple transactions (for example, the list of auth event ids for each event)?

The text was updated successfully, but these errors were encountered:

erikjohnston · 2020-10-26T15:44:26Z

Another thought here is that we may be able to change the database schema to make it more efficient to query. I'm not sure what this looks like in practice, but could be something like splitting up the auth chains into larger blocks to make it easier to reduce the search space? Or representing the graph in a different format?

erikjohnston · 2020-11-17T17:11:07Z

but could we cache some of the inputs which could get reused across multiple transactions (for example, the list of auth event ids for each event)?

This was done in #8672

erikjohnston · 2020-11-18T11:31:50Z

I've been thinking and experimenting a bit with this, and so I'm going to jot down a few notes here.

Having found a couple of examples of really slow calls to get_auth_chain_difference the problem appears to be when you have long chains of e.g. membership of events for a single user. This results in the algorithm taking a while walking just walking that auth chain.

The current algorithm works by walking the the auth event graph in depth order, keeping track of which events are reachable from the different state sets. Looking at the implementation and some sample runs I don't see any obvious dramatic performance improvements, and so we will likely need to make changes to the algorithm to get the necessary improvements.

Having searched around a bit I can't see anything in the literature that is directly applicable here, though there is quite a lot about efficient reachability lookups that is relevant. What we're likely looking at is adding some form of "index" that will helps us to more quickly compute the auth chain difference. Some of the key properties we're looking for here are:

The index used isn't "too large" (i.e. sizes of O(n^2) are out).
Supports "online" updates, i.e. adding new nodes doesn't invalidate entries in the index (assuming we don't add nodes with incoming edges). AFAICT there isn't much talk of online indices, though some algorithms do look like they can be adapted to support online updates.

http://www.cse.ust.hk/~dimitris/6311/L24-RI-Parchas.pdf has an interesting table comparing various algorithms and their performance vs index size properties, with rough overviews of each. It's worth noting that a lot of the performance characteristics depend on the "shape" of the graph, e.g. how sparse it is.

My current avenue of attack is to note that the transitive reduction of the auth chain graphs appear to mostly be very "spikey", i.e. we have long streaks of events which only have a single incoming and a single outgoing edge. This naively feels like something that should be easy to index along the lines of "this event is part of the streak that starts at X and ends at Y", which would allow efficiently skipping to the end (rather than iteratively walking down the chain).

In fact, this is very similar to the idea of a "chain cover" of a graph, where you partition the graph into "chains", like so (taken from the PDF above):

By labelling the chains and adding sequence numbers you can:

Quickly tell how two nodes in a chain relate by comparing their sequence numbers
Indexing (chain_id, sequence number) lets you quickly pull out all events between two events in a chain
Storing where chains point to each other as mappings of (chain_id, sequence number) -> (chain_id, sequence number) lets us do a naive graph comparison on a much smaller graph.
Alternatively we could store the "transitive closure" of that chain graph, i.e. store (chain_id, sequence number) -> (chain_id, sequence number) for every path from one chain to another. Checking if an event (C1, S1) is reachable from another (C2, S2) would then be a case looking up if there is a path from C1 -> C2 which originates at a sequence number reachable from S1 and terminates at a sequence number that can reach S2.

The above can be used to generate the auth chain difference by for each state set computing the maximum sequence number in each chain that is reachable by that state set. The auth chain difference is then all events in each chain between the maximum and minimum sequence numbers reachable by each state set. This is easy in a world where we have index as described in 4 above, but is a bit harder if we only have an index of 3 above. It's not clear to me what the practical cost of 3 vs 4 would be right now.

A simple way of computing a chain cover "online" would be to assign chains for each type/state key pair (though noting we may need multiple chains for a type/state key if there is a fork, but that should hopefully be rare). While it's impossible to generate the most efficient chain cover "online" (since it depends on the shape of the graph, which can theoretically change drastically as we get new events), we do know that long chains in the auth graph will most likely come from events of the same type/state key, and so basing the chains off type/state key should hopefully give us a decent chain covering. (Though we may be able to refine that a bit).

Building up the reachability index when inserting a new state event should be a case of:

Fetch auth events, dropping those reachable by each other (e.g. if you have two joins J1 and J2 both pointing at P1 then we ignore P1). This reduces redundant edges to make the graph closer to its transitive closure (we could also use the existing reachability to compute the actual transitive closure, but this may or may not be worth is).
If there is only one auth event left and its in the same chain, stop.
Otherwise, for each auth event not in the same chain get their chain/sequence number and check if that is reachable by the new event's chain already. If not then we need to update the index (what we insert depends on the type of index being used).

erikjohnston · 2020-11-18T18:22:53Z

An implementation that uses the index from 4 takes about ~100ms on my laptop with postgres compared to ~1s with the old algorithm. So ~10x improvement.

The index tables combined have about are about a third the size of event_auth, which feels acceptable.

erikjohnston · 2020-11-18T18:24:36Z

Next steps here is to figure out how to handle existing rooms that don't have the index. A naive solution would be to default to the old algorithm for existing rooms, but that would suck for existing servers and rooms.

richvdh mentioned this issue Oct 21, 2020

Very slow state res in local rooms #8612

Closed

anoadragon453 added the A-Performance Performance, both client-facing and admin-facing label Oct 22, 2020

richvdh mentioned this issue Oct 26, 2020

Incremental State Resolution #8659

Open

erikjohnston mentioned this issue Nov 12, 2020

Cache event ID to auth event IDs lookups #8752

Merged

erikjohnston self-assigned this Nov 20, 2020

erikjohnston mentioned this issue Dec 3, 2020

Use a chain cover index to efficiently calculate auth chain difference #8868

Merged

erikjohnston closed this as completed in #8868 Jan 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_auth_chain_difference is super expensive #8622

get_auth_chain_difference is super expensive #8622

richvdh commented Oct 21, 2020

erikjohnston commented Oct 26, 2020

erikjohnston commented Nov 17, 2020

erikjohnston commented Nov 18, 2020 •

edited by ara4n

Loading

erikjohnston commented Nov 18, 2020

erikjohnston commented Nov 18, 2020

get_auth_chain_difference is super expensive #8622

get_auth_chain_difference is super expensive #8622

Comments

richvdh commented Oct 21, 2020

erikjohnston commented Oct 26, 2020

erikjohnston commented Nov 17, 2020

erikjohnston commented Nov 18, 2020 • edited by ara4n Loading

erikjohnston commented Nov 18, 2020

erikjohnston commented Nov 18, 2020

erikjohnston commented Nov 18, 2020 •

edited by ara4n

Loading