RFC: Replication protocol #5

pgte commented May 29, 2018

RFC: peer-crdt-ipfs replication protocol

Peer-crdt was created as an extensible, abstract and understandable way of reasoning about operation-based CRDTs in a content-addressable world like IPFS. Having a native DAG representation of operations allowed us to use IPFS synchronisation primitives without having to care about the details, with all the automatic benefits of being able to use IPLD, opaque replication, content routing, etc.

But we're still facing some performance issues that are inherent to the limited resources available in browsers and to all the work that needs to happen in js-ipfs to make graph synchronisation work efficiently.

The goal here is to create a protocol that allows peers to synchronise the CRDT state and operations with each other efficiently. Having successfully implemented this protocol, we can then consider extrapolating some of it into IPFS territory.

Besides trying to build a faster sync, we also want to take this opportunity to address some pending issues like:

  • Form an intentional topology that is less resource-consuming
  • Enable operation truncation (through calculating causal stability)
  • Faster bootstrap for new nodes

Protocol overview

This protocol should allow a replica to connect to another replica and declare its current state. The replica state can be compressed into a vector clock that identifies the latest operation that has been received.
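
A minimal sketch of such a state summary, assuming a vector clock keyed by replica ID (the types and helper names below are illustrative, not part of the current peer-crdt API):

```typescript
// A vector clock maps each replica ID to the number of operations seen from that replica.
type VectorClock = Map<string, number>;

// True if clock `a` covers everything that clock `b` has seen.
function dominates(a: VectorClock, b: VectorClock): boolean {
  for (const [replicaId, count] of b) {
    if ((a.get(replicaId) ?? 0) < count) return false;
  }
  return true;
}

// Entry-wise maximum of two clocks, used when applying remote operations.
function merge(a: VectorClock, b: VectorClock): VectorClock {
  const merged = new Map(a);
  for (const [replicaId, count] of b) {
    merged.set(replicaId, Math.max(merged.get(replicaId) ?? 0, count));
  }
  return merged;
}
```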

Overlay topology

Instead of the everyone-connects-to-all topology we currently have, we need a topology that provides a better balance between latency and the number of maintained connections. Since the web environment is very constrained, maintaining a high number of connections is to be avoided. In the case of CRDTs, since we don't require consensus to make progress, we can afford to have a maximum of f (fan-out) connections per node. Some of these connections will be using eager push and others will be using lazy push. At any given time we can switch a lazy connection into an eager one and vice-versa.

We can use the principles in Epidemic Broadcast Trees to form and maintain such a topology:

Topology protocol

  • Each node has a fanout f, the number of neighbour nodes it uses to gossip with
  • A subset of those f neighbours uses eager push
  • The rest use lazy push

The connections with eager push are selected in a way that their closure effectively builds a broadcast tree.
Lazy push connections are used to ensure gossip reliability when nodes fail and to quickly heal the broadcast tree.

In each node, the set of gossip peers is not changed between gossip rounds. The same peers are used until failures are detected.

To construct the overlay, we use the Spanning Tree Construction and Repair algorithms presented in Epidemic Broadcast Trees.
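
A rough sketch of the per-node state this implies, loosely following the graft/prune idea from Epidemic Broadcast Trees (Plumtree); the class and method names are illustrative only:

```typescript
// Per-node view of the overlay: at most `fanout` neighbours, split between
// eager push (full operations) and lazy push (announcements only).
class OverlayNeighbours {
  private eager = new Set<string>();
  private lazy = new Set<string>();

  constructor(private readonly fanout: number) {}

  // New neighbours start as eager; duplicates will later prune them into lazy mode.
  add(peerId: string): boolean {
    if (this.eager.size + this.lazy.size >= this.fanout) return false;
    this.eager.add(peerId);
    return true;
  }

  // A duplicate operation arrived over this link: demote it to lazy push (prune).
  demoteToLazy(peerId: string): void {
    if (this.eager.delete(peerId)) this.lazy.add(peerId);
  }

  // A lazily announced operation never arrived eagerly: promote the link back
  // to eager push (graft), repairing the broadcast tree.
  promoteToEager(peerId: string): void {
    if (this.lazy.delete(peerId)) this.eager.add(peerId);
  }

  eagerPeers(): string[] { return [...this.eager]; }
  lazyPeers(): string[] { return [...this.lazy]; }
}
```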

Bootstrap

Here we have two modes of bootstrapping: prime or replica.

In prime mode, a replica is immediately considered bootstrapped. This can be useful when creating a new CRDT instance, where we don't want to block until there is another replica we can sync with.

Bootstrapping in the replica mode

In replica mode, a replica blocks until it's bootstrapped. This is used when the current node is not the CRDT instance creator. Having this mode allows us to maintain causal stability by preventing newly joining nodes from entering the CRDT directly from the initial (zero) state.

In replica mode, the bootstrap network protocol is performed with a remote replica once we can connect. To optimize, a replica should not initiate the bootstrap protocol with more than one node at the same time.

Bound to a CRDT instance over a mutually-authenticated and private channel, a peer-to-peer connection starts with each peer immediately announcing its current state in the form of a vector clock.

When receiving that remote state, a peer:

  • updates their causal stability table (see further down)
  • computes and sends the missing operations

As a part of the general sync protocol, when sending each one of the operations, the replica updates the connection-bound vector clock that represents the state of the other peer.

The bootstrap protocol is said to be complete once a replica reaches the state (vector clock) the remote peer initially declared.
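
A sketch of that replica-mode bootstrap exchange, assuming a hypothetical Connection interface and reusing the dominates/merge helpers from the vector-clock sketch above:

```typescript
interface Operation {
  clock: VectorClock;   // the clock this operation advances to
  payload: unknown;
}

// Hypothetical transport for a mutually-authenticated connection to one remote replica.
interface Connection {
  sendClock(clock: VectorClock): Promise<void>;
  receiveClock(): Promise<VectorClock>;
  receiveOperation(): Promise<Operation>;
}

// Announce our state, learn the remote's state, and apply incoming operations
// until our clock reaches the clock the remote peer initially declared.
async function bootstrapReplica(
  conn: Connection,
  localClock: VectorClock,
  apply: (op: Operation) => void
): Promise<VectorClock> {
  await conn.sendClock(localClock);
  const remoteClock = await conn.receiveClock();

  while (!dominates(localClock, remoteClock)) {
    const op = await conn.receiveOperation();
    apply(op);
    localClock = merge(localClock, op.clock);
  }

  // Bootstrap complete; the normal sync protocol takes over from here.
  return localClock;
}
```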

From then on, the normal sync protocol is in action.

Sync protocol

After bootstrapping, a connection with a remote replica can be in one of two states:

  • Eager-push: where updates are sent immediately
  • Lazy-push: where only the vector clock of the latest operation is transmitted

Preemptive (eager) anti-entropy

As said before, when connected to a remote replica, a replica associates the last sent operation with that connection, in the form of a vector clock (bear in mind that this is not the vector clock in the causal stability table; see further down).

When creating an operation or receiving an operation from another replica, a replica should:

  • by looking at the connection's vector clock, check whether the remote replica has already received that operation
  • if not, send it and update the connection vector clock (see the sketch after this list)
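
A sketch of that check, assuming each eager connection carries the clock last known for the remote peer (names are illustrative; the helpers come from the earlier sketches):

```typescript
// Per-connection view of what the remote replica is already known to have.
interface EagerConnection {
  knownRemoteClock: VectorClock;
  sendOperation(op: Operation): void;
}

// Called whenever a replica creates an operation or receives one from elsewhere.
function forwardIfMissing(op: Operation, connections: Iterable<EagerConnection>): void {
  for (const conn of connections) {
    // If the remote's known clock already covers this operation, skip it.
    if (dominates(conn.knownRemoteClock, op.clock)) continue;
    conn.sendOperation(op);
    // Advance the connection-bound clock so the same operation is not re-sent.
    conn.knownRemoteClock = merge(conn.knownRemoteClock, op.clock);
  }
}
```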

Receiving an operation

When receiving an operation from a remote replica, a replica should:

  • extract the replica ID and the vector clock from the operation
  • update the causal stability table (replica ID -> vector clock)
  • check whether that operation (lookup by vector clock) has already been processed
  • if not, process it
  • forward it to the other connections in eager mode that are missing it (sketched below)
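
A sketch of this handler, assuming the operation carries its originating replica ID and that the op-log can be looked up by vector clock (again, the names are illustrative):

```typescript
interface ReceivedOperation extends Operation {
  origin: string;   // ID of the replica that created the operation
}

function onOperation(
  op: ReceivedOperation,
  causalStabilityTable: Map<string, VectorClock>,    // replica ID -> latest clock seen
  alreadyProcessed: (clock: VectorClock) => boolean, // op-log lookup by vector clock
  process: (op: ReceivedOperation) => void,
  eagerConnections: EagerConnection[]
): void {
  // 1. Extract replica ID and clock; update the causal stability table.
  causalStabilityTable.set(op.origin, op.clock);

  // 2. Drop operations that were already processed.
  if (alreadyProcessed(op.clock)) return;

  // 3. Process the new operation locally.
  process(op);

  // 4. Forward it to eager-mode connections that are still missing it.
  forwardIfMissing(op, eagerConnections);
}
```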

Vector clocks

Besides being content-addressable, the operation store must also be addressable by vector clock.

Also, we don't need to transmit the entire vector clock all the time. As an optimisation, a replica can just send a vector clock delta (only the parts of the vector clock that have changed).
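
A sketch of that delta optimisation: only the entries that changed since the last transmission are sent, and the receiver merges them back into the clock it tracks for that connection (helper names as before):

```typescript
// Entries of `current` that differ from what was last sent over this connection.
function clockDelta(lastSent: VectorClock, current: VectorClock): VectorClock {
  const delta: VectorClock = new Map();
  for (const [replicaId, count] of current) {
    if ((lastSent.get(replicaId) ?? 0) !== count) delta.set(replicaId, count);
  }
  return delta;
}

// The receiver reconstructs the full remote clock by merging the delta into
// its tracked clock (using `merge` from the vector-clock sketch above).
```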

Causal Stability

To be able to truncate the operation logs, we need to maintain a causal stability table, as described here.
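
One common way to derive causal stability from such a table is the entry-wise minimum over the clocks reported by all known members: any operation covered by that minimum clock has been seen everywhere and can be truncated. A sketch, assuming the table has an entry for every current member:

```typescript
// Entry-wise minimum over all members' clocks: everything at or below this
// clock is causally stable and eligible for op-log truncation.
function stableClock(table: Map<string, VectorClock>): VectorClock {
  const clocks = [...table.values()];
  if (clocks.length === 0) return new Map();

  const stable = new Map(clocks[0]);
  for (const clock of clocks.slice(1)) {
    for (const [replicaId, count] of stable) {
      stable.set(replicaId, Math.min(count, clock.get(replicaId) ?? 0));
    }
  }
  return stable;
}
```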

Membership

As operations contain data about the originating node, they can be used to infer membership.
That can serve as a way to gossip about membership, but we still need a protocol for a node to leave the membership.

The easiest way here is to create CRDT operations that signify membership changes. A node, when joining the membership, creates a membership operation that will be replicated. When leaving, another membership operation gets added.
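
A sketch of what such membership operations could look like, with the current membership derived by folding them in causal delivery order (types are illustrative):

```typescript
// Membership changes replicate like any other CRDT operation.
type MembershipOp =
  | { kind: 'member-join'; replicaId: string }
  | { kind: 'member-leave'; replicaId: string };

// Fold the membership operations (in causal delivery order) into the current member set.
function currentMembership(ops: Iterable<MembershipOp>): Set<string> {
  const members = new Set<string>();
  for (const op of ops) {
    if (op.kind === 'member-join') members.add(op.replicaId);
    else members.delete(op.replicaId);
  }
  return members;
}
```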

Timeout

When a node is offline for some time, it prevents causal stability from progressing, which in turn prevents op-log truncation from progressing. Here we need a protocol to remove a foreign node from membership (triggered either by a timeout or by a manual operation). To make progress, this protocol must touch all remaining members, to make sure that no member has stored an operation from the node about to leave.

References
