CRDT: Garbage Collection #2
Ideally, all participating nodes should agree on a snapshot at some point in time; that snapshot would be what is transmitted to newly joining nodes first, followed by the operations. Garbage collection is an active research topic but, to my knowledge, it boils down to initiating a checkpoint and being able to get consensus on it. To get consensus, you have to know the participating replicas and then get a majority of those replicas to agree on that checkpoint. The checkpoint would then consist of the state of the replica at that point (entry CID). So, as far as I know, we would need a consensus protocol attached to the CRDT in order to get garbage collection / compaction.
Interesting discussion around ORDTs and garbage collection in this research issue:
With a hash of the previous operations embedded within each new operation, nodes can participate in the network with just the latest entries. It's the sort of thing that both Git and blockchains use for storage optimization & validation. Checkpoints could be hard-coded: for example, after 1000 operations, everyone needs to synchronize on that point to start a new chain from scratch. The checkpoint could be regarded as a new "genesis block" in some sense. After that, all the previous operations/entries could be pruned without any problem.
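To make the idea concrete, here is a minimal TypeScript sketch of a hash-chained operation log with a hard-coded checkpoint interval, along the lines described above. The `Op` shape, `CHECKPOINT_INTERVAL` and the SHA-256 choice are illustrative assumptions, not anything an existing implementation defines.

```ts
import { createHash } from "crypto";

// Illustrative only: each operation embeds the hash of the previous one,
// so the log forms a chain (as in Git commits or blockchain blocks).
interface Op {
  payload: string;   // the CRDT operation itself, serialized
  prevHash: string;  // hash of the previous operation ("" for the genesis op)
}

const CHECKPOINT_INTERVAL = 1000; // hypothetical: checkpoint every 1000 ops

const hashOp = (op: Op): string =>
  createHash("sha256").update(op.prevHash + op.payload).digest("hex");

class HashChainedLog {
  private ops: Op[] = [];
  private headHash = "";

  append(payload: string): void {
    const op: Op = { payload, prevHash: this.headHash };
    this.ops.push(op);
    this.headHash = hashOp(op);
  }

  // Once everyone agrees on the checkpoint hash, it acts as a new "genesis":
  // older operations can be pruned and only the checkpoint hash is kept.
  maybeCheckpoint(): string | undefined {
    if (this.ops.length >= CHECKPOINT_INTERVAL) {
      const checkpointHash = this.headHash;
      this.ops = []; // prune everything before the checkpoint
      return checkpointHash;
    }
    return undefined;
  }
}
```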
(I've not read that paper, though. I'll read it as soon as possible.)
@marcoonroad sounds like a good strategy in general, but since there is no consensus layer in CRDTs, how do we make this converge?
Oh, you're right. CRDTs are consensus-free. There's no single valid Byzantine-generals story to agree upon, since all stories are valid and eventually a derived common story is seen by all nodes.
Oh, and sorry. Synchronization on a distributed/P2P network is almost impossible. I had forgotten that entirely! CRDTs are a quite complex & recent topic.
I've been thinking about this problem for a bit but TBH haven't checked much literature on the subject, so I'm sorry if I'm suggesting something which might be trivial or useless because it has been described somewhere else 😅

**Rationale**

I agree with @pgte when he says that GC seems to boil down to 'initiate and being able to get consensus on a checkpoint'. My initial idea would be for every replica to keep track of what it knows about the network: more specifically, which replicas exist, what their perspectives on the network are, and the last CRDT operation each of them applied. This way, we can decide locally when all the replicas agree on a state that can be snapshotted. The network state is kept locally in a `network_map`. If we make sure that the `network_map` eventually reflects every replica's view, each replica can decide locally when it is safe to snapshot.

**How?**

The operation-based CRDT keeps a map with its current network visibility. The map's key is a replica (ReplicaID) and the value consists of the last operation (OperationID) applied by that replica, as well as the replicas that node knows exist in the network. E.g. with replicas R1 and R2 and latest operation O1 applied locally in both replicas, both replicas would have the following
`network_map` in R1 and R2:

| replica | last operation | known replicas |
| --- | --- | --- |
| R1 | O1 | R1, R2 |
| R2 | O1 | R1, R2 |

When all the values in the local `network_map` agree (same last operation and same set of known replicas everywhere), the replica knows a snapshot point has been reached. When a new replica (R3) joins the network, it asks one of the existing replicas for the current state (e.g. asks R2). So, we get the following network maps in each replica:
`network_map` in R1:

| replica | last operation | known replicas |
| --- | --- | --- |
| R1 | O1 | R1, R2 |
| R2 | O1 | R1, R2 |
`network_map` in R2 and R3:

| replica | last operation | known replicas |
| --- | --- | --- |
| R1 | O1 | R1, R2 |
| R2 | O1 | R1, R2, R3 |
| R3 | O1 | R1, R2, R3 |

In this case, although the last operation is the same, R2 and R3 cannot snapshot the CRDT, since R1 doesn't know about R3 yet and it could have changed its local state between the moment R1 and R2 were in consensus and the moment R3 joined the network. Every time a replica applies an operation locally, it updates its own entry in the `network_map`. When a replica receives a new `network_map` from a peer, it merges it into its local one.
Every time the local `network_map` changes, it is gossiped to the other replicas.

--

The main goal of the `network_map` is to give each replica enough information to decide locally when it is safe to snapshot. This is only a conceptual idea, but I haven't found any case yet that would make this snapshotting consensus model go wrong. I will try to draw some interaction diagrams and try to find flawed cases.
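A minimal TypeScript sketch of how I read the `network_map` proposal; the type names and the `snapshotReady` check are my interpretation of the description above, not an existing API.

```ts
// Hypothetical shapes, following the description above: for every replica we
// know about, we record its last applied operation and the replicas it knows.
type ReplicaID = string;
type OperationID = string;

interface ReplicaView {
  lastOp: OperationID;
  knownReplicas: Set<ReplicaID>;
}

type NetworkMap = Map<ReplicaID, ReplicaView>;

// A snapshot point is reached (locally) when every view in the map reports
// the same last operation and the same membership, and that membership
// matches the set of replicas we are actually tracking.
function snapshotReady(map: NetworkMap): boolean {
  const views = [...map.values()];
  if (views.length === 0) return false;
  const sameSet = (a: Set<ReplicaID>, b: Set<ReplicaID>): boolean =>
    a.size === b.size && [...a].every((r) => b.has(r));
  const [first, ...rest] = views;
  return (
    sameSet(new Set(map.keys()), first.knownReplicas) &&
    rest.every(
      (v) => v.lastOp === first.lastOp && sameSet(v.knownReplicas, first.knownReplicas)
    )
  );
}
```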
I think I understand this proposal, which I think can be summed up as: "When I think all the nodes have the same view of the CRDT, I make a snapshot."

One question: when nodes diverge (make concurrent edits), each will have a different HEAD operation. Although their state will eventually be equal, the last operation is potentially different: they can receive operations in different orders, etc. So I think you can generalise this to a map where you keep track of each node's latest seen operation.

Another question: why do you need to gossip the known state? The way I see it, you can update the local view from the CRDT gossip that already happens. When a node says "I have a new HEAD and it's this", you can update your local network representation immediately.

Another question: after updating the local network view, another node in the CRDT can continue to perform local operations concurrently while this replica is computing the snapshot. Do you think this could be a problem? If so, should we create the snapshot inside the operation tree (saying something like "this snapshot is a child of operations A, B and C")?

@gpestana thoughts?
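For the last question, a tiny sketch of what a snapshot-inside-the-operation-tree could look like; the `SnapshotEntry` and `canAdoptSnapshot` names are hypothetical, just to illustrate the shape of the idea.

```ts
// Hypothetical shape: a snapshot recorded as a node in the operation DAG,
// whose parents are the operations it covers (A, B and C in the example).
interface SnapshotEntry {
  kind: "snapshot";
  parents: string[]; // operation IDs the snapshot is a child of
  state: unknown;    // the materialized CRDT state at that point
}

// A replica only adopts the snapshot once it has applied all of `parents`;
// operations issued concurrently simply become descendants of the snapshot
// (or of the operations it covers) and are replayed on top of it.
function canAdoptSnapshot(applied: Set<string>, snap: SnapshotEntry): boolean {
  return snap.parents.every((p) => applied.has(p));
}
```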
@gpestana ah, I realize that you still need gossip to know what the network view of other nodes is (the list of nodes).
Hi everyone. It's something like this: an operation m is causally stable at a given node once all the operations that node has yet to deliver are in the causal future of m. Note that stability is a local property, meaning that, if m is stable on some node, it's not necessarily stable on all.
Thanks for the comments @pgte. My thoughts:
Good point! The `network_map` should indeed track each node's latest seen operation rather than a single shared HEAD.
Exactly. The idea is for the node to keep not only its own view of the network but also everyone else's view. This way we can decide locally whether - and when - it's safe to snapshot.
Yep, that's a good point. But I believe it should not influence the snapshotting point, since after an operation is applied locally, the local network map will be 'tainted'. Only when all the replicas in the network map have applied the operation set is the snapshot point reached. And this can be checked by inspecting the local network map.
That sounds like a good solution too. Would that mean appending a snapshot once in a while, with the replicas picking up the snapshot once all the operations (A, B, C in your case) have been fulfilled? If so, what do you think about the overhead?

@vitorenesduarte hey there! That's a very good point. Do you know of any way to reach some sort of "strong stability", in which you know locally that all replicas have reached common ground (applied all operations) at some point in the past? This is what this network map tries to achieve, by basically keeping everyone's view of the network locally, along with the last HEAD / list of operations applied in each replica.
@gpestana Ah, sorry, I was thinking more of how you could use this to create a snapshot (to perhaps sync remote nodes faster), but this is directly related to garbage collection. Which means that a replica is going to truncate all the operations below this "snapshotting point", right? Let's consider this scenario, then:
Does this make sense, is it a problem?
@gpestana A consequence of all next operations being in the future of some stable operation m is that no operation concurrent with m can ever arrive, so everything in the causal past of m can safely be compacted.
@pgte that makes sense. Let's say this network has only 3 replicas (rA, rB, rC). The assumption is that if rC is part of the network, rA or rB knows about it and keeps rC in its network map (because either of them shared the state with rC when it joined the network - let's say rC was the last node to join). Thus, rA and rB can only truncate the document when rC has the same network map as them, so that all nodes know that everyone exists and that a minimum set of operations was applied in each replica. This relies on the fact that every time some replica (rN) joins the network, it will ask at least one replica (rA) for the current document. rA then adds rN to its network map. From then on, rA will gossip in its network map that rN exists. So whenever other replicas see rA's new network map, they will add rN to their own network maps, which creates a dependency so that no one will truncate the document before time. I'm trying to find a way to write down/make diagrams of these interactions 😅
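A rough TypeScript sketch of that join step, using the same hypothetical `network_map` shapes as the earlier sketch (assumed names, not actual code from this repo).

```ts
// Assumed shapes (same as the earlier sketch): per known replica we track its
// last applied operation and the replicas it believes exist.
type ReplicaID = string;
interface ReplicaView { lastOp: string; knownReplicas: Set<ReplicaID>; }
type NetworkMap = Map<ReplicaID, ReplicaView>;

// rN asks rA for the current document; rA adds rN to its own network_map
// before replying, so from then on rA gossips that rN exists.
function handleJoin(local: ReplicaID, map: NetworkMap, joiner: ReplicaID): NetworkMap {
  const members = new Set([...map.keys(), joiner]);
  const localView = map.get(local)!;
  // the joiner starts from the responder's state, so it shares its last op
  map.set(joiner, { lastOp: localView.lastOp, knownReplicas: members });
  // the responder itself now also lists the joiner
  localView.knownReplicas.add(joiner);
  // the reply carries a copy of the whole map; gossip spreads it further
  return new Map(map);
}
```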
@gpestana an interaction diagram of this would be super :) So, at some point rB can be the only one to know about rN (maybe because rB just now discovered rN). @gpestana makes sense?
I have no knowledge of how IPFS works, but I was just reading about CRDTs and thinking about garbage collection. It's funny that I stumbled upon this discussion just as it is happening. Here is my idea: can we get nodes to agree implicitly on which CRDT states are checkpoints by observing a common rule? For example, we could calculate a hash of the state and decide to checkpoint only if the first n bits of the hash are zeroes (just like proof of work), then calibrate n so that a checkpoint state is generated every few minutes on average. When such a checkpoint state is found, it is broadcast to the network. Every node rebases its changes on the latest discovered checkpoint state. Sorry if this is stupid or irrelevant.
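For what it's worth, here is a minimal TypeScript sketch of such an implicit rule: hash the (deterministically serialized) state and treat it as a checkpoint candidate when the first n bits are zero. The `DIFFICULTY_BITS` value is a made-up calibration.

```ts
import { createHash } from "crypto";

// Illustrative rule from the comment above: a state is a checkpoint candidate
// when its hash starts with n zero bits; n tunes how often checkpoints appear.
const DIFFICULTY_BITS = 16; // hypothetical calibration

function isCheckpoint(serializedState: string, bits = DIFFICULTY_BITS): boolean {
  const digest = createHash("sha256").update(serializedState).digest();
  let remaining = bits;
  for (const byte of digest) {
    if (remaining >= 8) {
      if (byte !== 0) return false; // this whole byte must be zero
      remaining -= 8;
    } else {
      // only the top `remaining` bits of this byte must be zero
      return (byte >> (8 - remaining)) === 0;
    }
  }
  return true;
}
```

Because every node applies the same deterministic rule to the same state, they agree on checkpoints without exchanging any extra consensus messages.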
@Alexis211 yes, I think that's a valid protocol where you replace explicit consensus with implicit consensus. But I think this is better suited to creating snapshots (for faster sync); it can't safely be used to truncate the operation log (because of the nodes that may be offline).
adding @satazor and @marcooliveira, they're interested in GC for CmRDTs.
So I've recently learned about causal stability (from @vitorenesduarte and a colleague I cannot find on Github). Let me see if I get this right:

A given operation (or, more generally, message) is causally stable if all operations yet to be received are in the future of this operation. This means that, if an operation is stable, we don't expect any more operations concurrent with it, and thus we can compact all the operations preceding it out of the operation log.

So, in order for a replica to be able to compact the log, it needs to know which operations are causally stable. This can be achieved by a) knowing the entire replica membership and b) storing the latest vector clock for each replica. Each operation received from a replica carries causal information in the form of a vector clock, and for every operation that can be causally delivered (for which we have delivered all the dependent operations), we store that vector clock. For instance, if we have replicas A, B and C, in replica A we can build a table containing the latest vector clock replica A knows for each peer:
As time progresses, replica A starts receiving operations and will then update this table to, for instance:
Replica A can then infer which message is causally stable by taking the point-wise minimum of all these vector clocks.
We can then infer that all the operations with vector clocks equal to or lower than this point-wise minimum are causally stable and can be compacted.

**What about nodes joining?**

When a new node D joins, it has to introduce itself to another node and then get its state. Let's say that node D connects to node C to get the most recent state. Now node C knows about node D, which will start from the state that node C has. Even if node C progresses while node D is bootstrapping, node C still knows about node D, will create a corresponding entry in its "causal stability table", and so will never compact history below it.

**What about nodes leaving?**

The problem with this approach is that a node should have a leave protocol. If a node knows it's leaving, it should notify other nodes so that it gets removed from the table. If a node crashes and is removed from that table, either:
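A small TypeScript sketch of that stability computation: point-wise minimum over the stored vector clocks, then compacting anything at or below it. The table values at the end are hypothetical, just to show the mechanics.

```ts
// Each replica stores the latest vector clock it has delivered from every
// known peer; the point-wise minimum of those clocks is the stable frontier.
type VectorClock = Record<string, number>; // replica id -> counter

function pointwiseMin(clocks: VectorClock[]): VectorClock {
  const result: VectorClock = {};
  const ids = new Set(clocks.flatMap((c) => Object.keys(c)));
  for (const id of ids) {
    result[id] = Math.min(...clocks.map((c) => c[id] ?? 0));
  }
  return result;
}

// An operation is causally stable when every entry of its clock is <= the
// corresponding entry of the frontier; such operations can be compacted.
function isStable(op: VectorClock, stableFrontier: VectorClock): boolean {
  return Object.entries(op).every(([id, n]) => n <= (stableFrontier[id] ?? 0));
}

// Example with replicas A, B and C (made-up values):
const table: Record<string, VectorClock> = {
  A: { A: 4, B: 2, C: 1 },
  B: { A: 3, B: 2, C: 1 },
  C: { A: 2, B: 2, C: 1 },
};
const frontier = pointwiseMin(Object.values(table)); // { A: 2, B: 2, C: 1 }
```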
(Found @gyounes on Github. Thank you!)
Yes, that's a very good approach to finding a snapshot point locally, IMO. I was trying to explain the same thing above but you did it much better and more clearly! Also, I recently found out that this is similar to the idea of maintaining matrix clocks locally, instead of vector clocks. With matrix clocks, the local replica maintains a view of the whole network and can locally decide when to snapshot - just as you mentioned. I was concerned with what would happen if some node left the network permanently, creating a 'snapshot deadlock'. As you mentioned, if a leaving mechanism and a node timeout are in place, then it should be fine! I recently read some papers presenting similar solutions:
Yes, it's a similar approach, with the same effect.
@pgte, so we didn't get the time to talk about how op-based CRDTs are used in IPFS. Also, are the merkle trees that you use causal?
@gyounes I'm deferring this to a specific issue I created for this purpose. :)
I was wondering why the current implementation does not require something like VCs and, from what I understand, it's because all causal information is encoded in the DAG of operations. In that case, causal stability could perhaps be computed directly on the DAG, e.g. by finding the lowest common ancestor [1] of the latest operations seen from each replica, instead of keeping vector clocks.

[1] https://en.wikipedia.org/wiki/Lowest_common_ancestor
@vitorenesduarte yes, that's a good point. The thing with vector clocks is that this computation can be done without I/O, while the other approach needs to traverse the DAG nodes. On the other hand, the latest DAG nodes can be kept in a memory cache, but it will still be less efficient than using VCs.
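For contrast, a sketch of what the DAG-based check would look like: an operation is stable once it is an ancestor of the latest operation seen from every replica, and establishing that means walking DAG nodes (the I/O cost mentioned above). The `DagNode` shape and `load` function are assumptions, not the actual log API.

```ts
interface DagNode {
  id: string;
  parents: string[]; // ids/CIDs of the operations this one depends on
}

// `load` stands in for fetching a DAG node (e.g. from a local store); this is
// where the extra I/O comes from, compared with a pure vector-clock check.
async function isAncestor(
  ancestorId: string,
  headId: string,
  load: (id: string) => Promise<DagNode>
): Promise<boolean> {
  const seen = new Set<string>();
  const stack = [headId];
  while (stack.length > 0) {
    const id = stack.pop()!;
    if (id === ancestorId) return true;
    if (seen.has(id)) continue;
    seen.add(id);
    const node = await load(id);
    stack.push(...node.parents);
  }
  return false;
}
```

An operation would be stable when `isAncestor` returns true for the latest head known from every replica in the membership.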
My personal intuition is rooted in Web systems, where lots of clients come and go. It is impossible to differentiate temporary departures from permanent departures.
@pgte & others, I'm very confused - I've spent ~9 years on CRDTs and the last 4 building GUN ( https://github.com/amark/gun ), which is the most popular CRDT library out there and one of the most popular P2P/decentralized projects in general. I am confused because: most CRDTs resolve state and compact automatically - thus why they are called commutative. Why is GC even being discussed? The only CRDTs that don't are the ones that need to emulate 0-sum or centralized logic - for instance, a counter (the most classic CRDT, admittedly, but one that can be implemented in as little as 12 lines of code) - which need history. It is because they are 0-sum/centralized/Newtonian that they need a history, not because CRDTs in general need a history. GUN's does not need history; it automatically converges state as it goes, and can already handle terabytes of daily traffic on a decentralized Reddit in a P2P mesh network.
@amark Even though there may have been some conflict-free structures and work before that, the term was officially coined in the seminal paper in 2011...

**About garbage-collection:**

As described in the seminal paper, while state-based CRDT messages are commutative, operation-based CRDTs use an operation log and require causal message delivery - hence the need for a log compaction strategy that is safe. In recent days I've had some CRDT researchers kindly explain to me and the team how to do garbage collection in operation-based CRDTs, some of them commenting in this very issue; I recommend you check it out if you want to understand why this problem matters.
The term was formally defined in 2011, I believe.
It does not look that crowded though.
Regarding the "commutative" part:
@pgte @gritzko I know, I didn't hear the term/phrase CRDT until several years later. That doesn't mean people weren't working on them prior to 2011. Obviously my bias is for state-based CRDTs; an op-based CRDT just sounds like a lot of event-sourcing / append-only log work (I abandoned this approach in 2012/2013 when I couldn't get it to scale), while the state-based approach does scale. @gritzko it isn't, but it still pushed a lot of traffic, so I know it can scale (I desperately had to fix some hiccups that they had, and am still dealing with some storage issues, but the network is pretty stable now, although traffic has declined since launch). I hope you aren't being sarcastic by dropping a wikipedia article. The fact that an operation is commutative also means its subsequent operations are commutative with the result of the previous operations - aka, the result is the compaction of the prior operations: you don't need to store the history.
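For readers following along, here is a tiny last-writer-wins register in TypeScript as an example of the state-based view being described: the merge is commutative, associative and idempotent, so only current state is kept and exchanged. This is an illustrative sketch, not GUN's actual conflict-resolution algorithm.

```ts
// A last-writer-wins register as a minimal state-based CRDT: merging two
// states yields a state, so no operation history ever needs to be stored.
interface LwwRegister<T> {
  value: T;
  timestamp: number; // e.g. a logical or hybrid clock
  replica: string;   // tie-breaker for equal timestamps
}

function merge<T>(a: LwwRegister<T>, b: LwwRegister<T>): LwwRegister<T> {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.replica >= b.replica ? a : b; // deterministic tie-break
}
```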
That's for sure. I had Causal Trees working in 2008 when B. Cohen pointed out its similarity to weave. And weave is a grey-bearded classic of revision control (1972).
I can't disagree more.
Add-to-log is a perfectly commutative op if we produce a sorted timestamped log.
@gritzko nice!!! If Bram 👍 it, then that gives high confidence it is good. I recently had the honor of meeting him after doing a lightning talk before him and got to chat with him in his office for a few hours. If you live in the Bay Area, let's meet (I just moved here, finally!). Sorting logs is ~O(N), which === my comment about it not being scalable. Could you elaborate on why you disagree?
Sorting is O(N log N). That is for the entirety of the process, assuming unsorted inputs.
Right now we have opted for δ-state CRDTs (delta-state CRDTs) to avoid having to deal with this issue.
@pgte why this persistent culture of not collaborating with other projects, but then just copy-catting them?
At one point during the summit, Feross got up and demoed video streaming/torrenting in the browser (a demo he has done for several years now) - then the next speaker got up and announced IPFS was working on a new experimental custom video streaming/torrenting library. That would be fine if IPFS / WebTorrent were not aware of each other, but we've all been talking and demoing our tech to each other since 2014. There are other examples, like I replied in the other thread with CRDTs and GUN. If you guys change your mind, let us know - we're quite open and the chatroom is very friendly.
@amark this is totally off topic and out of our control; those were community projects. If you want to contribute to the specific problem stated in this issue, please do.
Problem:
Operation-based CRDTs work by nodes storing and forwarding operations to other nodes, in an append-only, ever-growing graph of operations.
This has a few drawbacks: