
feat(replica): support multi shard commands new flow #645

Merged
merged 2 commits into main from replica_multi_shard_cmd on Jan 8, 2023

Conversation

adiholden
Collaborator

@dranikpg
I have some problem with the cancellation in this, which I will continue to investigate.

Signed-off-by: adi_holden <adi@dragonflydb.io>
@adiholden adiholden requested a review from dranikpg January 5, 2023 13:44
Comment on lines 168 to 177
// TODO: Vlad, shouldn't we call Replica::JoinAllFlows() here?
// Can it be that sync_fb_ or execution_fb_ are valid? Stop is called from server family,
// not for replica flows
if (sync_fb_.joinable()) {
sync_fb_.join();
}
if (execution_fb_.joinable()) {
execution_fb_.join();
}
}
dranikpg (Contributor)


The logic is as follows (and it's kind of bad that the replica and sub-replica are one class, but that's what we have for now).

  1. We cancel the context
  2. It unblocks
  3. The current task (i.e. full/stable sync) must do cleanup when it exits and returns an error - i.e. join the flows
  4. We wait for sync_fb_ to join on the main replica class - so when the loop has exited, the task must have done its cleanup

We can join all flows - but that shouldn't be needed here - it happens either way. Besides, we can only join if the flows are not in a transitioning state, so we would have to lock the transition mutex... which makes it all more complicated.
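A minimal sketch of that sequence, assuming only the member names that appear in the snippet above (cntx_, sync_fb_, MainReplicationFb); the surrounding structure is illustrative, not the actual Dragonfly implementation:

```cpp
// Sketch only: cntx_, sync_fb_ and MainReplicationFb come from the discussion above;
// everything else is assumed for illustration.
void Replica::Stop() {
  cntx_.Cancel();             // (1) cancel the context
                              // (2) blocking waits inside the running task unblock
                              // (3) the task (full/stable sync) sees the error, joins
                              //     its own flows and returns
  if (sync_fb_.joinable())    // (4) wait for the main replication fiber (MainReplicationFb);
    sync_fb_.join();          //     once it has joined, the task's cleanup already happened
}
```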

adiholden (Collaborator, PR author)


I see now that in the main replica, sync_fb_ is set to MainReplicationFb. I thought it was only used in the sub-replica instances, which is why I thought there was a bug here and that you wanted to join the sub-flows. Never mind.
I will remove the execution_fb_.join() from here.

dranikpg (Contributor)


You can leave a comment there that sync_fb_ is the MainReplicationFb and is responsible for stopping everything.

Comment on lines 609 to 612
// TODO: Vlad, I want to make sure we don't need cleanup here of the shared map and flow queues.
// If I understand correctly, the reason we don't have cleanup is because after we exit here the
// replica class is destroyed?

dranikpg (Contributor)


What kind of cleanup? The current cleanup for a flow done by the stable sync function is just joining it

adiholden (Collaborator, PR author)


Because the replica holds state, I would think that after each step we would have a cleanup for that step, e.g. here clearing the shared transaction map and the queues in each flow. The reason is that if the step fails and you want to recover, I would want clean data in the replica class.
Unless we are destroying the class after the step fails - as currently done in full sync, where starting it again recreates the replica flow classes.
So what is the flow if the replica connection fails while we are in stable state? The error handler is invoked, right? And after that, if we are able to connect again, today we will start the full sync step again, right? That will recreate the replica flow classes. But what if we want to continue from where we failed, if possible?
Maybe I should not think about this now and we will get to it once we need to do the partial sync.
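For illustration, a hypothetical sketch of what such a per-step cleanup could look like; none of these member names (multi_shard_exe_, tx_sync_execution, shard_flows_, trans_data_queue) are confirmed by the PR, they only stand in for "the shared transaction map and the queues in each flow":

```cpp
#include <mutex>

// Hypothetical per-step cleanup - every member name below is a placeholder,
// not Dragonfly's actual field.
void Replica::CleanupAfterStep() {
  std::lock_guard lk{multi_shard_exe_->map_mu};
  multi_shard_exe_->tx_sync_execution.clear();  // drop the shared transaction map
  for (auto& flow : shard_flows_)
    flow->trans_data_queue = {};                // reset each sub-replica flow's queue
}
```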

dranikpg (Contributor)


Yes, I would suggest thinking about this when we do partial sync. We can just trivially drop all the data we have inside the transaction map, etc...

@dranikpg
Contributor

dranikpg commented Jan 5, 2023

@adiholden

I have some problem with the cancellation in this, which I will continue to investigate.

What's the problem you currently have? Maybe I can give you some suggestions or assumptions.

Making it pass the pytests should be a good goal. If it survives the disconnect ones, then it should survive almost all real-world use cases.

@dranikpg
Contributor

dranikpg commented Jan 5, 2023

I see now, Replica::Stop doesn't make it stop

Comment on lines +585 to 587
}
}
};
Contributor

@dranikpg dranikpg Jan 5, 2023


You forgot to unblock all the wakers, that's why it's stuck (stuck on joining the execution fibers). I've added it and it seems to work.
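A rough sketch of the kind of fix being described, assuming an EventCount/condition-variable-style waker; the member names (waker_, cntx_, shard_flows_, execution_fb_) are assumptions rather than the PR's actual code:

```cpp
// Sketch only: the point is to wake every parked fiber before joining, otherwise the
// execution fibers never run their exit path and the joins below hang forever.
void Replica::StopAllFlows() {
  cntx_.Cancel();                       // signal cancellation to all flows
  waker_.notifyAll();                   // unblock every fiber blocked on the waker
  for (auto& flow : shard_flows_) {
    if (flow->execution_fb_.joinable())
      flow->execution_fb_.join();       // each execution fiber can now exit and be joined
  }
}
```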

Signed-off-by: adi_holden <adi@dragonflydb.io>
@adiholden adiholden requested a review from dranikpg January 8, 2023 09:17
@dranikpg dranikpg merged commit 77bd5f5 into main Jan 8, 2023
@romange romange deleted the replica_multi_shard_cmd branch April 12, 2023 13:48