feat: Deprovision Data Layer on delete #808
Conversation
Looks mostly good to me! Obviously this is a large-impact change. Have you tried running all four services locally? It also seems like you've got a somewhat large merge conflict to fix haha...
Provisioning { task_id: String },
Provisioned,
Deprovisioning { task_id: String },
Failed,
Is Failed sufficient to redress the situation? Failure in Provisioning is different from failure in deprovisioning. Especially since the enum indicates there's no additional data stored.
No - but at this point it isn't my goal, I will address redundancy in future work.
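To make the concern concrete, here is one possible future shape for the variant, distinguishing the two failure modes. This is purely illustrative, not the PR's code; the `FailedPhase` enum and the `during` field are hypothetical names.

```rust
// Hypothetical refinement: record which phase failed, so callers can treat
// a provisioning failure differently from a deprovisioning failure.
#[derive(Debug, Clone, PartialEq)]
pub enum FailedPhase {
    Provisioning,
    Deprovisioning,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ProvisionedState {
    Provisioning { task_id: String },
    Provisioned,
    Deprovisioning { task_id: String },
    // Carrying the phase (and later, perhaps a reason) keeps the enum useful
    // even without external context.
    Failed { during: FailedPhase },
}
```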
self.ensure_provisioned(config, task_id.clone()).await?;
return Ok(());
}
ProvisionedState::Failed => return Ok(()),
Is this return Ok() to just not do anything? I figure we'd spot something when we get an alarm when Executed or Processed block count falls to 0, or whatever our error metric evolves to.
Where should retry logic exist? I'm thinking maybe each service can have a "restart" API which coordinator can call to perform a restart with some given configuration. This way, Coordinator is only responsible for starting the process, whereas the service itself owns the business logic.
For now, it's probably fine. This means we don't start the block stream or executor, which makes sense because it won't work anyway.
Restart/retry will be tricky, we need to expose more information from the DataLayerService to know whether we can actually retry. If it's a schema error there is no point.
From the user side, they'll just see the Indexer hanging on "PROVISIONING" status, but I'm going to add some more logging so they can see the error.
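The "restart API owned by the service" idea could look roughly like the sketch below. Everything here is hypothetical (the `Restartable` trait, `DataLayer` struct, and field names are invented for illustration); the point is that Coordinator only triggers the restart, while the service decides whether a retry is even worthwhile.

```rust
// Hypothetical restart API: the service owns the retry business logic.
trait Restartable {
    // Returns whether a retry is worthwhile, e.g. not for a broken user schema.
    fn can_retry(&self) -> bool;
    fn restart(&mut self) -> Result<(), String>;
}

// Stand-in for the real DataLayerService state; fields are illustrative.
struct DataLayer {
    schema_error: bool,
    restarts: u32,
}

impl Restartable for DataLayer {
    fn can_retry(&self) -> bool {
        // If the user's schema is invalid, retrying cannot succeed.
        !self.schema_error
    }

    fn restart(&mut self) -> Result<(), String> {
        if !self.can_retry() {
            return Err("schema error: not retryable".to_string());
        }
        self.restarts += 1;
        Ok(())
    }
}
```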
anyhow::bail!("Provisioning task should have been started")
}
}

if !state.enabled {
Do you think it makes sense to break these out into small functions? The syncing logic seems complex, so it might be nice to have the main functions be understandable at a glance, with helper functions encapsulating the underlying logic. I think it would look nice to see a sync_existing_indexer function which looks like:
verify_provisioned()
check_enabled()
sync_block_stream()
sync_executor()
sync_data_layer()
or something like that. I think a lot of those are shared between each sync types (new, existing, deleted, etc.).
Yeah, we should. The whole thing is getting a bit messy tbh, I'm planning a refactor soon, I'll clean this up then.
Ah, I remember why I didn't do this. I need some of these inline to have more control over the control flow of the function, i.e. I want to short-circuit synchronisation if it isn't provisioned yet. I can't do that from another function unless I add similar code to check and return.
I'll have a deeper think about this going forward, and see how I can tidy things up.
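One way to get small helper functions without losing the short-circuit behaviour is to have the helpers return a signal the caller acts on, e.g. with `std::ops::ControlFlow`. A minimal sketch, assuming hypothetical names like `sync_existing_indexer` from the comment above:

```rust
use std::ops::ControlFlow;

// Helper returns Break when synchronisation should stop early, so the check
// lives in its own function but the caller still controls the flow.
fn verify_provisioned(provisioned: bool) -> ControlFlow<()> {
    if provisioned {
        ControlFlow::Continue(())
    } else {
        ControlFlow::Break(())
    }
}

// Simplified stand-in for the sync function; returns a label for illustration.
fn sync_existing_indexer(provisioned: bool, enabled: bool) -> &'static str {
    if verify_provisioned(provisioned).is_break() {
        // Short-circuit: not provisioned yet, skip block stream/executor sync.
        return "skipped: not provisioned";
    }
    if !enabled {
        return "skipped: disabled";
    }
    "synced"
}
```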
@@ -211,6 +206,35 @@ impl<'a> Synchroniser<'a> {
    Ok(())
}

async fn ensure_provisioned(
How is this function related to checking the state of the indexer itself? It seems this function is focused on ensuring a related provisioning task exists. If this function also checked state, we could ignore context, right? When I look at the function as it is, I wonder whether it runs before or after a state check. Then again, separating logic is good too. I think it would be good to establish an order of precedence between a data layer task and the corresponding indexer state value, and keep the checks collected under one function, even if it's like below. What do you think?
check_data_layer_state() (returns if fine)
check_data_layer_task_status() (calls if above requires checking of task)
Sorry, I'm not quite sure I follow. Can you please explain again?
The purpose of this function is to poll the pending Data Layer Provisioning task, and update the Indexer State when complete (or failed). At which point we will move to the next "phase" of the Indexer, starting block streams/executors.
I think part of the confusion here comes from the shared control loop, which I outline here: #811
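The polling described above can be sketched as a small state transition. The types here are illustrative stand-ins, not the repo's actual definitions: a pending task keeps the indexer in its current phase, completion advances it, and failure records it.

```rust
// Illustrative types: the real repo's TaskStatus/state enums differ.
#[derive(Debug, PartialEq)]
enum TaskStatus {
    Pending,
    Complete,
    Failed,
}

#[derive(Debug, PartialEq)]
enum Phase {
    AwaitingProvisioning,
    Running,
    Failed,
}

// Poll result -> next phase: only a completed task moves the indexer on to
// starting block streams/executors.
fn next_phase(status: TaskStatus) -> Phase {
    match status {
        TaskStatus::Pending => Phase::AwaitingProvisioning,
        TaskStatus::Complete => Phase::Running,
        TaskStatus::Failed => Phase::Failed,
    }
}
```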
tracing::info!("Data layer deprovisioning complete");
}
TaskStatus::Failed => {
tracing::info!("Data layer deprovisioning failed");
This is an error log right? Would this be repeated each loop? Perhaps we can simply let the error log be the single error log instance, and return Ok() here?
It's not quite an error log, provisioning may fail because the schema is messed up, which isn't necessarily a system error. Maybe a warn? 🤔
.mockResolvedValueOnce(null) // drop schema
.mockResolvedValueOnce(null) // unschedule create partition job
.mockResolvedValueOnce(null) // unschedule delete partition job
.mockResolvedValueOnce({ rows: [{ schema_name: 'another_one' }] }); // list schemas
Considering it's possible for a cron job to silently fail sometimes, I think we should also verify the cron jobs are indeed deleted before fully accepting a success. A failing cron job doesn't harm the DB, but it does add noise to the cron job status and history table, which would make debugging harder if we ever need to do so.
Good point. I feel it makes more sense to assert this behaviour in tests, rather than creating additional code for it? I'll update the integration tests to ensure things are being removed.
Wait, what do you mean cron jobs silently fail? Removal silently fails?
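The verification being discussed could look something like the sketch below: after unscheduling, list the jobs still registered (pg_cron keeps these in its `cron.job` catalog) and confirm none reference the removed schema. The row shape and the assumption that job names embed the schema name are both hypothetical, not the repo's actual code.

```typescript
// Hypothetical row shape for a "list remaining cron jobs" query result.
interface CronJobRow {
  jobname: string;
}

// Returns true only if no remaining job still references the removed schema.
// Assumes (hypothetically) that job names embed the schema name.
function deprovisionComplete(remainingJobs: CronJobRow[], schemaName: string): boolean {
  return !remainingJobs.some((row) => row.jobname.includes(schemaName));
}
```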
@darunrs merging now, but let's keep discussing and I can make updates in future PRs
This PR removes Data Layer resources on Indexer Delete. To achieve this, the following has been added:
Provisioner.deprovision()
method which removes: schema, cron jobs, and if necessary, Hasura source, database, and roleDataLayerService.StartDeprovisioningTask
gRPC methodIn addition to the above, I've slightly refactored DataLayerService to make the addition of de-provisioning more accomodating:
StartDeprovisioningTask
andStartProvisioningTask
now return opaque IDs rather than usingaccountId
/functionName
- to avoid conflicts with eachotherGetTaskStatus
method which is used for both, before it was specific to provisioningAs mentioned in #805, the Coordinator implementation is a little awkward due to the shared/non-blocking Control Loop. I'll look to refactor this later and hopefully improve on this.