feat: Deprovision Data Layer on delete #808
Conversation
Looks mostly good to me! Obviously this is a large-impact change. Have you tried running all four services locally? It also seems like you've got a somewhat large merge conflict to fix haha...
Provisioning { task_id: String },
Provisioned,
Deprovisioning { task_id: String },
Failed,
Is Failed sufficient to redress the situation? Failure in Provisioning is different from failure in deprovisioning. Especially since the enum indicates there's no additional data stored.
No - but at this point it isn't my goal, I will address redundancy in future work.
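To make the concern concrete, here is one possible future shape for the variant, distinguishing the two failure modes. This is purely illustrative, not the PR's code; the `FailedPhase` enum and the `during` field are hypothetical names.

```rust
// Hypothetical refinement: record which phase failed, so callers can treat
// a provisioning failure differently from a deprovisioning failure.
#[derive(Debug, Clone, PartialEq)]
pub enum FailedPhase {
    Provisioning,
    Deprovisioning,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ProvisionedState {
    Provisioning { task_id: String },
    Provisioned,
    Deprovisioning { task_id: String },
    // Carrying the phase (and later, perhaps a reason) keeps the enum useful
    // even without external context.
    Failed { during: FailedPhase },
}
```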
self.ensure_provisioned(config, task_id.clone()).await?;
return Ok(());
}
ProvisionedState::Failed => return Ok(()),
Is this return Ok() to just not do anything? I figure we'd spot something when we get an alarm when Executed or Processed block count falls to 0, or whatever our error metric evolves to.
Where should retry logic exist? I'm thinking maybe each service can have a "restart" API which coordinator can call to perform a restart with some given configuration. This way, Coordinator is only responsible for starting the process, whereas the service itself owns the business logic.
For now, it's probably fine. This means we don't start the block stream or executor, which makes sense because it won't work anyway.
Restart/retry will be tricky, we need to expose more information from the DataLayerService to know whether we can actually retry. If it's a schema error there is no point.
From the user side, they'll just see the Indexer hanging on "PROVISIONING" status, but I'm going to add some more logging so they can see the error.
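The "restart API owned by the service" idea could look roughly like the sketch below. Everything here is hypothetical (the `Restartable` trait, `DataLayer` struct, and field names are invented for illustration); the point is that Coordinator only triggers the restart, while the service decides whether a retry is even worthwhile.

```rust
// Hypothetical restart API: the service owns the retry business logic.
trait Restartable {
    // Returns whether a retry is worthwhile, e.g. not for a broken user schema.
    fn can_retry(&self) -> bool;
    fn restart(&mut self) -> Result<(), String>;
}

// Stand-in for the real DataLayerService state; fields are illustrative.
struct DataLayer {
    schema_error: bool,
    restarts: u32,
}

impl Restartable for DataLayer {
    fn can_retry(&self) -> bool {
        // If the user's schema is invalid, retrying cannot succeed.
        !self.schema_error
    }

    fn restart(&mut self) -> Result<(), String> {
        if !self.can_retry() {
            return Err("schema error: not retryable".to_string());
        }
        self.restarts += 1;
        Ok(())
    }
}
```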
anyhow::bail!("Provisioning task should have been started")
}
}

if !state.enabled {
Do you think it makes sense to break these out into small functions? The syncing logic seems complex, so it might be nice to have the main functions be understandable at a glance, with helper functions encapsulating the underlying logic. I think it would look nice to see a sync_existing_indexer function which looks like:
verify_provisioned()
check_enabled()
sync_block_stream()
sync_executor()
sync_data_layer()
or something like that. I think a lot of those are shared between each sync types (new, existing, deleted, etc.).
Yeah, we should. The whole thing is getting a bit messy tbh, I'm planning a refactor soon, I'll clean this up then.
Ah, I remember why I didn't do this. I need some of these inline to have more control over the control flow of the function, i.e. I want to short-circuit synchronisation if it isn't provisioned yet. I can't do that from another function unless I add similar code to check and return.
I'll have a deeper think about this going forward, and see how I can tidy things up.
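One way to get small helper functions without losing the short-circuit behaviour is to have the helpers return a signal the caller acts on, e.g. with `std::ops::ControlFlow`. A minimal sketch, assuming hypothetical names like `sync_existing_indexer` from the comment above:

```rust
use std::ops::ControlFlow;

// Helper returns Break when synchronisation should stop early, so the check
// lives in its own function but the caller still controls the flow.
fn verify_provisioned(provisioned: bool) -> ControlFlow<()> {
    if provisioned {
        ControlFlow::Continue(())
    } else {
        ControlFlow::Break(())
    }
}

// Simplified stand-in for the sync function; returns a label for illustration.
fn sync_existing_indexer(provisioned: bool, enabled: bool) -> &'static str {
    if verify_provisioned(provisioned).is_break() {
        // Short-circuit: not provisioned yet, skip block stream/executor sync.
        return "skipped: not provisioned";
    }
    if !enabled {
        return "skipped: disabled";
    }
    "synced"
}
```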
@@ -211,6 +206,35 @@ impl<'a> Synchroniser<'a> {
    Ok(())
}

async fn ensure_provisioned(
How is this function related to checking the state of the indexer itself? It seems this function is focused on ensuring a related provisioning task exists. If this function also checked state, we could ignore context, right? When I look at the function as it is, I wonder whether it runs before or after a state check. Then again, separating logic is good too. I think it would be good to establish an order of precedence between a data layer task and the corresponding indexer state value, and keep the checks collected under one function, even if it's like below. What do you think?
check_data_layer_state() (returns if fine)
check_data_layer_task_status() (calls if above requires checking of task)
Sorry, I'm not quite sure I follow. Can you please explain again?
The purpose of this function is to poll the pending Data Layer Provisioning task, and update the Indexer State when complete (or failed). At which point we will move to the next "phase" of the Indexer, starting block streams/executors.
I think part of the confusion here comes from the shared control loop, which I outline here: #811
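The polling described above can be sketched as a small state transition. The types here are illustrative stand-ins, not the repo's actual definitions: a pending task keeps the indexer in its current phase, completion advances it, and failure records it.

```rust
// Illustrative types: the real repo's TaskStatus/state enums differ.
#[derive(Debug, PartialEq)]
enum TaskStatus {
    Pending,
    Complete,
    Failed,
}

#[derive(Debug, PartialEq)]
enum Phase {
    AwaitingProvisioning,
    Running,
    Failed,
}

// Poll result -> next phase: only a completed task moves the indexer on to
// starting block streams/executors.
fn next_phase(status: TaskStatus) -> Phase {
    match status {
        TaskStatus::Pending => Phase::AwaitingProvisioning,
        TaskStatus::Complete => Phase::Running,
        TaskStatus::Failed => Phase::Failed,
    }
}
```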
tracing::info!("Data layer deprovisioning complete");
}
TaskStatus::Failed => {
tracing::info!("Data layer deprovisioning failed");
This is an error log right? Would this be repeated each loop? Perhaps we can simply let the error log be the single error log instance, and return Ok() here?
It's not quite an error log, provisioning may fail because the schema is messed up, which isn't necessarily a system error. Maybe a warn? 🤔
.mockResolvedValueOnce(null) // drop schema
.mockResolvedValueOnce(null) // unschedule create partition job
.mockResolvedValueOnce(null) // unschedule delete partition job
.mockResolvedValueOnce({ rows: [{ schema_name: 'another_one' }] }); // list schemas
Considering it's possible for a cron job to silently fail sometimes, I think we should also verify the cron jobs are indeed deleted before fully accepting a success. A failing cron job doesn't harm the DB, but it does add noise to the cron job status and history table, which would make debugging harder if we ever need to do so.
Good point. I feel it makes more sense to assert this behaviour in tests, rather than creating additional code for it? I'll update the integration tests to ensure things are being removed.
Wait, what do you mean cron jobs silently fail? Removal silently fails?
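The verification being discussed could look something like the sketch below: after unscheduling, list the jobs still registered (pg_cron keeps these in its `cron.job` catalog) and confirm none reference the removed schema. The row shape and the assumption that job names embed the schema name are both hypothetical, not the repo's actual code.

```typescript
// Hypothetical row shape for a "list remaining cron jobs" query result.
interface CronJobRow {
  jobname: string;
}

// Returns true only if no remaining job still references the removed schema.
// Assumes (hypothetically) that job names embed the schema name.
function deprovisionComplete(remainingJobs: CronJobRow[], schemaName: string): boolean {
  return !remainingJobs.some((row) => row.jobname.includes(schemaName));
}
```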
@darunrs merging now, but let's keep discussing and I can make updates in future PRs
This PR removes Data Layer resources on Indexer Delete. To achieve this, the following has been added:
Provisioner.deprovision()
method which removes: schema, cron jobs, and if necessary, Hasura source, database, and roleDataLayerService.StartDeprovisioningTask
gRPC methodIn addition to the above, I've slightly refactored DataLayerService to make the addition of de-provisioning more accomodating:
StartDeprovisioningTask
andStartProvisioningTask
now return opaque IDs rather than usingaccountId
/functionName
- to avoid conflicts with eachotherGetTaskStatus
method which is used for both, before it was specific to provisioningAs mentioned in #805, the Coordinator implementation is a little awkward due to the shared/non-blocking Control Loop. I'll look to refactor this later and hopefully improve on this.