feat: Deprovision Data Layer on delete #808

Merged: 26 commits into main on Jun 20, 2024

Conversation

@morgsmccauley (Collaborator) commented on Jun 18, 2024

This PR removes Data Layer resources on Indexer Delete. To achieve this, the following has been added:

  • A Provisioner.deprovision() method, which removes the schema, cron jobs, and, if necessary, the Hasura source, database, and role
  • A DataLayerService.StartDeprovisioningTask gRPC method
  • A call to the above from Coordinator within the delete lifecycle hook

In addition to the above, I've slightly refactored DataLayerService to better accommodate the addition of de-provisioning:

  • StartDeprovisioningTask and StartProvisioningTask now return opaque IDs rather than being keyed on accountId/functionName, to avoid conflicts with each other
  • There is a single GetTaskStatus method used for both; previously it was specific to provisioning

As mentioned in #805, the Coordinator implementation is a little awkward due to the shared/non-blocking Control Loop. I'll look to refactor and improve this later.
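
To make the flow above more concrete, here is a minimal sketch of the Coordinator-side delete path, assuming hypothetical stand-in types (DataLayerClient, TaskStatus, etc.); these are illustrative names, not the actual generated gRPC client or Coordinator code:

// Hypothetical sketch only; names are illustrative, not the real Coordinator types.
use anyhow::Result;

enum TaskStatus {
    Pending,
    Complete,
    Failed,
}

// Mirrors the provisioned-state variants discussed in the review below.
enum ProvisionedState {
    Provisioning { task_id: String },
    Provisioned,
    Deprovisioning { task_id: String },
    Failed,
}

struct DataLayerClient;

impl DataLayerClient {
    /// Starts deprovisioning and returns an opaque task ID.
    async fn start_deprovisioning_task(&self, _account_id: &str, _function_name: &str) -> Result<String> {
        unimplemented!("stand-in for the StartDeprovisioningTask gRPC call")
    }

    /// Single status endpoint shared by provisioning and deprovisioning tasks.
    async fn get_task_status(&self, _task_id: &str) -> Result<TaskStatus> {
        unimplemented!("stand-in for the GetTaskStatus gRPC call")
    }
}

/// Delete lifecycle hook: kick off deprovisioning and record the opaque task ID
/// so the shared control loop can poll it on later iterations.
async fn on_delete(client: &DataLayerClient, account_id: &str, function_name: &str) -> Result<ProvisionedState> {
    let task_id = client.start_deprovisioning_task(account_id, function_name).await?;
    Ok(ProvisionedState::Deprovisioning { task_id })
}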

@morgsmccauley morgsmccauley linked an issue Jun 18, 2024 that may be closed by this pull request
@morgsmccauley morgsmccauley changed the base branch from main to feat/handle-provisioning-in-coordinator June 18, 2024 09:33
@morgsmccauley morgsmccauley marked this pull request as ready for review June 18, 2024 20:35
@morgsmccauley morgsmccauley requested a review from a team as a code owner June 18, 2024 20:35
Base automatically changed from feat/handle-provisioning-in-coordinator to main June 19, 2024 20:35

@darunrs (Collaborator) left a comment


Looks mostly good to me! Obviously this is a high-impact change. Have you tested running all four services together locally? It also seems like you've got a somewhat large merge conflict to fix haha...

Provisioning { task_id: String },
Provisioned,
Deprovisioning { task_id: String },
Failed,

darunrs (Collaborator):

Is a single Failed variant sufficient to redress the situation? A failure during provisioning is different from a failure during deprovisioning, especially since the variant stores no additional data.

morgsmccauley (Author):

No, but that isn't my goal at this point; I'll address this in future work.
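
Purely to illustrate the reviewer's point (this is not what the PR implements), the failure phase and underlying error could be captured in the state itself:

// Hypothetical variation, not part of this PR: keep provisioning and
// deprovisioning failures distinguishable and attach the underlying error.
enum ProvisionedState {
    Provisioning { task_id: String },
    Provisioned,
    Deprovisioning { task_id: String },
    ProvisioningFailed { error: String },
    DeprovisioningFailed { error: String },
}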

coordinator/src/indexer_state.rs: three outdated review threads, resolved
self.ensure_provisioned(config, task_id.clone()).await?;
return Ok(());
}
ProvisionedState::Failed => return Ok(()),

darunrs (Collaborator):

Is this return Ok(()) just to do nothing? I figure we'd spot a problem when an alarm fires because the Executed or Processed block count falls to 0, or whatever our error metric evolves into.
Where should retry logic live? I'm thinking maybe each service could have a "restart" API which Coordinator can call to perform a restart with some given configuration. That way, Coordinator is only responsible for starting the process, while the service itself owns the business logic.

morgsmccauley (Author):

For now, it's probably fine. This means we don't start the block stream or executor, which makes sense because they wouldn't work anyway.

Restart/retry will be tricky; we'd need to expose more information from the DataLayerService to know whether we actually can retry. If it's a schema error, there's no point.

From the user's side, they'll just see the Indexer hanging on "PROVISIONING" status, but I'm going to add some more logging so they can see the error.
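
As a hypothetical illustration of "exposing more information" (not part of this PR), the task status could distinguish user-caused failures from transient ones, so the Coordinator could decide whether a retry is worthwhile:

// Hypothetical extension, not implemented in this PR.
enum FailureKind {
    // e.g. an invalid user-defined schema; retrying will never help
    UserSchema,
    // e.g. a transient database or Hasura outage; a retry may succeed
    Transient,
}

enum TaskStatus {
    Pending,
    Complete,
    Failed { kind: FailureKind },
}

fn should_retry(status: &TaskStatus) -> bool {
    matches!(
        status,
        TaskStatus::Failed {
            kind: FailureKind::Transient
        }
    )
}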

anyhow::bail!("Provisioning task should have been started")
}
}

if !state.enabled {

darunrs (Collaborator):

Do you think it makes sense to break these out into small functions? The syncing logic seems complex, so it might be nice to have the main functions understandable at a glance, with helper functions encapsulating the underlying logic. I think it would look nice to see a sync_existing_indexer function which looks like:

verify_provisioned()
check_enabled()
sync_block_stream()
sync_executor()
sync_data_layer()

or something like that. I think a lot of those helpers would be shared between the different sync types (new, existing, deleted, etc.).

morgsmccauley (Author):

Yeah, we should. The whole thing is getting a bit messy tbh; I'm planning a refactor soon and I'll clean this up then.

morgsmccauley (Author):

Ah, I remember why I didn't do this. I need some of these checks inline to have more control over the flow of the function, i.e. I want to short-circuit synchronisation if the data layer isn't provisioned yet. I can't do that from another function unless I add similar check-and-return code at the call site.

I'll have a deeper think about this going forward, and see how I can tidy things up.
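
One possible way to reconcile both points (small helpers, but a caller that can still short-circuit) is to have the provisioning check return whether synchronisation should proceed. This is an illustrative sketch with made-up names, not the actual Synchroniser code:

// Illustrative sketch only; not the real Synchroniser implementation.
use anyhow::Result;

enum ProvisionedState {
    Provisioning { task_id: String },
    Provisioned,
    Deprovisioning { task_id: String },
    Failed,
}

/// Returns true only when block stream/executor synchronisation should proceed.
fn verify_provisioned(state: &ProvisionedState) -> bool {
    matches!(state, ProvisionedState::Provisioned)
}

async fn sync_existing_indexer(state: &ProvisionedState, enabled: bool) -> Result<()> {
    // Short-circuit: nothing else to do until the data layer is provisioned.
    if !verify_provisioned(state) {
        return Ok(());
    }

    if !enabled {
        return Ok(());
    }

    sync_block_stream().await?;
    sync_executor().await?;

    Ok(())
}

// Stubs standing in for the real sync steps.
async fn sync_block_stream() -> Result<()> {
    Ok(())
}

async fn sync_executor() -> Result<()> {
    Ok(())
}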

@@ -211,6 +206,35 @@ impl<'a> Synchroniser<'a> {
Ok(())
}

async fn ensure_provisioned(

darunrs (Collaborator):

How is this function related to checking the state of the indexer itself? It seems this function is focused on ensuring a related provisioning task exists. If it also checked state, we could ignore the calling context, right? Looking at the function as it is, I wonder whether it runs before or after a state check. Then again, separating logic is good too. I think it would be good to establish an order of precedence between a data layer task and the corresponding indexer state value, and keep the checks collected under one function, even if it's something like the below. What do you think?

check_data_layer_state() (returns early if everything is fine)
check_data_layer_task_status() (called only if the above requires checking the task)

morgsmccauley (Author):

Sorry, I'm not quite sure I follow. Can you please explain again?

The purpose of this function is to poll the pending Data Layer provisioning task and update the Indexer state when it completes (or fails), at which point we move to the next "phase" of the Indexer and start the block streams/executors.

I think part of the confusion here comes from the shared control loop, which I outline here: #811
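
Based on that explanation, the polling step might look roughly like the following sketch. The client and state types are hypothetical stand-ins, not the actual Coordinator code:

// Rough sketch only; types are hypothetical stand-ins.
use anyhow::Result;

enum TaskStatus {
    Pending,
    Complete,
    Failed,
}

enum ProvisionedState {
    Provisioning { task_id: String },
    Provisioned,
    Deprovisioning { task_id: String },
    Failed,
}

struct DataLayerClient;

impl DataLayerClient {
    async fn get_task_status(&self, _task_id: &str) -> Result<TaskStatus> {
        unimplemented!("stand-in for the GetTaskStatus gRPC call")
    }
}

/// Poll the pending provisioning task and update the stored Indexer state when it
/// completes (or fails), so the next control-loop pass can start the block stream
/// and executor (or skip them).
async fn ensure_provisioned(
    data_layer: &DataLayerClient,
    state: &mut ProvisionedState,
    task_id: String,
) -> Result<()> {
    match data_layer.get_task_status(&task_id).await? {
        TaskStatus::Pending => {} // still in progress; check again on the next loop
        TaskStatus::Complete => *state = ProvisionedState::Provisioned,
        TaskStatus::Failed => *state = ProvisionedState::Failed,
    }
    Ok(())
}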

tracing::info!("Data layer deprovisioning complete");
}
TaskStatus::Failed => {
tracing::info!("Data layer deprovisioning failed");

darunrs (Collaborator):

This is an error log, right? Would it be repeated on each loop? Perhaps we could let this be the single error log instance and return Ok(()) here?

morgsmccauley (Author):

It's not quite an error log; provisioning may fail because their schema is messed up, which isn't necessarily a system error. Maybe a warn? 🤔
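
For illustration, a warning-level event using the same tracing macros as the snippet above could look like this; the helper and its TaskStatus type are hypothetical, not part of the PR:

// Hypothetical helper, not part of the PR.
enum TaskStatus {
    Pending,
    Complete,
    Failed,
}

fn log_deprovisioning_outcome(status: &TaskStatus) {
    match status {
        TaskStatus::Pending => {} // nothing to report yet
        TaskStatus::Complete => tracing::info!("Data layer deprovisioning complete"),
        // A failure may stem from the user's schema rather than a system fault,
        // so `warn` (rather than `error`) keeps alerting noise down.
        TaskStatus::Failed => tracing::warn!("Data layer deprovisioning failed"),
    }
}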

runner-client/examples/check_provisioning_task_status.rs: outdated review thread, resolved
.mockResolvedValueOnce(null) // drop schema
.mockResolvedValueOnce(null) // unschedule create partition job
.mockResolvedValueOnce(null) // unschedule delete partition job
.mockResolvedValueOnce({ rows: [{ schema_name: 'another_one' }] }); // list schemas

darunrs (Collaborator):

Considering it's possible for a cron job to silently fail, I think we should also verify that the cron jobs are indeed deleted before fully accepting a success. A failing cron job doesn't harm the DB, but it does add noise to the cron job status and history tables, which would make debugging harder if we ever need to do it.

morgsmccauley (Author):

Good point. I feel it makes more sense to assert this behaviour in tests rather than adding extra runtime code for it? I'll update the integration tests to ensure things are actually being removed.

morgsmccauley (Author):

Wait, what do you mean cron jobs silently fail? Removal silently fails?

@morgsmccauley (Author):

@darunrs merging now, but let's keep discussing and I can make updates in future PRs

@morgsmccauley morgsmccauley merged commit 64d1ebb into main Jun 20, 2024
7 checks passed
@morgsmccauley morgsmccauley deleted the feat/deprovisioning branch June 20, 2024 08:32

Successfully merging this pull request may close these issues: De-provision resources on Indexer deletion

2 participants