Surface provisioning failures to user #901

Closed
morgsmccauley opened this issue Jul 19, 2024 · 0 comments · Fixed by #1002
@morgsmccauley
Collaborator

With the move to user-owned logs tables, we removed the provisioning failure logs - if we fail to provision, it may not be possible to log the failure. We need to surface this information somehow, otherwise developers won't know what's gone wrong.

We have two options:

  1. Re-introduce a shared/permanent logs table, specifically for provisioning failures
  2. Log failures to the user-owned table

I'm leaning towards 2, but it is tricky given that provisioning of the logs table itself may fail. However, most failures happen while provisioning user-related resources, so we have some confidence that the logs table will be provisioned unless something is seriously wrong. With the correct alerts and testing (which we have) in place, we should be able to catch real provisioning failures through our own channels, and surface user-related failures via their logs table.

morgsmccauley added a commit that referenced this issue Aug 9, 2024
This PR updates provisioning such that user-related errors are written
to the user-owned logs table, allowing them to debug issues with their
schema.

Logging to the user table is tricky, since this table is created
_during_ the provisioning step itself. Therefore, I have split
provisioning into two phases:
1. System Resources - Sets up system-related entities: database,
schema, logs table/jobs etc.
2. User Resources - Applies user schema, configures Hasura etc.

This separation allows us to isolate the tasks which are likely to fail
due to user error, and therefore only surface errors which are relevant.

The creation of the logs table _should always succeed_. If it doesn't,
there is something wrong with the system, i.e. some form of bug has been
introduced. Errors thrown during the System portion of provisioning will
be logged as errors on the machine, and I will tune the existing alert so
that we are notified of these errors.
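
As a rough illustration of the two-phase split described above, here is a minimal TypeScript sketch. It is not the code from the PR: every name in it (`provision`, `ProvisioningSteps`, `SystemLogger`, `UserLogsTable`, the result statuses) is hypothetical, and the warning emitted on a user failure is an assumption about how non-critical failures might be recorded. The idea it shows is only the routing: a System phase failure goes to the machine logs at error level, while a User phase failure is written to the user-owned logs table, which exists by that point.

```typescript
// Hypothetical sketch only: these names do not come from the actual codebase.
type LogLevel = 'error' | 'warn';

// Machine-level logger, assumed to feed the existing alerting.
interface SystemLogger {
  log(level: LogLevel, message: string): void;
}

// Handle to the user-owned logs table created during the System phase.
interface UserLogsTable {
  write(message: string): Promise<void>;
}

interface ProvisioningSteps {
  // Phase 1: database, schema, logs table/jobs etc.
  provisionSystemResources(): Promise<UserLogsTable>;
  // Phase 2: apply user schema, configure Hasura etc.
  provisionUserResources(): Promise<void>;
}

type ProvisioningResult =
  | { status: 'ok' }
  | { status: 'system_failure'; error: Error }
  | { status: 'user_failure'; error: Error };

function toError(e: unknown): Error {
  return e instanceof Error ? e : new Error(String(e));
}

async function provision(
  steps: ProvisioningSteps,
  systemLogger: SystemLogger
): Promise<ProvisioningResult> {
  let userLogs: UserLogsTable;

  try {
    userLogs = await steps.provisionSystemResources();
  } catch (e) {
    // The logs table may not exist here, so only the machine log is available.
    const error = toError(e);
    systemLogger.log('error', `System provisioning failed: ${error.message}`);
    return { status: 'system_failure', error };
  }

  try {
    await steps.provisionUserResources();
  } catch (e) {
    // The logs table exists at this point, so the failure can be shown to the user.
    const error = toError(e);
    await userLogs.write(`Provisioning failed: ${error.message}`);
    // Assumption: user errors are non-critical for operators, so warn rather than error.
    systemLogger.log('warn', `User provisioning failed: ${error.message}`);
    return { status: 'user_failure', error };
  }

  return { status: 'ok' };
}
```

Returning a tagged result rather than rethrowing is a design choice of this sketch; it lets a caller distinguish the two failure kinds without inspecting error messages.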

Additionally, I have converted all non-critical error logs to warnings,
so that we don't get alerted on non-issues.

closes: #901
@morgsmccauley morgsmccauley self-assigned this Aug 9, 2024