Thoughts on architecture #2
Adria, it is a pleasure to read such a well thought out response to the high-level requirements, with a clear thought process about the best architecture for our goals. I'm actually leaning more now towards running jobs on Celery rather than GitLab, because I don't want to find that we paint ourselves into a corner where we require git-compatible clients. There is one major downside though: without git, we don't have a map of the file system in our directory that a job runs against, and so clients might not be able to use pattern matching in goodtables.yml; rather, we might require explicit file lists.
@pwalsh It would be up to the integrations/plugins to leverage whatever their sources allow. If they are git based (eg GitHub) they could clone the repo (ideally somewhere that is also accessible to the job runners so the files are already downloaded for later use), do the pattern matching thing and pass the list with absolute paths/URLs to the runner. Or alternatively they could use the GitHub tree API to browse the repo, calculate the files that need to be validated without cloning and pass that to the runner so it downloads them (probably more complex but more efficient as you don't need to clone the whole repo). Other plugins can do similar preprocessing if needed before calling the main API.
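For illustration, the pattern-expansion step an integration could do after cloning might be as small as the sketch below. The "files" key and the config shape are assumptions for the example, not the actual goodtables.yml schema.

```python
# Sketch only: expand the patterns from a cloned repo's .goodtables.yml
# into the explicit list of absolute paths handed to the job runner.
# The "files" key is an assumed config field, not the real schema.
from pathlib import Path

import yaml  # PyYAML


def files_to_validate(repo_dir):
    config = yaml.safe_load((Path(repo_dir) / ".goodtables.yml").read_text())
    paths = []
    for pattern in config.get("files", ["**/*.csv"]):
        paths.extend(str(p.resolve()) for p in Path(repo_dir).glob(pattern))
    return paths
```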
@amercader noted. Let's do it. Celery? Other options? |
Very interesting analysis, @amercader. Now that I understand it better, I can see why you thought about a data pipeline framework to implement this. It does feel like Airflow or Luigi are good fits for handling the scheduling and running the jobs for this 👍 |
@amercader |
@roll Yes, I suppose so: triggering jobs in response to calls from the integrations and returning their status. It would also be worth considering paid plans (quotas, different tiers of priority for jobs, etc.) and how they fit into the architecture.
So, we are all agreed. Can you take these ideas and modify the requirements as currently specified in the README to reflect these changes, and any other relevant changes, compared to what I wrote originally?
@amercader and such changes, of course, can be a pull request that closes this issue ;) |
Resolved in #9 |
@roll @pwalsh Apologies for the long post and what might be seen as obvious stuff. This is mainly me thinking out loud with the help of some pictures, but would appreciate your input.
I'm trying to think of the simplest architecture for the data validation service and working up from there based on the requirements and user stories.
At the most basic level we want to run a validation job:
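In code, the core of such a job could be little more than a call to the goodtables library, something along these lines (the file path is just a placeholder):

```python
# Minimal sketch of a single validation job using the goodtables library;
# "data/invalid.csv" is a placeholder file.
from goodtables import validate

report = validate("data/invalid.csv")
print(report["valid"])  # overall result; the full report lists each error
```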
Let's expand on this for a bit. Although we might have a default behaviour with no defined configuration, we will probably want to customize the validation with configuration options. These could control things like what files to validate in case there are many of them, whether there is a schema, what tests to run, etc.
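To make "configuration options" concrete, the job could merge a per-repo config over service defaults. Every key name below is invented for illustration and is not the actual .goodtables.yml schema.

```python
# Hypothetical defaults a job could fall back to when .goodtables.yml is
# missing or partial; all key names here are illustrative.
DEFAULT_SETTINGS = {
    "files": ["**/*.csv"],               # which files to validate
    "schema": None,                      # optional Table Schema to check against
    "checks": ["structure", "schema"],   # which groups of checks to run
}


def job_settings(user_config):
    """Merge the repo's parsed .goodtables.yml over the defaults."""
    settings = dict(DEFAULT_SETTINGS)
    settings.update(user_config or {})
    return settings
```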
Note that I'm running validation on multiple related files (eg same GitHub repo) in the same job just because it seems easier than running them separately and combining the outputs later, but this is not a hard requirement.
These jobs will be started from a pool or queue of pending ones. This should be able to handle parallelization, scaling of resources, etc.
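Jumping ahead to the technology question for a moment, here is roughly what the queue/worker side could look like if we went with Celery. The broker choice, task name and report handling are made up for the sketch.

```python
# Sketch of the queue/worker side, assuming Celery with a Redis broker.
from celery import Celery
from goodtables import validate  # the validation library itself

app = Celery("validation",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")


@app.task
def validate_files(files):
    """One pending job: validate each file and collect the reports."""
    return {f: validate(f) for f in files}

# Parallelization and scaling then come down to running more workers, eg:
#   celery -A tasks worker --concurrency=4
```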
On top of that there is an API that gets requests for validation, starts jobs and returns status and outputs (and handles authorization, of course). Even higher than that there will be the web UI for integration and reporting, but I'm less concerned about that now, as it could be just a client on top of the API.
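The API layer itself could then stay very thin. Here is a sketch using Flask purely as an example; the endpoint paths, payload fields and the authorization check are all invented.

```python
# Sketch of a thin HTTP API over the queue, using Flask only as an example.
# Endpoint paths, payload fields and check_token() are hypothetical.
from flask import Flask, jsonify, request

from tasks import validate_files  # the Celery task sketched above

api = Flask(__name__)


@api.route("/jobs", methods=["POST"])
def create_job():
    # check_token(request)  <- authorization would happen here
    payload = request.get_json()
    task = validate_files.delay(payload["files"])
    return jsonify({"job_id": task.id}), 201


@api.route("/jobs/<job_id>")
def job_status(job_id):
    result = validate_files.AsyncResult(job_id)
    body = {"job_id": job_id, "status": result.status}
    if result.successful():
        body["report"] = result.result
    return jsonify(body)
```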
I know we've been focusing on GitHub and I agree that's the most valuable integration to begin with, but I think having this generic API on top would allow much more flexibility with other third-party integrations. GitHub would be just another "integration" or plugin on top of the API, which would handle the specifics of each source, eg:
- The GitHub integration would pass the .goodtables.yml file and a list of files to validate (and maybe the necessary credentials if it's a private repo?)
- A bucket-based integration (eg S3) would pass the .goodtables.yml file and a list of files to validate (and maybe the necessary credentials if it's a private bucket?)

Other sources could interact with the API directly:
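For a source with no dedicated plugin, interacting with the API directly could be as simple as posting explicit file URLs and polling for the report. The host, paths and field names below just mirror the hypothetical API sketched above.

```python
# Hypothetical direct use of the API by any other source: post explicit
# file URLs, then poll for the result. Host, paths and field names only
# mirror the illustrative API above, not a real service.
import requests

API = "https://validation.example.org"

resp = requests.post(f"{API}/jobs", json={
    "files": ["https://example.org/data/budget.csv"],
})
job_id = resp.json()["job_id"]

status = requests.get(f"{API}/jobs/{job_id}").json()
print(status["status"])
```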
I haven't given much thought to the actual tech involved in this, but I'm definitely thinking that GitLab CI might not be the best fit. It does give us a lot in terms of CI infrastructure (runners, etc.), but at the expense of having to fit each of the different sources into a git-based workflow, and of having to design any higher-level API within the constraints of what GitLab offers. Perhaps something simpler based on a queue and job runners gives us more flexibility to start with.
How does this sound as a general approach? Can you see any major drawbacks?
If that sounds good, any thoughts on technology that could fit this scenario? Jenkins, Celery?
Would love your feedback on this.
The source for the drawings is here.