This repository has been archived by the owner on Feb 1, 2023. It is now read-only.

Thoughts on architecture #2

Closed
amercader opened this issue Oct 20, 2016 · 9 comments

@amercader (Member)

@roll @pwalsh Apologies for the long post and what might be seen as obvious stuff. This is mainly me thinking out loud with the help of some pictures, but would appreciate your input.

I'm trying to think through the simplest architecture for the data validation service and work up from there based on the requirements and user stories.

At the most basic level we want to run a validation job:

  1. We somehow got hold of a tabular data file (let's not get into how we got it for now)
  2. We run a validation command against it (based on one of the goodtables libraries)
  3. The validation command generates some kind of output

[Diagram: goodtables io architecture1]
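For reference, the core of such a job is already close to a one-liner with the goodtables Python library. A minimal sketch, assuming the library's validate() entry point (the exact API and report keys vary between goodtables versions, and the file path is made up):

```python
# Minimal sketch of the core validation step, assuming the goodtables Python
# library's validate() entry point (API and report keys vary across versions;
# the file path is hypothetical).
from goodtables import validate

report = validate('data/countries.csv')

print(report['valid'])        # overall pass/fail
print(report['error-count'])  # total number of errors found
for table in report['tables']:
    for error in table['errors']:
        print(error['row-number'], error['message'])
```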

Let's expand on this for a bit. Although we might have a default behaviour with no defined configuration, we will probably want to customize the validation with configuration options. These could control things like what files to validate in case there are many of them, whether there is a schema, what tests to run, etc.

Note that I'm running validation on multiple related files (eg same GitHub repo) in the same job just because it seems easier than running them separately and combining the outputs later, but this is not a hard requirement.

[Diagram: goodtables io architecture2]
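To make the configuration idea concrete, here is the rough shape a parsed .goodtables.yml (or an equivalent API payload) could take once an integration has translated it; all field names here are illustrative, not a spec:

```python
# Hypothetical shape of a job's configuration after parsing .goodtables.yml
# (or an equivalent API payload). Field names are illustrative only.
job_config = {
    'files': [
        # which files in the source to validate, each optionally with a schema
        {'source': 'data/countries.csv', 'schema': 'schemas/countries.json'},
        {'source': 'data/cities.csv'},  # no schema: structural checks only
    ],
    'settings': {
        'error_limit': 1000,    # stop after this many errors
        'row_limit': 10000,     # only inspect the first N rows
        'checks': 'structure',  # or 'schema', or an explicit list of checks
    },
}
```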

These jobs will be started from a pool or queue of pending ones, which should be able to handle parallelization, scaling of resources, etc.

[Diagram: goodtables io architecture3]
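If we went with a task queue such as Celery (one of the options raised further down), the runner could be little more than a task wrapping the validation call. A sketch, with the broker/backend URLs, task name and result shape all made up:

```python
# Sketch of a queue-backed job runner using Celery (one option discussed
# below). Broker/backend URLs, task name and the result shape are hypothetical.
from celery import Celery
from goodtables import validate

app = Celery('goodtablesio',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task(bind=True)
def run_validation_job(self, files, settings=None):
    """Validate a list of files and return the combined reports."""
    settings = settings or {}
    reports = []
    for item in files:
        report = validate(item['source'], schema=item.get('schema'), **settings)
        reports.append({'source': item['source'], 'report': report})
    return {'job_id': self.request.id, 'reports': reports}
```

The layer above would then only need to enqueue work with run_validation_job.delay(files, settings) and read status and results back from the result backend.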

On top of that there is an API that gets requests for validation, starts jobs and returns status and outputs (and handles authorization, of course). Even higher up there will be the Web UI for integration and reporting, but I'm less concerned about that now, as it could be just a client on top of the API.

[Diagram: goodtables io architecture4]
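As an illustration of how thin that API layer could be, here is a two-endpoint sketch (Flask is used only as an example; auth, input validation and quotas are omitted, and all endpoint and field names are hypothetical):

```python
# Rough sketch of the thin API layer on top of the queue: create a job, query
# a job. Flask is illustrative only; auth, input validation and quotas are
# omitted and endpoint/field names are hypothetical.
from flask import Flask, jsonify, request

from tasks import run_validation_job  # the Celery task sketched above (hypothetical module)

app = Flask(__name__)

@app.route('/api/jobs', methods=['POST'])
def create_job():
    payload = request.get_json()
    task = run_validation_job.delay(payload['files'], payload.get('settings'))
    return jsonify({'job_id': task.id, 'status': 'pending'}), 201

@app.route('/api/jobs/<job_id>', methods=['GET'])
def get_job(job_id):
    result = run_validation_job.AsyncResult(job_id)
    body = {'job_id': job_id, 'status': result.status.lower()}
    if result.successful():
        body['reports'] = result.result['reports']
    return jsonify(body)
```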

I know we've been focusing on GitHub and I agree that's the most valuable integration to begin with, but having this generic API on top would allow much more flexibility with other third-party integrations. GitHub would be just another "integration" or plugin on top of the API, with each integration handling the specifics of its source, eg:

  • GitHub: respond to GitHub's webhooks (preconfigured by the user by "switching on" their repo on the GT.io web UI), trigger a new job if necessary, passing the config options translated from the .goodtables.yml file and a list of files to validate (and maybe the necessary credentials if it's a private repo?) (see the sketch after the diagram below)
  • S3: Regardless of how we learn about changes (either we get notified, though I'm not sure how easy that will be to set up, or we poll the bucket regularly), we also pass the config options translated from the .goodtables.yml file and a list of files to validate (and maybe the necessary credentials if it's a private bucket?)

Other sources could interact with the API directly:

  • CKAN: Whenever a new tabular file gets uploaded or linked, an extension pings the API directly with a link to the file and some config options
  • Any third party app with files available online: whenever they need to validate a file, they ping the API with a link to their file(s) and config options (or we create an integration that hooks into whatever notification system they have)
  • Any third-party desktop app (Comma Chameleon, Excel, a gt.io CLI client, ...): whenever the user wants to validate a file, the app POSTs it to the API, along with a schema and config options if needed.

[Diagram: goodtables io architecture 4]
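To make the GitHub case above concrete, the integration could be a small adapter sitting between the webhook and the generic API, roughly along these lines (sketch only: webhook signature verification, private-repo credentials and error handling are omitted, and the config translation and endpoint URL are assumptions):

```python
# Sketch of the GitHub integration as an adapter in front of the generic API.
# Webhook signature verification, private-repo credentials and error handling
# are omitted; the config translation and endpoint URL are assumptions.
import requests
import yaml
from flask import Flask, jsonify, request

app = Flask(__name__)
GTIO_API = 'https://goodtables.io/api/jobs'  # hypothetical endpoint

@app.route('/hooks/github', methods=['POST'])
def github_push_hook():
    event = request.get_json()
    repo = event['repository']['full_name']
    ref = event['after']  # SHA of the pushed commit

    # Fetch and parse .goodtables.yml from the repo (public repos only here).
    raw_base = 'https://raw.githubusercontent.com/{}/{}'.format(repo, ref)
    resp = requests.get('{}/.goodtables.yml'.format(raw_base))
    config = yaml.safe_load(resp.text) if resp.ok else {}

    # Translate the config into the file list the runner expects.
    files = []
    for item in config.get('files', []):
        files.append({
            'source': '{}/{}'.format(raw_base, item['source']),
            'schema': item.get('schema'),
        })

    # Trigger a job through the generic API and relay the job id back.
    job = requests.post(GTIO_API, json={'files': files,
                                        'settings': config.get('settings', {})})
    return jsonify(job.json()), job.status_code
```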

I haven't given much thought to the actual tech involved in this, but I'm definitely thinking that GitLab CI might not be the best fit. It does give us a lot in terms of CI infrastructure (runners, etc.), but at the expense of having to fit each of the different sources into a git-based workflow, and of having to design any higher-level API within the constraints of what GitLab offers. Perhaps something simpler based on a queue and job runners gives us more flexibility to start with.

How does this sound as a general approach? Can you see any major drawbacks?

If that sounds good, any thoughts on technology that could fit this scenario? Jenkins, Celery?

Would love your feedback on this.

Drawings source is here

@pwalsh (Member) commented Oct 20, 2016

Adria, it is a pleasure to read such a well thought out response to the high-level requirements, and a clear thought process on the best architecture for our goals.

I'm actually leaning more now towards running jobs on Celery rather than GitLab, because I don't want to find that we've painted ourselves into a corner where we require git-compatible clients.

There is one major downside though: without git, we don't have a map of the file system that a job runs against, and so clients might not be able to use pattern matching in goodtables.yml; rather, we might require explicit file lists.

@amercader (Member, Author)

@pwalsh It would be up to the integrations/plugins to leverage whatever their sources allow. If they are git-based (eg GitHub) they could clone the repo (ideally somewhere that is also accessible to the job runners, so the files are already downloaded for later use), do the pattern matching there and pass the list of absolute paths/URLs to the runner. Alternatively they could use the GitHub tree API to browse the repo, work out the files that need to be validated without cloning, and pass that list to the runner so it downloads them (probably more complex, but more efficient as you don't need to clone the whole repo). Other plugins can do similar preprocessing if needed before calling the main API.
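For illustration, the tree API route could look roughly like this, pairing GitHub's git trees endpoint with simple fnmatch patterns from the config (private-repo auth, truncated trees and rate limits are ignored, and the pattern semantics are just an assumption):

```python
# Sketch of computing the list of files to validate for a GitHub repo without
# cloning it, using the git trees API plus fnmatch-style patterns from the
# config. Private-repo auth, truncated trees and rate limits are ignored.
import fnmatch
import requests

def files_to_validate(owner, repo, ref, patterns):
    url = 'https://api.github.com/repos/{}/{}/git/trees/{}?recursive=1'.format(
        owner, repo, ref)
    tree = requests.get(url).json().get('tree', [])
    paths = [item['path'] for item in tree if item['type'] == 'blob']
    matched = [path for path in paths
               if any(fnmatch.fnmatch(path, pattern) for pattern in patterns)]
    # Return raw URLs the job runner can download directly.
    return ['https://raw.githubusercontent.com/{}/{}/{}/{}'.format(
                owner, repo, ref, path) for path in matched]

# e.g. files_to_validate('datasets', 'country-list', 'master', ['data/*.csv'])
```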

@pwalsh (Member) commented Oct 20, 2016

@amercader noted. Let's do it. Celery? Other options?

@vitorbaptista (Contributor)

Very interesting analysis, @amercader. Now that I understand it better, I can see why you thought about a data pipeline framework to implement this. It does feel like Airflow or Luigi are good fits for handling the scheduling and running the jobs for this 👍

@roll (Member) commented Oct 24, 2016

@amercader
So the responsibility of the API between the user and the goodtables CLI/API will be mostly job management?

@amercader (Member, Author)

@roll Yes, I suppose so. Triggering jobs in response to calls from the integrations and returning their status.
I'm still not clear at what level auth should be managed though. It feels like it should be one level above the integrations (ie users register and log in to gt.io, maybe via GitHub, Google, etc) and then they enable the particular integrations they want.

It would also be worth considering paid plans (quotas, different tiers of priority for jobs, etc) and how they fit into the architecture.

@pwalsh (Member) commented Oct 31, 2016

@amercader

So, we are all agreed. Can you take these ideas and modify the requirements as currently specified in the README to reflect these changes, and any other relevant changes, compared to what I wrote originally?

@pwalsh (Member) commented Oct 31, 2016

@amercader and such changes, of course, can be a pull request that closes this issue ;)

amercader added commits that referenced this issue Nov 1, 2016
@pwalsh (Member) commented Nov 1, 2016

Resolved in #9
