Thoughts on architecture #2
Adria, it is a pleasure to read such a well thought out response to the high-level requirements, with a clear thought process about the best architecture for our goals. I'm actually leaning more now towards running jobs on Celery rather than GitLab, because I don't want to find that we paint ourselves into a corner where we require git-compatible clients. There is one major downside though: without git, we don't have a map of the file system in our directory that a job runs against, and so clients might not be able to use pattern matching in goodtables.yml; rather, we might require explicit file lists.
@pwalsh It would be up to the integrations/plugins to leverage whatever their sources allow. If they are git based (eg GitHub) they could clone the repo (ideally somewhere that is also accessible to the job runners so the files are already downloaded for later use), do the pattern matching thing and pass the list with absolute paths/URLs to the runner. Or alternatively they could use the GitHub tree API to browse the repo, calculate the files that need to be validated without cloning and pass that to the runner so it downloads them (probably more complex but more efficient as you don't need to clone the whole repo). Other plugins can do similar preprocessing if needed before calling the main API.
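For illustration, the pattern-expansion step an integration could do after cloning might be as small as the sketch below. The "files" key and the config shape are assumptions for the example, not the actual goodtables.yml schema.

```python
# Sketch only: expand the patterns from a cloned repo's .goodtables.yml
# into the explicit list of absolute paths handed to the job runner.
# The "files" key is an assumed config field, not the real schema.
from pathlib import Path

import yaml  # PyYAML


def files_to_validate(repo_dir):
    config = yaml.safe_load((Path(repo_dir) / ".goodtables.yml").read_text())
    paths = []
    for pattern in config.get("files", ["**/*.csv"]):
        paths.extend(str(p.resolve()) for p in Path(repo_dir).glob(pattern))
    return paths
```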
@amercader noted. Let's do it. Celery? Other options? |
Very interesting analysis, @amercader. Now that I understand it better, I can see why you thought about a data pipeline framework to implement this. It does feel like Airflow or Luigi are good fits for handling the scheduling and running the jobs for this 👍 |
@amercader |
@roll Yes, I suppose so: triggering jobs in response to calls from the integrations and returning their status. It would also be worth considering paid plans (quotas, different tiers of priority for jobs, etc.) and how they fit into the architecture.
So, we are all agreed. Can you take these ideas and modify the requirements as currently specified in the README to reflect these changes, and any other relevant changes, compared to what I wrote originally?
@amercader and such changes, of course, can be a pull request that closes this issue ;) |
Resolved in #9 |
@roll @pwalsh Apologies for the long post and what might be seen as obvious stuff. This is mainly me thinking out loud with the help of some pictures, but would appreciate your input.
I'm trying to think of the simplest architecture for the data validation service and working up from there based on the requirements and user stories.
At the most basic level we want to run a validation job:
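In code, the core of such a job could be little more than a call to the goodtables library, something along these lines (the file path is just a placeholder):

```python
# Minimal sketch of a single validation job using the goodtables library;
# "data/invalid.csv" is a placeholder file.
from goodtables import validate

report = validate("data/invalid.csv")
print(report["valid"])  # overall result; the full report lists each error
```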
Let's expand on this for a bit. Although we might have a default behaviour with no defined configuration, we will probably want to customize the validation with configuration options. These could control things like what files to validate in case there are many of them, whether there is a schema, what tests to run, etc.
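To make "configuration options" concrete, the job could merge a per-repo config over service defaults. Every key name below is invented for illustration and is not the actual .goodtables.yml schema.

```python
# Hypothetical defaults a job could fall back to when .goodtables.yml is
# missing or partial; all key names here are illustrative.
DEFAULT_SETTINGS = {
    "files": ["**/*.csv"],               # which files to validate
    "schema": None,                      # optional Table Schema to check against
    "checks": ["structure", "schema"],   # which groups of checks to run
}


def job_settings(user_config):
    """Merge the repo's parsed .goodtables.yml over the defaults."""
    settings = dict(DEFAULT_SETTINGS)
    settings.update(user_config or {})
    return settings
```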
Note that I'm running validation on multiple related files (eg same GitHub repo) in the same job just because it seems easier than running them separately and combining the outputs later, but this is not a hard requirement.
These jobs will be started from a pool or queue of pending ones. This should be able to handle parallelization, scaling of resources, etc.
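Jumping ahead to the technology question for a moment, here is roughly what the queue/worker side could look like if we went with Celery. The broker choice, task name and report handling are made up for the sketch.

```python
# Sketch of the queue/worker side, assuming Celery with a Redis broker.
from celery import Celery
from goodtables import validate  # the validation library itself

app = Celery("validation",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")


@app.task
def validate_files(files):
    """One pending job: validate each file and collect the reports."""
    return {f: validate(f) for f in files}

# Parallelization and scaling then come down to running more workers, eg:
#   celery -A tasks worker --concurrency=4
```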
On top of that there is an API that gets requests for validation, starts jobs and returns status and outputs (and handles authorization, of course). Even higher than that there will be the web UI for integration and reporting, but I'm less concerned about that now, as it could be just a client on top of the API.
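The API layer itself could then stay very thin. Here is a sketch using Flask purely as an example; the endpoint paths, payload fields and the authorization check are all invented.

```python
# Sketch of a thin HTTP API over the queue, using Flask only as an example.
# Endpoint paths, payload fields and check_token() are hypothetical.
from flask import Flask, jsonify, request

from tasks import validate_files  # the Celery task sketched above

api = Flask(__name__)


@api.route("/jobs", methods=["POST"])
def create_job():
    # check_token(request)  <- authorization would happen here
    payload = request.get_json()
    task = validate_files.delay(payload["files"])
    return jsonify({"job_id": task.id}), 201


@api.route("/jobs/<job_id>")
def job_status(job_id):
    result = validate_files.AsyncResult(job_id)
    body = {"job_id": job_id, "status": result.status}
    if result.successful():
        body["report"] = result.result
    return jsonify(body)
```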
I know we've been focusing on GitHub and I agree that's the most valuable integration to begin with, but I think having this generic API on top would allow much more flexibility with other third-party integrations. GitHub would be just another "integration" or plugin on top of the API, which would handle the specifics of each source, eg:
- The GitHub integration would pass the .goodtables.yml file and a list of files to validate (and maybe the necessary credentials if it's a private repo?)
- A bucket-based integration (eg S3) would pass the .goodtables.yml file and a list of files to validate (and maybe the necessary credentials if it's a private bucket?)

Other sources could interact with the API directly:
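For a source with no dedicated plugin, interacting with the API directly could be as simple as posting explicit file URLs and polling for the report. The host, paths and field names below just mirror the hypothetical API sketched above.

```python
# Hypothetical direct use of the API by any other source: post explicit
# file URLs, then poll for the result. Host, paths and field names only
# mirror the illustrative API above, not a real service.
import requests

API = "https://validation.example.org"

resp = requests.post(f"{API}/jobs", json={
    "files": ["https://example.org/data/budget.csv"],
})
job_id = resp.json()["job_id"]

status = requests.get(f"{API}/jobs/{job_id}").json()
print(status["status"])
```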
I haven't given much thought to the actual tech involved in this, but I'm definitely thinking that GitLab CI might not be the best fit. It does give us a lot in terms of CI infrastructure (runners, etc.), but at the expense of having to fit each of the different sources into a git-based workflow, and of having to design any higher-level API within the constraints of what GitLab offers. Perhaps something simpler based on a queue and job runners gives us more flexibility to start with.
How does this sound as a general approach? Can you see any major drawbacks?
If that sounds good, any thoughts on technology that could fit this scenario? Jenkins, Celery?
Would love your feedback on this.
The source for the drawings is here.