Schema editor #65

Open
amercader opened this issue Jul 28, 2022 · 5 comments

amercader commented Jul 28, 2022

Goal

Allow publishers to define the schema of tabular data as part of the resource creation process, internally generating a Table Schema that gets stored as the schema field
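For reference, the schema field would end up holding a Table Schema descriptor roughly along these lines (the column names are purely illustrative):

```python
# Illustrative Table Schema descriptor as it might be stored in the
# resource's "schema" field (column names invented for the example).
table_schema = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "created", "type": "date"},
    ],
    "primaryKey": "id",
}
```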

Prior work

@roll worked on an initial implementation a few years ago (ancient PR here: #25). It used tableschema-ui to render the UI and, under the hood, tableschema-js to infer the data schema and generate a Table Schema object.

(Video demo: ckanext-validation.mp4)

Implementation options

UI-wise, it is understood that we need to update the component to use the new version, and that the UI/UX, form design, etc. definitely need to be improved, but we have different options for the schema-inferring part.

Option 1: Keep the inferring in the client with tableschema-js

Pros:

  • Better UX as the schema can be modified before uploading the file
  • Easier to integrate in CKAN's resource creation flow, i.e. we use the component to generate a JSON Table Schema that gets submitted directly in the schema field
  • File size doesn't seem to be a concern: I tested an 800 MB file and the schema was inferred without issue; I assume it parses only a subset of the rows

Cons:

  • What are the plans for tableschema-js? Can we rely on it long term?
  • How good is the inferring? I assume most if not all recent work in this area has gone into frictionless-py
  • Would the schema generated by tableschema-js match the one generated by frictionless-py? Right now this is not important, but I can imagine us having to implement some sort of server-side inferring for background jobs, etc. Could we end up with inconsistencies between schemas generated by the two systems?

Option 2: Use frictionless-py for the inferring

This of course requires the file to be uploaded to the server, as I don't think WASM-based solutions are ready for general production use (a minimal inferring sketch follows the pros and cons below).

Pros:

  • We focus our efforts on just one Frictionless library (frictionless-py), the one that is arguably better supported

Cons:

  • Two-step process for creating a resource (three if we count the previous dataset metadata step): the file needs to be uploaded first, and only then can the schema be returned to the user for tweaking.
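For reference, the server-side inferring itself boils down to a single call in frictionless-py; a minimal sketch (the file path is illustrative):

```python
# Minimal server-side inference sketch with frictionless-py (illustrative path).
from frictionless import describe

resource = describe("uploads/example.csv")  # infers dialect and schema from a sample of the file
print(resource.schema)                      # Table Schema: field names, types, constraints
```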

Option 2a: Create the resource, infer the schema later

Users would create a resource normally and, once it is created, we would infer the schema, redirect the user to a new step with the schema editor and allow them to tweak it further (at this stage the inferred schema could already be stored in the created resource).

Option 2b: Upload the file first, infer the schema, create the resource later

This would be difficult to implement because right now uploads are closely tied to the actual resource, but we can imagine an implementation where the file is uploaded first (or linked) and stored somewhere temporary, we run the inferring and return the result to the user, who then proceeds to create the resource, which is somehow linked to the uploaded file.

amercader added the v2 label Jul 28, 2022
amercader added this to the v2 milestone Jul 28, 2022
amercader commented

@roll, any thoughts on tableschema-js vs frictionless-py? (see above)

roll commented Nov 7, 2022

Hi @amercader,

I would vote for the frictionless-py route, as tableschema-js is more or less in maintenance-only mode and, realistically, OKFN will not be able to support it long-term. It has actually already been moved from our core products to the "Universe" and has only been lightly maintained by Datopian.

Technically, my suggestion would be:

  • Create an endpoint that accepts a URL or a file sample and returns a Resource descriptor inferred by frictionless-py (a rough sketch follows below).
  • In the CKAN UI, act on the file upload change event, so the user gets the Resource/Schema editor during the main resource creation step.
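Something along these lines (the route, field names and error handling are placeholders only; in CKAN this would presumably be wired up through a blueprint):

```python
# Sketch only: route and field names are placeholders, error handling omitted.
from flask import Blueprint, jsonify, request
from frictionless import describe

bp = Blueprint("schema_infer", __name__)

@bp.route("/api/infer-schema", methods=["POST"])
def infer_schema():
    sample = request.files["upload"].read()      # file sample sent by the UI
    fmt = request.form.get("format", "csv")
    resource = describe(sample, format=fmt)      # frictionless-py inference
    # The exact descriptor serialization helper may differ between frictionless versions
    return jsonify(resource.to_dict())
```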

I think a one-step architecture is more promising, as it might later be used to provide types for the Data Pusher / Indexer, although compatibility with Excel and similar file formats still needs to be investigated.

PS: Regarding the UI, I think we need to wait a few months for the new generation of Frictionless Components to be released.

amercader commented

@roll sorry, revisiting this after a while. When you say

  • Create an endpoint that accepts a URL or a file sample and returns a Resource descriptor inferred by frictionless-py.
  • In the CKAN UI, act on the file upload change event, so the user gets the Resource/Schema editor during the main resource creation step.

do you mean the following:

  1. User clicks on "Upload" and selects a file
  2. We listen to the file input event and, if it's a suitable file (i.e. tabular), we do a background HTTP request sending a sample of the file (or all of it if it's small enough) to an endpoint that takes the sample tabular data and outputs a Table Schema descriptor
  3. With the returned Table Schema descriptor we render the Schema Editor component

So essentially this is option 2c: upload a sample of the file, infer the schema, create the resource (and upload the file).
Conceptually it doesn't seem far from 2b, but without the complexity of refactoring the whole CKAN upload process, so it could be a good first step.
Any thoughts on how big a sample we should upload to get reliable results?

although compatibility with Excel and similar file formats still needs to be investigated
What do you mean, that for Excel files we might not be able to get a sample?

roll commented Nov 21, 2022

@amercader
Yes, that's a good description of the flow 👍 By default, frictionless uses quite minimalistic settings:

  • buffer size for encoding inference: 10,000 bytes
  • sample size for dialect/schema inference: 100 lines

In most cases this works fine, and the user will be able to tweak the results anyway; a minimal sketch of these settings is below.
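Both can be overridden through the Detector if the defaults ever prove too small; a sketch showing the defaults explicitly:

```python
# Defaults made explicit; raise them if inference misses something.
from frictionless import Detector, describe

detector = Detector(buffer_size=10000, sample_size=100)  # bytes for encoding, rows for dialect/schema
resource = describe("data.csv", detector=detector)
print(resource.schema)
```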

Regarding Excel, I think it will require sending the whole file to the server (or reading it client-side), simply because of the format structure (the ZIP index is written at the end of the file). I guess Excel is not so sensitive to the size problem, as really big data usually comes in CSV.

amercader commented

Revised implementation plan after discussion with @aivuk (a rough sketch of the first step follows the list):

  1. When the user selects a tabular file, we upload it in the background using a custom endpoint that:
    • Creates a new resource with just the uploaded file
    • Infers the schema using the whole uploaded file
    • Returns the new resource_id and the inferred schema
  2. The user can keep entering the rest of the fields, and when we get the inferred schema we update the UI to show a preview and the schema editor
  3. When the user clicks "Save" (or "Save and add another") we call another custom endpoint that calls resource_patch on the previously created resource with the rest of the submitted values.
  4. If the user clicks "Cancel" (or leaves the page?) we delete the resource and the file
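A rough sketch of what the custom endpoint in step 1 could look like as a CKAN action (the action name, parameters and the way the uploaded file is read back are all assumptions, not settled design):

```python
# Sketch only: action name, parameters and file access are assumptions.
import ckan.plugins.toolkit as toolkit
from frictionless import describe


def resource_create_with_schema(context, data_dict):
    # 1. Create a bare resource holding just the uploaded file
    resource = toolkit.get_action("resource_create")(context, {
        "package_id": data_dict["package_id"],
        "upload": data_dict["upload"],
    })

    # 2. Infer the schema from the whole uploaded file with frictionless-py
    #    (in practice we would read the file from upload storage rather than
    #    through its public URL)
    inferred = describe(resource["url"])

    # 3. Return the new resource id and the inferred schema to the UI
    return {
        "resource_id": resource["id"],
        "schema": inferred.schema.to_dict(),
    }
```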
