Create a `.qsv` file format that is an implementation of W3C's CSV on the Web #1982

jqnatividad · 2024-07-18T18:39:05Z

Currently, qsv creates, consumes and validates CSV files hewing closely to the RFC4180 specification as interpreted by the csv crate.

However, the format gives us no place to store additional metadata, whether about the CSV file itself (dialect, delimiter used, comments, DOI, URL, etc.) or about the data it contains (summary statistics, data dictionary, creator, last updated, hash of the data, etc.).

The request is to create a .qsv file format that implements W3C's CSV on the Web specification, following the guidance at https://csvw.org. The qsv file would store schemata, metadata and data together: not just the schema info, but also summary and frequency statistics. It would also act as a container for DCAT 3/CKAN package/resource metadata, etc.
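
To make the CSV on the Web part concrete, here is a minimal sketch of a CSVW table description built from a CSV header. The `csvw_descriptor` helper and the `fruits.csv` sample are hypothetical and not part of qsv. The `@context`, `url`, `dialect` and `tableSchema` keys come from the CSVW metadata vocabulary, and every column is typed as a string purely for illustration.

```python
import csv
import io
import json


def csvw_descriptor(csv_text: str, url: str, delimiter: str = ",") -> dict:
    """Build a minimal CSVW table description from a CSV header row.

    Hypothetical helper for illustration only; not a qsv API.
    """
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": url,
        "dialect": {"delimiter": delimiter, "header": True},
        "tableSchema": {
            # Assumption: every column is a string. A real profiler would
            # infer datatypes from the data itself.
            "columns": [{"name": name, "datatype": "string"} for name in header]
        },
    }


sample = "fruit,price\napple,2.50\nbanana,3.00\n"
descriptor = csvw_descriptor(sample, "fruits.csv")
print(json.dumps(descriptor, indent=2))
```

Summary statistics, frequency tables and DCAT metadata would need their own place next to this description; how they would be laid out is not decided in this issue.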

Doing so will unlock additional capabilities in qsv, qsv pro, Datapusher+ and CKAN.

It will also let us clean up and consolidate the "metadata" files that qsv creates (the stats cache files, the index file, etc.) and package the CSV and its associated metadata in one container as a signed zip file.
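
A rough sketch of the single-container idea, assuming a plain zip archive with a hypothetical layout (`manifest.json`, `stats.json`, `data.csv`). The file names and the `build_package`/`verify_package` helpers are illustrative assumptions, not a proposed spec. The SHA-256 hash only detects accidental changes to the data; it is not a substitute for actually signing the archive.

```python
import hashlib
import io
import json
import zipfile


def build_package(csv_bytes: bytes, stats: dict) -> bytes:
    """Bundle a CSV and its metadata into one in-memory zip archive."""
    manifest = {
        "data": "data.csv",
        "stats": "stats.json",
        # Hash of the raw CSV, so consumers can detect a mismatch.
        "sha256": hashlib.sha256(csv_bytes).hexdigest(),
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.json", json.dumps(manifest))
        zf.writestr("stats.json", json.dumps(stats))
        zf.writestr("data.csv", csv_bytes)
    return buf.getvalue()


def verify_package(package: bytes) -> bool:
    """Check that the stored hash still matches the CSV in the archive."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        data = zf.read(manifest["data"])
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]


csv_bytes = b"fruit,price\napple,2.50\nbanana,3.00\n"
package = build_package(csv_bytes, {"rows": 2, "columns": 2})
print(verify_package(package))
```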

It will also make "harvesting" and federation with CKAN easier and more robust as all the needed data/metadata is in one container.


jqnatividad · 2024-07-18T18:50:32Z

Also consider https://digital-preservation.github.io/csv-schema/

rzmk · 2024-08-02T00:31:42Z

Experimenting with this:

Sample .qsv file in this ZIP: fruits.qsv.zip (can't share .qsv on GitHub).

jqnatividad · 2024-08-18T12:37:52Z

For comparison, note that several popular file formats are actually compressed "packages":

All the Office Open XML formats (docx, pptx and xlsx) - https://support.microsoft.com/en-us/office/open-xml-formats-and-file-name-extensions-5200d93c-3449-4380-8e11-31ef14555b18
xlsx in particular is a zip archive of XML files - https://www.onlyoffice.com/blog/2024/03/xlsx
Shapefile is composed of multiple files, often distributed as a zip file

rzmk · 2024-08-18T14:59:46Z

It would be nice if a .qsv file carried a flag that can be checked quickly to tell whether the file has been validated, along with whether an index is available.
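
One way such a quick check could work, sketched under assumptions: the archive is a zip, a hypothetical `manifest.json` carries a `validated` field, and the index is a hypothetical `data.csv.idx` member. Because zip archives have a central directory, reading the manifest and listing member names does not require decompressing the CSV itself.

```python
import io
import json
import zipfile


def quick_status(package: bytes) -> dict:
    """Report validation and index status without reading the CSV data."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        # Only the small manifest member is decompressed here.
        manifest = json.loads(zf.read("manifest.json"))
        return {
            "validated": bool(manifest.get("validated", False)),
            # Hypothetical index member name.
            "indexed": "data.csv.idx" in zf.namelist(),
        }


# Build a tiny sample archive to exercise the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("manifest.json", json.dumps({"validated": True}))
    zf.writestr("data.csv", "fruit,price\napple,2.50\n")

status = quick_status(buf.getvalue())
print(status)  # {'validated': True, 'indexed': False}
```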

jqnatividad · 2024-08-18T18:57:29Z

Right @rzmk! The .qsv file, once implemented, is guaranteed to be ALWAYS valid, as the associated metadata/cache files will always be consistent with the core DATA stored in the archive. We can further ensure security by zipsigning the file so it cannot be tampered with.

Further, we can assign a Digital Object Identifier (DOI) to each qsv file so we can track/trace its provenance, and possibly, downstream use.

jqnatividad · 2024-08-19T12:24:30Z

If done properly, even with all the extra metadata in the .qsv package, a .qsv file will be smaller than the raw CSV!
This is because CSV files tend to compress very well, typically shrinking by 80-90%, while all that extra metadata (stats, frequency tables, etc.) is tiny, just a few KBs, even for multi-gigabyte CSV files.
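
A quick way to sanity-check the size argument on your own data. The sketch below uses synthetic, highly repetitive rows and a stand-in metadata blob, so the saving it prints is not representative of real CSV files; swap in a real file to measure the actual ratio.

```python
import json
import random
import zlib

random.seed(0)  # fixed seed so the synthetic data is reproducible
fruits = ["apple", "banana", "cherry", "mango"]

lines = ["fruit,price,quantity"]
for _ in range(50_000):
    lines.append(
        f"{random.choice(fruits)},{random.randint(1, 9)}.50,{random.randint(1, 100)}"
    )
raw = "\n".join(lines).encode()

# Stand-in for stats/frequency metadata; real metadata would be richer.
metadata = json.dumps(
    {"rows": 50_000, "columns": 3, "fruit_values": fruits}
).encode()

# Size of the compressed CSV plus the compressed metadata.
packaged_size = len(zlib.compress(raw)) + len(zlib.compress(metadata))
saving = 1 - packaged_size / len(raw)
print(f"raw: {len(raw)} bytes, packaged: {packaged_size} bytes, saving: {saving:.0%}")
```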

jqnatividad · 2024-08-30T22:50:16Z

The qsv file will contain the cache file (#2097).
It will also have all the metadata describing the dataset using DCAT 3 (in particular, the DCAT-US v3 spec for the first implementation).

jqnatividad · 2024-08-31T22:27:14Z

Related to #1705.
The profile command will create the .qsv file.

Orcomp · 2024-10-09T04:36:18Z

Worth experimenting with different compression algorithms. We have found Zstandard to work very well with csv files.

jqnatividad · 2024-10-09T15:13:52Z

> Worth experimenting with different compression algorithms. We have found Zstandard to work very well with csv files.

Thanks @Orcomp , do you have any benchmarks/metrics you can share? For Zstandard and other compression algorithms you considered?

Orcomp · 2024-10-10T11:52:30Z

You can check out https://morotti.github.io/lzbench-web

(From my personal experience, zstd has a good balance between compression ratio and compress/decompress speeds. I looked into this 2-3 years ago, so things might have changed a bit since.)
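
For anyone who wants to run a comparison on their own CSV files, here is a small harness sketch. It only covers codecs from the Python standard library (zlib, bz2, lzma); Zstandard itself is left out because it needs a third-party package on most Python versions, so this shows the measurement approach rather than a zstd result. The sample data is synthetic.

```python
import bz2
import lzma
import time
import zlib


def benchmark(data: bytes) -> dict:
    """Measure compressed-size ratio and compression time per codec."""
    codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        elapsed = time.perf_counter() - start
        results[name] = {"ratio": size / len(data), "seconds": elapsed}
    return results


# Synthetic, repetitive CSV; replace with the bytes of a real file.
data = ("fruit,price\n" + "apple,2.50\nbanana,3.00\n" * 20_000).encode()
results = benchmark(data)
for name, r in results.items():
    print(f"{name}: {r['ratio']:.4f} of original size in {r['seconds']:.4f}s")
```

Decompression speed matters as much as compression speed for a format that is read often, so a fuller comparison would time both directions.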

jqnatividad · 2024-10-15T02:04:40Z

Instead of just signing the qsv using conventional techniques, "explore using two emerging standards: the W3C Verifiable Credentials Data Model 2.0 and Decentralized Identifiers (DIDs) v1.0 that leverage NIST's FIPS 186-5 but also align well with DCAT RDF model, making both human and machine readable."

See DOI-DO/dcat-us#132

jqnatividad added the enhancement, qsv pro and datapusher+ labels Jul 18, 2024

jqnatividad changed the title ~~Create a .qsv file that is an implementation of W3C's CSV on the Web~~ Create a .qsv file format that is an implementation of W3C's CSV on the Web Jul 19, 2024

jqnatividad added the CKAN label Jul 19, 2024

jqnatividad added the DCAT3 label Aug 30, 2024

BrewTestBot mentioned this issue Oct 8, 2024

qsv 0.136.0 Homebrew/homebrew-core#193278

Merged

jqnatividad pinned this issue Oct 16, 2024
