Add a `dialect.type` property (with table dialect reorganized) #82

khusmann · 2024-06-24T18:09:28Z

Here's another attempt at dialect.type, rebased on Peter's new structure (see #74).

dialect.type also makes it possible to require different fields for different dialect types. For example, for a database dialect type, does it make sense to have table be an optional property? With spreadsheets, there's at least the concept of the "first" spreadsheet that can be loaded by default -- but with databases, is there such a thing as a "first" table? If not, I think dialect.table should be required for database dialect types.
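A minimal sketch of how an implementation might enforce type-specific required properties. The `REQUIRED_BY_TYPE` table and the `missing_required` helper are hypothetical names for illustration; only the rule that a database dialect requires `table` comes from the proposal above.

```python
# Hypothetical lookup of required properties per dialect type.
# Only the "database" -> "table" rule is taken from the discussion;
# other types are assumed to have no required properties here.
REQUIRED_BY_TYPE = {
    "database": ["table"],
}


def missing_required(dialect):
    """Return the required properties that the dialect descriptor lacks."""
    required = REQUIRED_BY_TYPE.get(dialect.get("type"), [])
    return [name for name in required if name not in dialect]


print(missing_required({"type": "database"}))  # ['table']
print(missing_required({"type": "database", "table": "observations"}))  # []
print(missing_required({"type": "spreadsheet"}))  # []
```

Because the check is keyed on `type`, it can run as soon as the descriptor is read, before any data file is opened.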

khusmann · 2024-06-24T18:11:03Z

(Tagging @peterdesmet @ezwelty for review!)

roll · 2024-06-25T08:52:46Z

Looks great @khusmann!

BTW, I think we need to list type under Properties as well, for consistency with other specs (and in the listings under the type headings). We can use our standard language, e.g. "A Table Dialect descriptor MAY contain a property `type` that MUST be a string with the following possible values and the `delimited` value by default:"

I agree that table needs to be required.

khusmann · 2024-06-25T17:18:03Z

@roll excellent points

One more thing I just realized -- $schema should only default to v1.0 (CSV Dialect) when dialect.type is not set. Otherwise it should use v2.0.

Update summary:

- Added type under Properties, as well as references under the type headings.
- Required the table property for databases.
- Made $schema default to v1.0 when dialect.type is not set, and to v2.0 when dialect.type is set.
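The `$schema` defaulting rule can be sketched as follows. This is an illustration under stated assumptions, not the spec's wording: the strings `"v1.0"` and `"v2.0"` are placeholders for the actual profile identifiers, the function name is hypothetical, and letting an explicit `$schema` take precedence is an assumption.

```python
def default_schema_version(dialect):
    """Pick the profile version to validate a dialect descriptor against."""
    if "$schema" in dialect:
        # Assumption: an explicit $schema is used as-is.
        return dialect["$schema"]
    # No $schema given: default to v1.0 (CSV Dialect) only when type is
    # unset; a descriptor that sets type is treated as v2.0.
    return "v2.0" if "type" in dialect else "v1.0"


print(default_schema_version({"delimiter": ";"}))  # v1.0
print(default_schema_version({"type": "delimited"}))  # v2.0
```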

roll · 2024-06-27T12:14:33Z

@khusmann
Unfortunately, we must migrate this to the merged spec/datapackage site: https://github.com/frictionlessdata/datapackage. I can assist, but it would be great to keep author credits.

khusmann · 2024-06-27T17:34:20Z

No worries! This would mean it would be for v2.1, right? Unfortunately, I'm not sure dialect.type really gives us much value unless it's in v2.0. If we put it in v2.1, we would have to fall back on dialect discovery when dialect.type is undefined, rather than defaulting to delimited (to maintain backward compatibility). So rather than simplifying implementation and clarifying the producer's dialect intent by making the descriptor a tagged union, it'd just become an extra property that data producers may or may not use in their definitions.
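To make the difference concrete, here is a rough sketch of the two behaviors. Both function names and the `discover` callback are hypothetical; `discover` stands in for whatever dialect discovery an implementation already performs.

```python
def resolve_type_v20(dialect):
    """Tagged-union behavior: a missing type simply means "delimited"."""
    return dialect.get("type", "delimited")


def resolve_type_v21(dialect, discover):
    """Backward-compatible behavior: a missing type triggers discovery.

    `discover` is a hypothetical zero-argument callback that inspects the
    data source and returns a dialect type name.
    """
    if "type" in dialect:
        return dialect["type"]
    return discover()


print(resolve_type_v20({}))  # delimited
print(resolve_type_v21({}, lambda: "spreadsheet"))  # spreadsheet
```

In the first variant the descriptor alone determines the type; in the second, the answer can depend on the data file.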

It's not a super big deal to miss out on this, it just means that we can't do as much up-front validation of table dialect descriptors (and make the data producer's dialect type choice explicit) -- implementations will have to postpone the final validation of the dialect properties until the data file format is determined.

Moving forward, I think it'd be better to focus on expanding / clarifying which data formats can/should be associated with which dialect types (because the data format has now effectively become the "tag" in this type union).
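As a sketch of what "data format as the tag" could look like in an implementation: the mapping below is purely an assumption for illustration (the thread does not define which formats belong to which dialect types), and the function name is hypothetical.

```python
# Hypothetical association of data formats with dialect types.
# These pairings are illustrative assumptions, not taken from the spec.
FORMAT_TO_DIALECT_TYPE = {
    "csv": "delimited",
    "tsv": "delimited",
    "json": "structured",
    "xlsx": "spreadsheet",
    "sqlite": "database",
}


def dialect_type_for_format(data_format):
    """Look up the dialect type implied by a data format name or extension."""
    key = data_format.lower().lstrip(".")
    if key not in FORMAT_TO_DIALECT_TYPE:
        raise ValueError(f"no dialect type associated with format {data_format!r}")
    return FORMAT_TO_DIALECT_TYPE[key]


print(dialect_type_for_format("CSV"))  # delimited
print(dialect_type_for_format(".xlsx"))  # spreadsheet
```

With a table like this, validation of type-specific dialect properties has to wait until the format is known, which is the postponement described above.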

ezwelty

@khusmann Thanks for the great work here. I made several suggestions with respect to wording, and some with respect to logic.

I think the structure works, but wonder whether it was a good idea to separate the types from their properties. Sure, it avoids some repetition, but it leads to what I see as some bigger issues:

- It is more complicated to assemble a descriptor, since one must first look up the type, then read the definition of each property associated with it, then go back to the type definition for the relationships between the properties and their defaults.
- The defaults for each property are defined in multiple places: once where the property is defined, and once where each associated type is defined. So what happens when the default depends on the type, and/or on the value of another property?
- We say "A Table Dialect descriptor MAY have the X property" but this is misleading for properties that are required for some types, or may be required depending on the value or presence/absence of another property (again, type-specific).
- Most importantly, the definition of each property is best written in the context of the type. For example, header has a different meaning for delimited (line with field names) than for structured with itemType: array (array with field names), and commentChar has (implicitly) a different meaning for delimited (first character of line) than for spreadsheet (first character of cell A in a row).
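The header example in the last point can be illustrated with a small sketch. The helper names and sample data are hypothetical; the code only shows that "header" resolves to different things depending on the dialect type.

```python
import csv
import io


def header_from_delimited(text):
    """delimited: the header is the first line, split into field names."""
    return next(csv.reader(io.StringIO(text)))


def header_from_structured_arrays(items):
    """structured with itemType "array": the header is the first array."""
    return list(items[0])


csv_text = "id,name\n1,apple\n"
json_items = [["id", "name"], [1, "apple"]]

print(header_from_delimited(csv_text))  # ['id', 'name']
print(header_from_structured_arrays(json_items))  # ['id', 'name']
```

The same property name yields the same field names here, but through type-specific reading logic, which is why a type-specific description of header may be clearer than a single shared one.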

This isn't the direction the documentation has been evolving, but I would have suggested a structure similar to Table Schema's field type and "additional properties" (e.g. an integer field's decimalChar), so that each type is described in full in one place, and repetition can be avoided when appropriate by referring to the property's definition in a previously mentioned type. Each type section would include a full list of properties (as now), their defaults and relationships (somewhat as now), short but type-specific descriptions (omitting the "A Table Dialect descriptor MAY have the X property that ..."), and type-specific example(s).

ezwelty · 2024-06-27T13:23:14Z