Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a dialect.type property (with table dialect reorganized) #82

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

khusmann
Copy link
Contributor

@khusmann khusmann commented Jun 24, 2024

Here's another attempt at dialect.type, rebased on Peter's new structure (see #74).

dialect.type also makes it possible to make different fields required for different dialect types. For example, for a database dialect type, does it make sense to have table be an optional property? With spreadsheets, there's at least the concept of the "first" spreadsheet that can be loaded by default -- but with databases, is there such thing as a "first" table? If not, I think dialect.table should be required for database dialect types.

@khusmann
Copy link
Contributor Author

(Tagging @peterdesmet @ezwelty for review!)

@khusmann khusmann changed the title Add a dialect.type property (with table dialect reorganized) Add a dialect.type property (with table dialect reorganized) Jun 24, 2024
@roll
Copy link
Member

roll commented Jun 25, 2024

Looks great @khusmann!

BTW I think we need to list type under Properties as well for consistency with other specs (and in the listings under type headings). We can use our standard language like A Table Dialect descriptor MAY contain a property "type" that MUST be a string with the following possible values and the "delimited" value by default:

I agree that table needs to be required

@khusmann
Copy link
Contributor Author

@roll excellent points

One more thing I just realized -- $schema should only default to v1.0 (CSV Dialect) when dialect.type is not set. Otherwise it should use v2.0.

Update summary:

  • added type under properties, as well as references under the type headings.
  • required the table property for databases
  • made $schema default to v1.0 when dialect.type is not set, but then use v2.0 when dialect.type is set.

@roll
Copy link
Member

roll commented Jun 27, 2024

@khusmann
Unfortunately, we must migrate this to the merged spec/datapackage site - https://github.com/frictionlessdata/datapackage. I can assist but it will be great to keep author credits

@khusmann
Copy link
Contributor Author

khusmann commented Jun 27, 2024

No worries! This would mean it would be for v2.1, right? Unfortunately, I'm not sure dialect.type really gives us much value unless it's in v2.0. If we put it in v2.1 we would have to fallback on dialect discovery when dialect.type is undefined, rather than defaulting to delimited. (To maintain backward compatibility). So rather than simplifying implementation and clarifying producer dialect intent by making the descriptor a tagged union, it'd just become an extra property that data producers may or may not use in their definitions.

It's not a super big deal to miss out on this, it just means that we can't do as much up-front validation of table dialect descriptors (and make the data producer's dialect type choice explicit) -- implementations will have to postpone the final validation of the dialect properties until the data file format is determined.

Moving forward, I think it'd be better to focus on expanding / clarifying which data formats can/should be associated with which dialect types (because the data format now has effectively become the "tag" in this type union)

Copy link

@ezwelty ezwelty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khusmann Thanks for the great work here. I made several suggestions with respect to wording, and some with respect to logic.

I think the structure works, but wonder whether it was a good idea to separate the types from their properties. Sure, it avoids some repetition, but leads to what I see are some bigger issues:

  1. It is more complicated to assemble a descriptor, since one must first look up the type, then read the definition of each property associated with it, then back to the type definition for the relationships between the properties and their defaults.
  2. The defaults for each property are defined in multiple places: once where the property is defined, and once where each associated type is defined. So what happens when the default depends on the type, and/or on the value of another property?
  3. We say "A Table Dialect descriptor MAY have the X property" but this is misleading for properties that are required for some types, or may be required depending on the value or presence/absence of another property (again, type-specific).
  4. Most importantly, the definition of each property is best written in the context of the type. For example header has a different meaning for delimited (line with field names) than for structured with itemType: array (array with field names), and commentChar has (implicitly) a different meaning for delimited (first character of line) than for spreadsheet (first character of cell A in a row).

This isn't the direction the documentation has been evolving, but I would have suggested a structure similar to Table Schema's field type and "additional properties" (e.g. an integer field's decimalChar), so that each type is described in full in one place, and repetition can be avoided when appropriate by referring to the property's definition in a previously mentioned type. Each type section would include a full list of properties (as now), their defaults and relationships (somewhat as now), short but type-specific descriptions (omitting the "A Table Dialect descriptor MAY have the X property that ..."), and type-specific example(s).

@@ -35,44 +35,31 @@ Table Dialect supersedes [CSV Dialect](https://specs.frictionlessdata.io/csv-dia

## Descriptor

Table Dialect descriptor `MUST` be a descriptor as per [Descriptor](../glossary/#descriptor) definition. A list of standard properties that can be included into a descriptor is defined in the [Properties](#properties) section.
Table Dialect descriptor `MUST` be a descriptor as per [Descriptor](../glossary/#descriptor) definition. The descriptor `MAY` include a `type` property to indicate which of optional dialect properties `SHOULD` be considered when reading the target data format. Some [properties](#properties) are generic and can be used for multiple formats, while others are specific to one format. A list of standard dialect types are defined in the [Table Dialect Types](#table-dialect-types) section. The properties that can be used with these types are defined in the [Properties](#properties) section.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The first sentence is a bit wonky with regards to articles. How about: "A Table Dialect descriptor MUST be a descriptor as per the Descriptor definition".
  2. Not all properties are optional, so I would simplify "which of optional dialect properties" to "which properties".
  3. "generic" can have special meanings, and we should clarify that properties are assigned to type, not "target data format". So I would suggest "Some properties apply to multiple types, while others are specific to one type".


For the sake of simplicity, most of examples are written in the CSV data format. For example, this data file without providing any Table Dialect properties:
Table Dialect can be used for different data formats, such as delimited text files, semi-structured formats and spreadsheets. The list of supported dialect types with associated data formats and related properties is as follows.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Table Dialect can be used for different data formats

Table Dialect can be used for different tabular data formats

I know it is in the name "Table Dialect", but I feel that it doesn't hurt to repeat it.


`SHOULD` output this data:
Delimited formats are textual formats such as CSV and TSV. Their charactistics can be expressed the following properties:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This definition feels rather circular to me (uses the term "delimited" and undefined acronyms). How about something like "Text-based formats where each line represents a row and row values are separated by specific delimiter characters, such as CSV (comma-separated values)."
  2. The second sentence needs fixing: "Their charactistics [→ characteristics] can be expressed [by] the following properties". However, I think it is better simpler: "They can be described with the following properties:"
  3. Should we provide a brief example?
id,name
1,apple
2,orange


Structured formats is a group of structured or semi-structured formats such as JSON and YAML. Their charactistics can be expressed the following properties:
Structured formats are structured or semi-structured formats such as JSON and YAML. Their charactistics can be expressed the following properties:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Again, this definition feels rather circular to me, and clearly we don't mean to support any random JSON or YAML. So would this be accurate? "Tabular data represented using structured formats such as JSON and YAML as either an array of objects or an array of arrays." Is there an existing place in the documentation we can link to that defines these? Ah, I see that it is buried in the itemType definition. This doesn't feel ideal.
  2. Again, I suggest "They can be described with the following properties:"
  3. Examples for each type?
[
  {"id": 1, "name": "apple"},
  {"id": 2, "name": "orange"}
]
[
  ["id", "name"],
  [1, "apple"],
  [2, "orange"]
]


- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default
- [type](#type): `structured`
- [$schema](#schema): `https://datapackage.org/profiles/2.0/tabledialect.json` by default
- [header](#header): `true` by default
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect header default for type structured to be true if itemType is array and false or undefined if itemType is object.

@@ -576,6 +623,8 @@ With this dialect definition:

### `sheetName`

**Dialect Types:** [spreadsheet](#spreadsheet)

A Table Dialect descriptor `MAY` have the `sheetName` property that `MUST` be a string; undefined by default. This property specifies a sheet name of a table in the spreadsheet file.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sheetName is specified, does it take precedence over sheetNumber?

A Table Dialect descriptor `MAY` have the `table` property that `MUST` be a string; undefined by default. This property specifies a name of the table in the database.
**Dialect Types:** [database](#database)

A Table Dialect of type `database` `MUST` have a `table` property of type string. This property specifies a name of the table in the database.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PostgreSQL has the concept of "schemas" (namespaces), such that a table's name may not be sufficient to identify it; the schema name may also be needed. Is this not supported, or can the name be a qualified name {schema}.{table}?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently is not supported but there is a issue (frictionlessdata/datapackage#947) for adding it to version 2.1.

I think using {schema}.{table} in dialect.table would be problematic because . is valid in table names in some databases.


- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default
- [type](#type): `delimited`
- [$schema](#schema): `https://datapackage.org/profiles/2.0/tabledialect.json` by default
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the lists of properties should be by order of importance, not alphabetical. So delimiter first, then header, ...


- [$schema](#schema): `https://datapackage.org/profiles/1.0/tabledialect.json` by default
- [type](#type): `structured`
- [$schema](#schema): `https://datapackage.org/profiles/2.0/tabledialect.json` by default
- [header](#header): `true` by default
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the lists of properties should be by order of importance, not alphabetical. So property first (where), then itemType, then itemKeys and header (which only apply to one value of itemType).

@@ -123,23 +113,34 @@ Spreadsheet formats is a group of sheet-based formats such as Excel or ODS. Thei
- [sheetNumber](#sheetnumber): `1` by default
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the lists of properties should be by order of importance, not alphabetical. So sheetName / sheetNumber first (where), followed by the rest.

@khusmann
Copy link
Contributor Author

khusmann commented Jun 27, 2024

@ezwelty Thanks for these comments! But as I mentioned in my previous comment, I'm second-guessing the dialect.type property now because we missed the v2.0 release -- if we put it in v2.1, to maintain backward compatibility with v2.0 we'd have to default to using the data format to determine which dialect properties are relevant, rather than have the intended dialect type always locally indicated during schema validation via dialect.type. Implementations won't be able to count on dialect.type existing to distinguish between the available dialect types, making any use of dialect.type redundant. Also data producers won't have clear motivation to use it: e.g. why use { "type": "spreadsheet", "sheetName": "Sheet 2" } when you can just { "sheetName": "Sheet 2" }. Perhaps you see this differently though? I'm open to being convinced otherwise.

That said I think many of your suggestions you made here are relevant to the organization of table dialect spec even without dialect.type -- maybe we should continue this discussion in a new issue thread in the datapackage repo?

@roll
Copy link
Member

roll commented Sep 17, 2024

Hi @khusmann,

I think thanks to our new immutable profiles concept the change still makes sense even after v2.0 because to start using v2 branch a data publisher needs explicitly set the $schema property to a new version. Currently, very few publishers already migrated to v2.0 so releasing improvements as new versions totally makes sense in my opinion. I can help with migrating this PR to the old/new repo if you'd like

@khusmann
Copy link
Contributor Author

@roll good to hear! I've had my hands full the last couple weeks as I have been transitioning jobs -- if you can help with migrating this PR I'd super appreciate it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants