Add a `categorical` field type [field property version] #68

khusmann · 2024-05-29T05:13:11Z

Here's the latest categorical alternative approach that simply extends the existing string and integer types rather than attempting to be a top-level field type.

More info / rationale here in this thread from the previous attempt PR: frictionlessdata/datapackage#48 (comment)

@pschumm @ezwelty @djvanderlaan

djvanderlaan · 2024-05-29T07:21:10Z

Tagging Albert-Jan @fomcl

ezwelty · 2024-05-29T19:50:16Z

content/docs/specifications/table-schema.md

+
+`string` and `integer` field types `MAY` include a `categories` property to indicate that the field contains categorical data, and the field `MAY` be loaded as a categorical data type if supported by the implementation. The `categories` property `MUST` be an array of values or an array of objects that define the levels of the categorical.
+
+When the `categories` property is an array of values, the values `MUST` be unique and `MUST` match logical values of the field. For example:


MUST match logical values of the field

This sounds like categories cannot contain a value that is not present in the data, but I believe we intend the reverse: the field cannot contain a value that is not in categories. It also seems that the uniqueness constraint should apply whether categories is an array of values or an array of objects.

Good points! Just made some clarifications in the latest commits. Let me know if it looks good or if you have other rephrasings I should try!

ezwelty · 2024-05-30T09:44:45Z

@khusmann I'm thumbs up on this approach since it solves the typing confusion, but I feel the description is a tad wordy and technical sounding, mixing "level" and "value" and "categories" for the same thing and not stating until much later that categories restricts the valid values of a field. I also don't see why the labels should be required to be unique. R allows this ("Duplicated values in ‘labels’ can be used to map different values of ‘x’ to the same factor level."). It is also odd to suggest (I suspect in error) that the label property MUST be human-readable (they can be gobbledygook if that's what people want). Here is my attempt at an edit (I presume you are able to edit/view the raw markdown?):

string and integer field types MAY include a categories property to restrict the field to a finite set of possible values (similar to an enum constraint) and indicate that the field MAY be loaded as a categorical data type if supported by the implementation. The categories property MUST be either (a) an array of unique values or (b) an array of objects, each with a unique value property. The logical representation of data in the field MUST exactly match one of the values in categories.

Suppose we have a field fruit with possible values "apple", "orange", or "banana". The field definition would look like this if categories is (a) an array of values:

{
  "name": "fruit",
  "type": "string",
  "categories": ["apple", "orange", "banana"]
}

If categories is (b) an array of objects, each object MAY also have a label property, which when present, MUST be a string. In our example, this allows us to store our fruit with values 0, 1, and 2 in an integer field and label them as "apple", "orange", and "banana":

{
  "name": "fruit",
  "type": "integer",
  "categories": [
    { "value": 0, "label": "apple" },
    { "value": 1, "label": "orange" },
    { "value": 2, "label": "banana" }
  ]
}
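As a minimal sketch of how an implementation might read both forms of `categories` described above (this is not part of the PR; `normalize_categories` and `is_valid_value` are hypothetical helper names), the two shapes can be reduced to one list of value/label pairs, checking uniqueness of values along the way:

```python
def normalize_categories(field):
    """Return a list of (value, label) pairs for a field descriptor.

    Accepts both forms described above: (a) an array of values, or
    (b) an array of objects with a "value" and an optional "label".
    Hypothetical helper, not an API of any Frictionless library.
    """
    pairs = []
    for item in field.get("categories", []):
        if isinstance(item, dict):
            label = item.get("label")
            if label is not None and not isinstance(label, str):
                raise ValueError(f"label must be a string: {label!r}")
            pairs.append((item["value"], label))
        else:
            pairs.append((item, None))
    values = [value for value, _ in pairs]
    if len(set(values)) != len(values):
        raise ValueError("category values must be unique")
    return pairs


def is_valid_value(field, logical_value):
    """Check that a logical value is one of the declared categories."""
    return logical_value in {value for value, _ in normalize_categories(field)}
```

Note that the check runs in one direction only: every value in the data must appear in `categories`, while a category that never occurs in the data is still valid.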

When the categories property is defined, it MAY be accompanied by a categoriesOrdered property in the field definition. When present, the categoriesOrdered property MUST be boolean. When categoriesOrdered is true, implementations SHOULD regard the order of appearance of the values in the categories property as their natural order. For example:

{
  "name": "agreementLevel",
  "type": "integer",
  "categories": [
    { "value": 1, "label": "Strongly Disagree" },
    { "value": 2 },
    { "value": 3 },
    { "value": 4 },
    { "value": 5, "label": "Strongly Agree" }
  ],
  "categoriesOrdered": true
}

When the property categoriesOrdered is false or not present, implementations SHOULD assume that the categories do not have a natural order.
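To illustrate what "order of appearance as natural order" could mean in practice, here is a rough sketch (not from the PR; `category_rank` and the `size` field are hypothetical) that derives a sort key from the position of each value in `categories`, and only does so when `categoriesOrdered` is explicitly true:

```python
def category_rank(field):
    """Map each category value to its position in `categories`.

    Returns None unless `categoriesOrdered` is explicitly true, so callers
    cannot accidentally treat an undeclared order as meaningful.
    """
    if field.get("categoriesOrdered") is not True:
        return None
    rank = {}
    for position, item in enumerate(field["categories"]):
        value = item["value"] if isinstance(item, dict) else item
        rank[value] = position
    return rank


# Hypothetical field whose natural order differs from alphabetical order.
size = {
    "name": "size",
    "type": "string",
    "categories": ["low", "medium", "high"],
    "categoriesOrdered": True,
}
rank = category_rank(size)
ordered = sorted(["high", "low", "medium"], key=rank.__getitem__)
# ordered == ["low", "medium", "high"]
```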

An enum constraint MAY be added to a field with a categories property, but if so, the enum values MUST be a subset of the values in categories.
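The subset rule for `enum` could be checked along these lines. This is a sketch only: it assumes `enum` sits under the field's `constraints` object, as elsewhere in Table Schema, and `enum_within_categories` is a hypothetical name.

```python
def enum_within_categories(field):
    """Return True if every enum value is also a declared category value.

    Fields without an enum constraint, or without categories, pass trivially.
    """
    enum = field.get("constraints", {}).get("enum")
    if enum is None or "categories" not in field:
        return True
    allowed = {
        item["value"] if isinstance(item, dict) else item
        for item in field["categories"]
    }
    return all(value in allowed for value in enum)
```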

khusmann · 2024-05-30T14:08:09Z

Awesome @ezwelty, thanks for these edits! Just merged them in. It definitely reads a lot smoother now.

I also don't see why the labels should be required to be unique. R allows this ("Duplicated values in ‘labels’ can be used to map different values of ‘x’ to the same factor level.").

This was discussed in an earlier thread -- we decided against allowing this because collapsing categories should be considered a separate operation: frictionlessdata/datapackage#875 (comment)

pschumm · 2024-06-02T12:49:37Z

When the property categoriesOrdered is false or not present, implementations SHOULD assume that the categories do not have a natural order.

I think this latest round of edits is excellent, though I have a concern with the sentence above; specifically, when categoriesOrdered is not present, I believe that no assumptions should be made. For example, this property might be excluded because the data producer may not be familiar with the analytic concept of an ordinal variable. Alternatively, there can occasionally be legitimate ambiguity about whether a variable is ordered or not, and the data producer may have chosen to represent this by leaving this property off (i.e., leaving it up to the data consumer to decide this). Finally, the property may have simply been excluded in error. Thus, I would prefer that we say the following:

When the property categoriesOrdered is false, implementations SHOULD assume that the categories do not have a natural order; when the property is not present, no assumption about the ordered nature of the values SHOULD be made.

khusmann · 2024-06-03T18:33:02Z

Thanks @pschumm! Just merged your edit.

Alternatively, there can occasionally be legitimate ambiguity about whether a variable is ordered or not

This is the most convincing argument to me. This effectively means we have 3 valid types of ordering for categoricals: unordered, ordered, and unknown. Then, when an implementation needs to convert it to an unordered or ordered type for analysis, summary, display, etc. it could warn ("No ordering specified for categorical, assuming unordered") or prompt the user to choose how it should be handled for that action.

We'll also encounter unknown ordering when importing categoricals from a source that doesn't support ordering (e.g. a DuckDB / Parquet enum) -- it's good to have a representation for that case, instead of making assumptions about it.
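A small sketch of the three-state idea, under the assumption that an implementation models it as an enum and falls back to unordered with a warning (the names `Ordering`, `declared_ordering`, and `resolve_ordered` are hypothetical, not from any existing library):

```python
import warnings
from enum import Enum


class Ordering(Enum):
    UNORDERED = "unordered"
    ORDERED = "ordered"
    UNKNOWN = "unknown"


def declared_ordering(field):
    """Read `categoriesOrdered` without assuming anything when it is absent."""
    flag = field.get("categoriesOrdered")
    if flag is True:
        return Ordering.ORDERED
    if flag is False:
        return Ordering.UNORDERED
    return Ordering.UNKNOWN


def resolve_ordered(field):
    """Return a bool for consumers that must pick one; warn if unknown."""
    ordering = declared_ordering(field)
    if ordering is Ordering.UNKNOWN:
        warnings.warn("No ordering specified for categorical, assuming unordered")
        return False
    return ordering is Ordering.ORDERED
```

Keeping `declared_ordering` separate from `resolve_ordered` lets a tool preserve the unknown state on round-trips and only commit to a choice at the point where an ordered or unordered type is actually required.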

peterdesmet · 2024-06-04T11:38:03Z

Nice work @khusmann (and co-authors)! Reads well and leaves room to implement where useful, with a reasonable fallback to just regular string/integer (with enum).

roll · 2024-06-05T08:20:54Z

@khusmann
Absolutely tremendous work of leading this through all these iterations 👏 And big thanks to all the contributors 🎉

roll · 2024-06-05T08:21:42Z

ACCEPTED by WG (6/9)

peterdesmet · 2024-06-05T08:36:55Z

@roll, while this is now expressed as documentation, I assume changes need to be made to the profiles as well? https://github.com/frictionlessdata/datapackage/tree/main/profiles/source

roll · 2024-06-05T08:38:30Z

Sorry, missed it. I'll add it now.

peterdesmet · 2024-06-05T08:43:10Z

Great! As a new PR I assume?

khusmann added 2 commits May 28, 2024 22:05

add categories / categoriesOrdered field properties

c4bc097

add modifications to missingValues

daedf3a

minor formatting edits

d70253d

ezwelty reviewed May 29, 2024

khusmann added 3 commits May 29, 2024 13:15

fix typo in missingValues section

734faab

clarify that defined categorical levels do not have to be present in the data

d11abef

clarify uniqueness of values / labels

fb81a04

reword with @ezwelty's edits

ff839dc

when categoriesOrdered is not present, do not assume nature of order

440f7c8

Merge branch 'main' into categorical_ft2

7c73f17

roll merged commit e05eb9a into frictionlessdata:main Jun 5, 2024
1 check passed

This was referenced Jun 5, 2024

Updated the profiles to reflect categories and labeled missingness #70

Merged

Support for labeled missingness frictionlessdata/datapackage#880

Closed

Promote the Enum Labels and Ordering pattern to the Table Schema spec? frictionlessdata/datapackage#875

Closed
