From 6712fe456dd77e065f6e460f240422144289607a Mon Sep 17 00:00:00 2001 From: Phil Schumm Date: Sun, 6 Aug 2023 08:29:40 -0500 Subject: [PATCH 01/16] Add pattern supporting use of value labels, categoricals and factors --- patterns/README.md | 368 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 368 insertions(+) diff --git a/patterns/README.md b/patterns/README.md index 44acb48a..c64262e5 100644 --- a/patterns/README.md +++ b/patterns/README.md @@ -1030,3 +1030,371 @@ A field MAY have a `missingValues` property that MUST be an `array` where each e ### Implementations None known. + +## Facilitate use of value labels (Stata, SAS and SPSS), categoricals (Python) and factors (R) in software that supports them + +### Overview + +Many software packages for manipulating and analyzing tabular data have special +features for working with categorical variables. These include: + +- Value labels or formats ([Stata](https://www.stata.com/manuals13/dlabel.pdf), + [SAS](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p1upn25lbfo6mkn1wncu4dyh9q91.htm) + and [SPSS](https://www.ibm.com/docs/en/spss-statistics/beta?topic=data-adding-value-labels)) +- [Categoricals (Pandas)](https://pandas.pydata.org/docs/user_guide/categorical.html) +- [Factors (R)](https://www.stat.berkeley.edu/~s133/factors.html) +- [CategoricalVectors (Julia)](https://dataframes.juliadata.org/stable/man/categorical/) + +These features can result in more efficient storage and faster runtime +performance, but more importantly, facilitate analysis by indicating that a +variable should be treated as categorical and by permitting the logical order +of the categories to differ from their lexical order. And in the case of value +labels, they permit the analyst to work with variables in numeric form (e.g., +in expressions, when fitting models) while generating output (e.g., tables, +plots) that is labeled with informative strings. 
+ +While these features are of limited use in some disciplines, others rely +heavily on them (e.g., social sciences, epidemiology, clinical research, +etc.). Thus, before these disciplines can begin to use Frictionless in a +meaningful way, both the standards and the software tools need to support +these features. This pattern addresses the necessary extensions to the +[table schema](https://specs.frictionlessdata.io//table-schema/). + +### Principles + +Before describing the proposed extensions, here are the principles on which +they are based: + +1. Extensions should be software agnostic (i.e., no additions to the official + schema targeted toward a specific piece of software). While the extensions + are intended to support the use of features not available in all software, + the resulting data package should continue to work as well as possible with + software that does not have those features. +2. Related to (1), extensions should only include metadata that describe the + data themselves—not instructions for what a specific software package should + do with the data. Users who want to include the latter may do so within + a sub-namespace such as `custom` (e.g., see Issues [#103](https://github.com/frictionlessdata/specs/issues/103) + and [#663](https://github.com/frictionlessdata/specs/issues/663)). +3. Extensions should be feature-complete (i.e., they should permit full + support of value labels, categoricals and factors by software tools). +4. Extensions must be backward compatible (i.e., not break existing tools, + workflows, etc. for working with Frictionless packages). + +It is worth emphasizing that the scope of the proposed extensions is strictly +limited to the information necessary to make full use of the features for +working with categorical data provided by the software packages listed above. 
+Previous discussions of this issue have occasionally included references to +additional variable-level metadata (e.g., multiple sets of category labels +such as both "short labels" and longer "descriptions", or links to common data +elements, controlled vocabularies or rdfTypes). While these additional +metadata are undoubtedly useful, we speculate that the large majority of users +who would benefit from the extensions proposed here would not have and/or +utilize such information, and therefore argue that these should be considered +under a separate proposal. + +### Implementations + +We note that our proposal regarding field-specific missing values has been +discussed frequently in numerous contexts, and is nearly identical to the pattern +[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) +appearing in this document above. + +Our proposal to add a field-specific `ordered` property has been raised +[here](https://github.com/frictionlessdata/specs/issues/739) and +[here](https://github.com/frictionlessdata/specs/issues/156). 
+ +Discussions regarding supporting software providing features for working with +categorical variables appear in the following GitHub issues: + +- [https://github.com/frictionlessdata/specs/issues/156](https://github.com/frictionlessdata/specs/issues/156) +- [https://github.com/frictionlessdata/specs/issues/739](https://github.com/frictionlessdata/specs/issues/739) + +and in the Frictionless Data forum: + +- [https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/](https://discuss.okfn.org/t/can-you-add-code-descriptions-to-a-data-package/) +- [https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/](https://discuss.okfn.org/t/something-like-rs-ordered-factors-or-enums-as-column-type/) + +Finally, while we are unaware of any existing implementations intended for +general use, it is likely that many users are already exploiting the fact that +arbitrary fields may be added to the +[table schema](https://specs.frictionlessdata.io//table-schema/) +to support internal implementations (e.g., our group is doing so). + +### Proposed extensions + +We propose three extensions: + +1. Add an optional field-specific `missingValues` property. This is necessary + so that such values can be included in the definition of a categorical + (e.g., `["Yes", "No", "Don't know", "Refused"]`) or a value label, but + still ignored by software without such features. Note that unlike the + [missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) + pattern above, we propose that field-specific missing values be *added* to + the values appearing in the `missingValues` property at the resource level, + rather than replacing them. This is so that software can distinguish + between so-called *system missing values* (e.g., "Not applicable") and + other values that you may wish to include in certain tabulations/analyses + but exclude from others (e.g., "Don't know" or "Refused"). +2. 
Add an optional field-specific `ordered` property, which can be used when + contructing a categorical (or factor) to indicate that the variable is + ordinal. +3. Add an optional field-specific `encoding` property for use when data are + stored using integer or other codes rather than using the category labels. + This contains an object mapping the codes appearing in the data (keys) to + what they mean (values), and can be used by software to construct + corresponding value labels or categoricals (when supported) or to translate + the values when reading the data. + +As none of the three proposed properties is part of the current +[table schema](https://specs.frictionlessdata.io//table-schema/), the proposed +extensions are fully backward compatible. + +Here is an example using extensions (1) and (2): + +``` +{ + "fields": [ + { + "name": "physical_health", + "type": "string", + "constraints": { + "enum": [ + "Poor", + "Fair", + "Good", + "Very good", + "Excellent", + ] + } + "ordered": true + "missingValues": ["Don't know","Refused"] + } + ], + "missingValues": ["Not applicable","No answer"] +} +``` + +This is our preferred strategy, as it provides all of the information +necessary to support fully the categorical functionality of the software +packages listed above, while still yielding a useable result for software +without such capability. As described below, value labels or categoricals can +be created automatically based on the ordering of the values in the `enum` +array, and the field level `missingValues` can be incorporated into the value +labels or categoricals if desired. In those cases where it is desired to have +more control over how the value labels are constructed, this information can +be stored in a separate encodings file in JSON format or as part of a custom +extension to the table schema. 
Since such instructions do not describe the +data themselves (but only how a specific software package should handle them), +and since they are often software- and/or user-specific, we argue that they +should not be included in the official table schema. + +Alternatively, those who wish to store their data in encoded form (e.g., this +is the default for data exports from [REDCap](https://projectredcap.org), a +commonly-used platform for collecting data for clinical studies) may use +extension (3) to do so: + +``` +{ + "fields": [ + { + "name": "physical_health", + "type": "integer", + "enum": [1,2,3,4,5] + "ordered": true + "missingValues": ["Don't know","Refused"] + "encoding": { + "1": "Poor", + "2": "Fair", + "3": "Good", + "4": "Very good", + "5": "Excellent" + } + } + ], + "missingValues": ["Not applicable","No answer"] +} +``` + +Note that although the field type is `integer`, the keys in the encoding +object must be enclosed in double quotes because this is required by the JSON +specification. + +A second variant of the example above is the following: + +``` +{ + "fields": [ + { + "name": "physical_health", + "type": "integer", + "enum": [1,2,3,4,5] + "ordered": true + "missingValues": [".a",".b"] + "encoding": { + "1": "Poor", + "2": "Fair", + "3": "Good", + "4": "Very good", + "5": "Excellent", + ".a": "Don't know", + ".b": "Refused" + } + } + ], + "missingValues": ["."] +} +``` + +This represents encoded data exported from software with support for value +labels. The values `.a`, `.b`, etc. are known as *extended missing values* +(Stata and SAS only) and provide 26 unique missing values for numeric fields +(both integer and float) in addition to the system missing value ("`.`"); in +SPSS these would be replaced with designated numbers (e.g., -97, -98 and -99). 
+ +Note that one might argue that the encoding property should instead be +specified as: + +``` +{ + "encoding": { + "Poor": 1, + "Fair": 2, + "Good": 3, + "Very good": 4, + "Excellent": 5 +} +``` + +since that represents the encoding that has been applied to the data, and the +table in the example is what is now necessary to *decode* the data. However, +there are at least three arguments in favor of the proposed specification. +First, it is the way value labels are uniformly written (e.g., in Stata, SAS +and SPSS). Second, it automatically imposes the necessary constraint that the +codes are unique (since a JSON object's keys must be unique). Third, it +simplifies working with the encoding programmatically, since it can be read as +an associative array and then applied directly to decode to the data (e.g., +using `DataFrame.replace()` in Pandas). + +### Specification + +1. A field MAY have a `missingValues` property that MUST be an `array` where + each entry is a `string`. If not specified, each field shall inherit the + entries in the `missingValues` property at the level of the tabular data + resource. If present at both the field and resource levels, the + field level property will be replaced by the *union* of the two arrays, + with the values specified at the resource level appearing in the same order + *after* those specified at the field level. + +2. A field with an `enum` constraint or an `encoding` property MAY have an + `ordered` property that MUST be a boolean. A value of `true` indicates that + the field should be treated as having an ordinal scale of measurement, with + the ordering given by the order of the field's `enum` array or by the + lexical order of the `encoding` object's keys, with the latter taking + precedence. 
Fields without an `enum` constraint or an `encoding` property + or for which the encoding object's keys do not include all values observed + in the data (excluding any values specified in either the field level or + resource level `missingValues` property) SHOULD NOT have an `ordered` + property since in that case the correct ordering of the data is ambiguous. + The absence of an `ordered` property MUST NOT be taken to imply + `ordered: false`. + +3. A field MAY have an `encoding` property that MUST be an object. This + property SHOULD be used to indicate how the values in the data (represented + by the object's keys) are to be labeled or translated (represented by the + corresponding value). The object's keys MAY include values that do not + appear in the data and MAY omit some values that do appear in the data. For + clarity and to avoid unintentional loss of information, the object's values + SHOULD be unique. + +### Suggested implementations + +Note: The use cases below address only *reading data* from a Frictionless data +package; it is assumed that implementations will also provide the ability to +write Frictionless data packages using the schema extensions proposed above. +We suggest two types of implementations: + +1. Additions to the official Python Frictionless Framework to generate + software-specific scripts that may be executed by a specific software + package to read data from a Frictionless data package and create the + appropriate value labels or categoricals, as described below. These + scripts can then be included along with the data in the package itself. + +2. Software-specific extension packages that may be installed to permit users + of that software to read data from a Frictionless data package directly, + automatically creating the appropriate value labels or categoricals as + described below. + +The advantage of (1) is that it doesn't require users to install a package, +which may in some cases be difficult or impossible. 
The advantage of (2) is +that it provides native support for working with Frictionless data packages, +and may be both easier and faster once the package is installed. We are in the +process of implementing both approaches for Stata; implementations for the +other software listed above are straightforward. + +#### Software that supports value labels (Stata, SAS or SPSS) + +1. In cases where a field has an `enum` constraint but no `encoding` property, + automatically generate a value label mapping the integers 1, 2, 3, ... to + the `enum` values in order, use this to encode the field (thereby changing + its type from `string` to `integer`), and attach the value label to the + field. Provide option to skip automatically dropping field level + `missingValues` and instead add them in order to the end of the value label, + encoded using extended missing values if supported. + +2. In cases where the data are stored in encoded form (e.g., as integers) and + a corresponding `encoding` property is present, and assuming that the keys + in the encoding object are limited to integers and extended missing values + (if supported), use the `encoding` object to generate a value label and + attach it to the field. As with (1), provide option to skip automatically + dropping field level `missingValues` and instead add them in order to the + end of the value label, encoded using extended missing values if supported. + +3. Although none of Stata, SAS or SPSS currently permit designating a specific + variable as ordered, Stata permits attaching arbitrary metadata to + individual variables. Thus, in cases where the `ordered` property is + present, this information can be stored in Stata to inform the analyst and + to permit loss of information when generating Frictionless data packages + from within Stata. + +#### Software that supports categoricals or factors (Pandas, R, Julia) + +1. 
In cases where a field has an `enum` constraint but no `encoding` property, + automatically define a categorical or factor using the `enum` values in + order, and convert the variable to categorical or factor type using this + definition. Provide option to skip automatically dropping field level + `missingValues` and instead add them in order to the end of the `enum` + values when defining the categorical or factor. + +2. In cases where the data are stored in encoded form (e.g., as integers) and + a corresponding `encoding` property is present, translate the data using + the `encoding` object, define a categorical or factor using the values of + the `encoding` object in lexical order of the keys, and convert the + variable to categorical or factor type using this definition. Provide + option to skip automatically dropping field level `missingValues` and + instead add them to the end of the `encoding` values when defining the + categorical or factor. + +3. In cases where a field has an `ordered` property, use that when defining + the categorical or factor. + +#### All software + +Although the extensions proposed here are intended primarily to support the +use of value labels and categoricals in software that supports them, they also +provide additional functionality when reading data into any software that can +handle tabular data. Specifically: + +1. Field-specific `missingValues`, especially when combined with + `missingValues` at the tabular resource level, provide considerably more + flexibility in specifying missing values that can benefit reading + Frictionless data into any software. + +2. The `encoding` property may be used to support any type of encoding, even + in cases where value labels or categoricals are not being used. 
For example, + it is standard practice in software for analyzing genetic data to code sex + as 0, 1 and 2 (corresponding to "Unknown", "Male" and "Female") and + affection status as 0, 1 and 2 (corresponding to "Unknown", "Unaffected" + and "Affected"). In such cases, the `encoding` property may be used to + confirm that the data follow the standard convention or to indicate that + they deviate from it; it may also be used to translate those codes into + human-readable values, if desired. From d03ede0c869bb69f55f2d7a21bdd373949f0ae90 Mon Sep 17 00:00:00 2001 From: Phil Schumm Date: Sat, 12 Aug 2023 07:36:50 -0500 Subject: [PATCH 02/16] Changes suggested by reviewers @khughitt and @peterdesmet (thanks!) --- patterns/README.md | 160 +++++++++++++++++++-------------------------- 1 file changed, 69 insertions(+), 91 deletions(-) diff --git a/patterns/README.md b/patterns/README.md index c64262e5..eead571d 100644 --- a/patterns/README.md +++ b/patterns/README.md @@ -1099,7 +1099,7 @@ discussed frequently in numerous contexts, and is nearly identical to the patter [missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) appearing in this document above. -Our proposal to add a field-specific `ordered` property has been raised +Our proposal to add a field-specific `enumOrdered` property has been raised [here](https://github.com/frictionlessdata/specs/issues/739) and [here](https://github.com/frictionlessdata/specs/issues/156). @@ -1135,10 +1135,10 @@ We propose three extensions: between so-called *system missing values* (e.g., "Not applicable") and other values that you may wish to include in certain tabulations/analyses but exclude from others (e.g., "Don't know" or "Refused"). -2. Add an optional field-specific `ordered` property, which can be used when - contructing a categorical (or factor) to indicate that the variable is +2. 
Add an optional field-specific `enumOrdered` property, which can be used + when constructing a categorical (or factor) to indicate that the variable is ordinal. -3. Add an optional field-specific `encoding` property for use when data are +3. Add an optional field-specific `enumLabels` property for use when data are stored using integer or other codes rather than using the category labels. This contains an object mapping the codes appearing in the data (keys) to what they mean (values), and can be used by software to construct @@ -1166,7 +1166,7 @@ Here is an example using extensions (1) and (2): "Excellent", ] } - "ordered": true + "enumOrdered": true "missingValues": ["Don't know","Refused"] } ], @@ -1182,11 +1182,11 @@ be created automatically based on the ordering of the values in the `enum` array, and the field level `missingValues` can be incorporated into the value labels or categoricals if desired. In those cases where it is desired to have more control over how the value labels are constructed, this information can -be stored in a separate encodings file in JSON format or as part of a custom -extension to the table schema. Since such instructions do not describe the -data themselves (but only how a specific software package should handle them), -and since they are often software- and/or user-specific, we argue that they -should not be included in the official table schema. +be stored in a separate file in JSON format or as part of a custom extension +to the table schema. Since such instructions do not describe the data +themselves (but only how a specific software package should handle them), and +since they are often software- and/or user-specific, we argue that they should +not be included in the official table schema. 
Alternatively, those who wish to store their data in encoded form (e.g., this is the default for data exports from [REDCap](https://projectredcap.org), a @@ -1200,9 +1200,9 @@ extension (3) to do so: "name": "physical_health", "type": "integer", "enum": [1,2,3,4,5] - "ordered": true + "enumOrdered": true "missingValues": ["Don't know","Refused"] - "encoding": { + "enumLabels": { "1": "Poor", "2": "Fair", "3": "Good", @@ -1215,9 +1215,9 @@ extension (3) to do so: } ``` -Note that although the field type is `integer`, the keys in the encoding -object must be enclosed in double quotes because this is required by the JSON -specification. +Note that although the field type is `integer`, the keys in the `enumLabels` +object must be wrapped in double quotes because this is required by the JSON +file format. A second variant of the example above is the following: @@ -1228,9 +1228,9 @@ A second variant of the example above is the following: "name": "physical_health", "type": "integer", "enum": [1,2,3,4,5] - "ordered": true + "enumOrdered": true "missingValues": [".a",".b"] - "encoding": { + "enumLabels": { "1": "Poor", "2": "Fair", "3": "Good", @@ -1251,30 +1251,6 @@ labels. The values `.a`, `.b`, etc. are known as *extended missing values* (both integer and float) in addition to the system missing value ("`.`"); in SPSS these would be replaced with designated numbers (e.g., -97, -98 and -99). -Note that one might argue that the encoding property should instead be -specified as: - -``` -{ - "encoding": { - "Poor": 1, - "Fair": 2, - "Good": 3, - "Very good": 4, - "Excellent": 5 -} -``` - -since that represents the encoding that has been applied to the data, and the -table in the example is what is now necessary to *decode* the data. However, -there are at least three arguments in favor of the proposed specification. -First, it is the way value labels are uniformly written (e.g., in Stata, SAS -and SPSS). 
Second, it automatically imposes the necessary constraint that the -codes are unique (since a JSON object's keys must be unique). Third, it -simplifies working with the encoding programmatically, since it can be read as -an associative array and then applied directly to decode to the data (e.g., -using `DataFrame.replace()` in Pandas). - ### Specification 1. A field MAY have a `missingValues` property that MUST be an `array` where @@ -1285,26 +1261,27 @@ using `DataFrame.replace()` in Pandas). with the values specified at the resource level appearing in the same order *after* those specified at the field level. -2. A field with an `enum` constraint or an `encoding` property MAY have an - `ordered` property that MUST be a boolean. A value of `true` indicates that - the field should be treated as having an ordinal scale of measurement, with - the ordering given by the order of the field's `enum` array or by the - lexical order of the `encoding` object's keys, with the latter taking - precedence. Fields without an `enum` constraint or an `encoding` property - or for which the encoding object's keys do not include all values observed +2. A field with an `enum` constraint or an `enumLabels` property MAY have an + `enumOrdered` property that MUST be a boolean. A value of `true` indicates + that the field should be treated as having an ordinal scale of measurement, + with the ordering given by the order of the field's `enum` array or by the + lexical order of the `enumLabels` object's keys, with the latter taking + precedence. Fields without an `enum` constraint or an `enumLabels` property + or for which the `enumLabels` keys do not include all values observed in the data (excluding any values specified in either the field level or - resource level `missingValues` property) SHOULD NOT have an `ordered` + resource level `missingValues` property) MUST NOT have an `enumOrdered` property since in that case the correct ordering of the data is ambiguous. 
- The absence of an `ordered` property MUST NOT be taken to imply - `ordered: false`. + The absence of an `enumOrdered` property MUST NOT be taken to imply + `enumOrdered: false`. -3. A field MAY have an `encoding` property that MUST be an object. This +3. A field MAY have an `enumLabels` property that MUST be an object. This property SHOULD be used to indicate how the values in the data (represented by the object's keys) are to be labeled or translated (represented by the - corresponding value). The object's keys MAY include values that do not - appear in the data and MAY omit some values that do appear in the data. For - clarity and to avoid unintentional loss of information, the object's values - SHOULD be unique. + corresponding value). As required by the JSON format, the object's keys + must be listed as strings (i.e., wrapped in double quotes). The keys MAY + include values that do not appear in the data and MAY omit some values that + do appear in the data. For clarity and to avoid unintentional loss of + information, the object's values SHOULD be unique. ### Suggested implementations @@ -1333,49 +1310,50 @@ other software listed above are straightforward. #### Software that supports value labels (Stata, SAS or SPSS) -1. In cases where a field has an `enum` constraint but no `encoding` property, - automatically generate a value label mapping the integers 1, 2, 3, ... to - the `enum` values in order, use this to encode the field (thereby changing - its type from `string` to `integer`), and attach the value label to the - field. Provide option to skip automatically dropping field level - `missingValues` and instead add them in order to the end of the value label, - encoded using extended missing values if supported. +1. In cases where a field has an `enum` constraint but no `enumLabels` + property, automatically generate a value label mapping the integers 1, 2, + 3, ... 
to the `enum` values in order, use this to encode the field (thereby + changing its type from `string` to `integer`), and attach the value label + to the field. Provide option to skip automatically dropping field level + `missingValues` and instead add them in order to the end of the value + label, encoded using extended missing values if supported. 2. In cases where the data are stored in encoded form (e.g., as integers) and - a corresponding `encoding` property is present, and assuming that the keys - in the encoding object are limited to integers and extended missing values - (if supported), use the `encoding` object to generate a value label and - attach it to the field. As with (1), provide option to skip automatically - dropping field level `missingValues` and instead add them in order to the - end of the value label, encoded using extended missing values if supported. + a corresponding `enumLabels` property is present, and assuming that the + keys in the `enumLabels` object are limited to integers and extended + missing values (if supported), use the `enumLabels` object to generate a + value label and attach it to the field. As with (1), provide option to skip + automatically dropping field level `missingValues` and instead add them in + order to the end of the value label, encoded using extended missing values + if supported. 3. Although none of Stata, SAS or SPSS currently permit designating a specific variable as ordered, Stata permits attaching arbitrary metadata to - individual variables. Thus, in cases where the `ordered` property is + individual variables. Thus, in cases where the `enumOrdered` property is present, this information can be stored in Stata to inform the analyst and to permit loss of information when generating Frictionless data packages from within Stata. #### Software that supports categoricals or factors (Pandas, R, Julia) -1. 
In cases where a field has an `enum` constraint but no `encoding` property, - automatically define a categorical or factor using the `enum` values in - order, and convert the variable to categorical or factor type using this - definition. Provide option to skip automatically dropping field level - `missingValues` and instead add them in order to the end of the `enum` - values when defining the categorical or factor. +1. In cases where a field has an `enum` constraint but no `enumLabels` + property, automatically define a categorical or factor using the `enum` + values in order, and convert the variable to categorical or factor type + using this definition. Provide option to skip automatically dropping field + level `missingValues` and instead add them in order to the end of the + `enum` values when defining the categorical or factor. 2. In cases where the data are stored in encoded form (e.g., as integers) and - a corresponding `encoding` property is present, translate the data using - the `encoding` object, define a categorical or factor using the values of - the `encoding` object in lexical order of the keys, and convert the + a corresponding `enumLabels` property is present, translate the data using + the `enumLabels` object, define a categorical or factor using the values of + the `enumLabels` object in lexical order of the keys, and convert the variable to categorical or factor type using this definition. Provide option to skip automatically dropping field level `missingValues` and - instead add them to the end of the `encoding` values when defining the + instead add them to the end of the `enumLabels` values when defining the categorical or factor. -3. In cases where a field has an `ordered` property, use that when defining - the categorical or factor. +3. In cases where a field has an `enumOrdered` property, use that when + defining the categorical or factor. #### All software @@ -1389,12 +1367,12 @@ handle tabular data. 
Specifically: flexibility in specifying missing values that can benefit reading Frictionless data into any software. -2. The `encoding` property may be used to support any type of encoding, even - in cases where value labels or categoricals are not being used. For example, - it is standard practice in software for analyzing genetic data to code sex - as 0, 1 and 2 (corresponding to "Unknown", "Male" and "Female") and - affection status as 0, 1 and 2 (corresponding to "Unknown", "Unaffected" - and "Affected"). In such cases, the `encoding` property may be used to - confirm that the data follow the standard convention or to indicate that - they deviate from it; it may also be used to translate those codes into - human-readable values, if desired. +2. The `enumLabels` property may be used to support the use of enums even + in cases where value labels or categoricals are not being used. For + example, it is standard practice in software for analyzing genetic data to + code sex as 0, 1 and 2 (corresponding to "Unknown", "Male" and "Female") + and affection status as 0, 1 and 2 (corresponding to "Unknown", + "Unaffected" and "Affected"). In such cases, the `enumLabels` property may + be used to confirm that the data follow the standard convention or to + indicate that they deviate from it; it may also be used to translate those + codes into human-readable values, if desired. 
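Editorial sketch (not part of the patch series): the reading behavior described in the Specification section of PATCH 02 could be illustrated roughly as follows, using only the Python standard library. The function names `effective_missing_values`, `decode_column` and `category_order` are hypothetical helpers invented for this illustration and are not part of any Frictionless library. The sketch assumes cells arrive as raw strings, that missing values are merged as a union with field-level entries first, and that category order follows the lexical order of the `enumLabels` keys, as the text states.

```python
# Illustrative sketch only. All helper names here are hypothetical and are
# not part of the Frictionless Framework or any other library.

def effective_missing_values(field, resource):
    """Union of field- and resource-level missingValues, field-level first."""
    merged = list(field.get("missingValues", []))
    for value in resource.get("missingValues", []):
        if value not in merged:
            merged.append(value)
    return merged


def decode_column(raw_values, field, resource):
    """Translate raw string cells to labels; missing values become None."""
    missing = set(effective_missing_values(field, resource))
    labels = field.get("enumLabels", {})
    decoded = []
    for cell in raw_values:
        if cell in missing:
            decoded.append(None)
        else:
            # enumLabels may omit values that appear in the data, so codes
            # without a label are passed through unchanged.
            decoded.append(labels.get(cell, cell))
    return decoded


def category_order(field, resource):
    """Labels in lexical order of the enumLabels keys, excluding missing codes."""
    missing = set(effective_missing_values(field, resource))
    labels = field.get("enumLabels", {})
    return [labels[key] for key in sorted(labels) if key not in missing]


# Field and resource descriptors mirroring the second enumLabels example.
field = {
    "name": "physical_health",
    "type": "integer",
    "enum": [1, 2, 3, 4, 5],
    "enumOrdered": True,
    "enumLabels": {
        "1": "Poor", "2": "Fair", "3": "Good", "4": "Very good",
        "5": "Excellent", ".a": "Don't know", ".b": "Refused",
    },
    "missingValues": [".a", ".b"],
}
resource = {"missingValues": ["."]}

print(effective_missing_values(field, resource))
print(decode_column(["3", ".a", "5", ".", "1"], field, resource))
print(category_order(field, resource))
```

Note that a plain lexical sort of the keys would place `"10"` before `"2"`, so an implementation handling more than nine integer codes may need a different comparison; that choice is left open here because the text specifies only lexical order.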
From ed2e9782f3b2eaac43a3011c5721d3b753d87332 Mon Sep 17 00:00:00 2001 From: Phil Schumm Date: Sat, 12 Aug 2023 08:09:21 -0500 Subject: [PATCH 03/16] Group enum-related properties next to each other --- patterns/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/patterns/README.md b/patterns/README.md index eead571d..fc08eb9a 100644 --- a/patterns/README.md +++ b/patterns/README.md @@ -1201,7 +1201,6 @@ extension (3) to do so: "type": "integer", "enum": [1,2,3,4,5] "enumOrdered": true - "missingValues": ["Don't know","Refused"] "enumLabels": { "1": "Poor", "2": "Fair", @@ -1209,6 +1208,7 @@ extension (3) to do so: "4": "Very good", "5": "Excellent" } + "missingValues": ["Don't know","Refused"] } ], "missingValues": ["Not applicable","No answer"] @@ -1229,7 +1229,6 @@ A second variant of the example above is the following: "type": "integer", "enum": [1,2,3,4,5] "enumOrdered": true - "missingValues": [".a",".b"] "enumLabels": { "1": "Poor", "2": "Fair", @@ -1239,6 +1238,7 @@ A second variant of the example above is the following: ".a": "Don't know", ".b": "Refused" } + "missingValues": [".a",".b"] } ], "missingValues": ["."] From b3aa02f39c6dcd07f3efb7224be0d8684a224994 Mon Sep 17 00:00:00 2001 From: Phil Schumm Date: Sat, 16 Sep 2023 16:48:04 -0500 Subject: [PATCH 04/16] Clarify proposed use of field-specific missingValues property --- patterns/README.md | 42 ++++++++++++++++++++++++++++++------------ 1 file changed, 30 insertions(+), 12 deletions(-) diff --git a/patterns/README.md b/patterns/README.md index fc08eb9a..f5d102a5 100644 --- a/patterns/README.md +++ b/patterns/README.md @@ -1095,7 +1095,7 @@ under a separate proposal. 
### Implementations We note that our proposal regarding field-specific missing values has been -discussed frequently in numerous contexts, and is nearly identical to the pattern +discussed frequently in numerous contexts, and is similar to the pattern [missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) appearing in this document above. @@ -1124,17 +1124,20 @@ to support internal implementations (e.g., our group is doing so). We propose three extensions: -1. Add an optional field-specific `missingValues` property. This is necessary - so that such values can be included in the definition of a categorical - (e.g., `["Yes", "No", "Don't know", "Refused"]`) or a value label, but - still ignored by software without such features. Note that unlike the - [missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) - pattern above, we propose that field-specific missing values be *added* to - the values appearing in the `missingValues` property at the resource level, - rather than replacing them. This is so that software can distinguish - between so-called *system missing values* (e.g., "Not applicable") and - other values that you may wish to include in certain tabulations/analyses - but exclude from others (e.g., "Don't know" or "Refused"). +1. Add an optional field-specific `missingValues` property. This may be used + to distinguish between so-called *system missing values* (which may be + listed in the resource level `missingValues` property) and other values + that convey meaning but are typically excluded when fitting statistical + models (which may be specified in the field-specific `missingValues` + property). The latter may be represented by *extended missing values* + (`.a`, `.b`, `.c`, etc.) in Stata and SAS, or by negative integers that are + then designated as missing (e.g., by using the `MISSING VALUES` command in + SPSS). 
For example, values such as "NA", "Not applicable", ".", etc. may be + specified in the resource level `missingValues` property, while values such + as "Don't know" and "Refused"—often used when generating tabular summaries + and occasionally used when fitting certain statistical models—may be + specified in the corresponding field level `missingValues` property. See + notes regarding this property below. 2. Add an optional field-specific `enumOrdered` property, which can be used when contructing a categorical (or factor) to indicate that the variable is ordinal. @@ -1145,6 +1148,21 @@ We propose three extensions: corresponding value labels or categoricals (when supported) or to translate the values when reading the data. +Important notes regarding field-specific `missingValues` property: + +1. Note that although resource level `missingValues` are converted to `null` + before type-specific string conversion, we are *not* proposing that all + software be immediately required to do the same for field-specific + `missingValues`. This is to ensure backward compatability with current + software. See [this comment](https://github.com/frictionlessdata/specs/pull/844#discussion_r1291539445) + and the surrounding discussion for more details. +2. Unlike the [missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) + pattern above, we propose that software that does implement field-specific + missing values *add* them to the values appearing in the `missingValues` + property at the resource level (i.e., `missingValues` cascade), rather than + replacing them (see discussion in favor of cascading + [here](https://github.com/frictionlessdata/specs/issues/551)). + As none of the three proposed properties is part of the current [table schema](https://specs.frictionlessdata.io//table-schema/), the proposed extensions are fully backward compatible. 
From 1a77c09560290443d4c6e5d0f5b7b4968ec4ac69 Mon Sep 17 00:00:00 2001 From: Phil Schumm Date: Sun, 17 Sep 2023 08:51:06 -0500 Subject: [PATCH 05/16] Remove field-specific missingValues property --- patterns/README.md | 248 ++++++++++++++++++++++----------------------- 1 file changed, 122 insertions(+), 126 deletions(-) diff --git a/patterns/README.md b/patterns/README.md index f5d102a5..d578a9ee 100644 --- a/patterns/README.md +++ b/patterns/README.md @@ -1057,7 +1057,7 @@ While these features are of limited use in some disciplines, others rely heavily on them (e.g., social sciences, epidemiology, clinical research, etc.). Thus, before these disciplines can begin to use Frictionless in a meaningful way, both the standards and the software tools need to support -these features. This pattern addresses the necessary extensions to the +these features. This pattern addresses necessary extensions to the [table schema](https://specs.frictionlessdata.io//table-schema/). ### Principles @@ -1075,30 +1075,23 @@ they are based: do with the data. Users who want to include the latter may do so within a sub-namespace such as `custom` (e.g., see Issues [#103](https://github.com/frictionlessdata/specs/issues/103) and [#663](https://github.com/frictionlessdata/specs/issues/663)). -3. Extensions should be feature-complete (i.e., they should permit full - support of value labels, categoricals and factors by software tools). -4. Extensions must be backward compatible (i.e., not break existing tools, +3. Extensions must be backward compatible (i.e., not break existing tools, workflows, etc. for working with Frictionless packages). It is worth emphasizing that the scope of the proposed extensions is strictly -limited to the information necessaary to make full use of the features for -working with categorical data provided by the software packages listed above. 
-Previous discussions of this issue have occasionally included references to -additional variable-level metadata (e.g., multiple sets of category labels -such as both "short labels" and longer "descriptions", or links to common data -elements, controlled vocabularies or rdfTypes). While these additional -metadata are undoubtedly useful, we speculate that the large majority of users -who would benefit from the extensions propopsed here would not have and/or -utilize such information, and therefore argue that these should be considered -under a separate proposal. +limited to the information necessaary to make use of the features for working +with categorical data provided by the software packages listed above. Previous +discussions of this issue have occasionally included references to additional +variable-level metadata (e.g., multiple sets of category labels such as both +"short labels" and longer "descriptions", or links to common data elements, +controlled vocabularies or rdfTypes). While these additional metadata are +undoubtedly useful, we speculate that the large majority of users who would +benefit from the extensions propopsed here would not have and/or utilize such +information, and therefore argue that these should be considered under a +separate proposal. ### Implementations -We note that our proposal regarding field-specific missing values has been -discussed frequently in numerous contexts, and is similar to the pattern -[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) -appearing in this document above. - Our proposal to add a field-specific `enumOrdered` property has been raised [here](https://github.com/frictionlessdata/specs/issues/739) and [here](https://github.com/frictionlessdata/specs/issues/156). @@ -1122,52 +1115,23 @@ to support internal implementations (e.g., our group is doing so). ### Proposed extensions -We propose three extensions: - -1. 
Add an optional field-specific `missingValues` property. This may be used - to distinguish between so-called *system missing values* (which may be - listed in the resource level `missingValues` property) and other values - that convey meaning but are typically excluded when fitting statistical - models (which may be specified in the field-specific `missingValues` - property). The latter may be represented by *extended missing values* - (`.a`, `.b`, `.c`, etc.) in Stata and SAS, or by negative integers that are - then designated as missing (e.g., by using the `MISSING VALUES` command in - SPSS). For example, values such as "NA", "Not applicable", ".", etc. may be - specified in the resource level `missingValues` property, while values such - as "Don't know" and "Refused"—often used when generating tabular summaries - and occasionally used when fitting certain statistical models—may be - specified in the corresponding field level `missingValues` property. See - notes regarding this property below. -2. Add an optional field-specific `enumOrdered` property, which can be used +We propose two extensions: + +1. Add an optional field-specific `enumOrdered` property, which can be used when contructing a categorical (or factor) to indicate that the variable is ordinal. -3. Add an optional field-specific `enumLabels` property for use when data are +2. Add an optional field-specific `enumLabels` property for use when data are stored using integer or other codes rather than using the category labels. This contains an object mapping the codes appearing in the data (keys) to what they mean (values), and can be used by software to construct corresponding value labels or categoricals (when supported) or to translate the values when reading the data. -Important notes regarding field-specific `missingValues` property: - -1. 
Note that although resource level `missingValues` are converted to `null` - before type-specific string conversion, we are *not* proposing that all - software be immediately required to do the same for field-specific - `missingValues`. This is to ensure backward compatability with current - software. See [this comment](https://github.com/frictionlessdata/specs/pull/844#discussion_r1291539445) - and the surrounding discussion for more details. -2. Unlike the [missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field) - pattern above, we propose that software that does implement field-specific - missing values *add* them to the values appearing in the `missingValues` - property at the resource level (i.e., `missingValues` cascade), rather than - replacing them (see discussion in favor of cascading - [here](https://github.com/frictionlessdata/specs/issues/551)). - -As none of the three proposed properties is part of the current +As neither of these proposed properties is part of the current [table schema](https://specs.frictionlessdata.io//table-schema/), the proposed extensions are fully backward compatible. -Here is an example using extensions (1) and (2): +Here is an example of a categorical variable using extension (1): ``` { @@ -1185,31 +1149,30 @@ Here is an example using extensions (1) and (2): ] } "enumOrdered": true - "missingValues": ["Don't know","Refused"] } ], - "missingValues": ["Not applicable","No answer"] + "missingValues": ["Don't know","Refused","Not applicable"] } ``` This is our preferred strategy, as it provides all of the information -necessary to support fully the categorical functionality of the software -packages listed above, while still yielding a useable result for software -without such capability. 
As described below, value labels or categoricals can -be created automatically based on the ordering of the values in the `enum` -array, and the field level `missingValues` can be incorporated into the value -labels or categoricals if desired. In those cases where it is desired to have -more control over how the value labels are constructed, this information can -be stored in a separate file in JSON format or as part of a custom extension -to the table schema. Since such instructions do not describe the data -themselves (but only how a specific software package should handle them), and -since they are often software- and/or user-specific, we argue that they should -not be included in the official table schema. +necessary to support the categorical functionality of the software packages +listed above, while still yielding a useable result for software without such +capability. As described below, value labels or categoricals can be created +automatically based on the ordering of the values in the `enum` array, and the +`missingValues` can be incorporated into the value labels or categoricals if +desired. In those cases where it is desired to have more control over how the +value labels are constructed, this information can be stored in a separate +file in JSON format or as part of a custom extension to the table schema. +Since such instructions do not describe the data themselves (but only how a +specific software package should handle them), and since they are often +software- and/or user-specific, we argue that they should not be included in +the official table schema. 
Alternatively, those who wish to store their data in encoded form (e.g., this is the default for data exports from [REDCap](https://projectredcap.org), a commonly-used platform for collecting data for clinical studies) may use -extension (3) to do so: +extension (2) to do so: ``` { @@ -1226,10 +1189,9 @@ extension (3) to do so: "4": "Very good", "5": "Excellent" } - "missingValues": ["Don't know","Refused"] } ], - "missingValues": ["Not applicable","No answer"] + "missingValues": ["Don't know","Refused","Not applicable"] } ``` @@ -1254,12 +1216,12 @@ A second variant of the example above is the following: "4": "Very good", "5": "Excellent", ".a": "Don't know", - ".b": "Refused" + ".b": "Refused", + ".c": "Not applicable" } - "missingValues": [".a",".b"] } ], - "missingValues": ["."] + "missingValues": [".a",".b",".c"] } ``` @@ -1267,32 +1229,24 @@ This represents encoded data exported from software with support for value labels. The values `.a`, `.b`, etc. are known as *extended missing values* (Stata and SAS only) and provide 26 unique missing values for numeric fields (both integer and float) in addition to the system missing value ("`.`"); in -SPSS these would be replaced with designated numbers (e.g., -97, -98 and -99). +SPSS these would be replaced with specially designated integers, typically +negative (e.g., -97, -98 and -99). ### Specification -1. A field MAY have a `missingValues` property that MUST be an `array` where - each entry is a `string`. If not specified, each field shall inherit the - entries in the `missingValues` property at the level of the tabular data - resource. If present at both the field and resource levels, the - field level property will be replaced by the *union* of the two arrays, - with the values specified at the resource level appearing in the same order - *after* those specified at the field level. - -2. A field with an `enum` constraint or an `enumLabels` property MAY have an +1. 
A field with an `enum` constraint or an `enumLabels` property MAY have an `enumOrdered` property that MUST be a boolean. A value of `true` indicates that the field should be treated as having an ordinal scale of measurement, with the ordering given by the order of the field's `enum` array or by the lexical order of the `enumLabels` object's keys, with the latter taking precedence. Fields without an `enum` constraint or an `enumLabels` property or for which the `enumLabels` keys do not include all values observed - in the data (excluding any values specified in either the field level or - resource level `missingValues` property) MUST NOT have an `enumOrdered` - property since in that case the correct ordering of the data is ambiguous. - The absence of an `enumOrdered` property MUST NOT be taken to imply - `enumOrdered: false`. + in the data (excluding any values specified in the `missingValues` + property) MUST NOT have an `enumOrdered` property since in that case the + correct ordering of the data is ambiguous. The absence of an `enumOrdered` + property MUST NOT be taken to imply `enumOrdered: false`. -3. A field MAY have an `enumLabels` property that MUST be an object. This +2. A field MAY have an `enumLabels` property that MUST be an object. This property SHOULD be used to indicate how the values in the data (represented by the object's keys) are to be labeled or translated (represented by the corresponding value). As required by the JSON format, the object's keys @@ -1319,12 +1273,12 @@ We suggest two types of implementations: automatically creating the appropriate value labels or categoricals as described below. -The advantage of (1) is that it doesn't require users to install a package, -which may in some cases be difficult or impossible. The advantage of (2) is -that it provides native support for working with Frictionless data packages, -and may be both easier and faster once the package is installed. 
We are in the -process of implementing both approaches for Stata; implementations for the -other software listed above are straightforward. +The advantage of (1) is that it doesn't require users to install another +software package, which may in some cases be difficult or impossible. The +advantage of (2) is that it provides native support for working with +Frictionless data packages, and may be both easier and faster once the package +is installed. We are in the process of implementing both approaches for Stata; +implementations for the other software listed above are straightforward. #### Software that supports value labels (Stata, SAS or SPSS) @@ -1332,24 +1286,25 @@ other software listed above are straightforward. property, automatically generate a value label mapping the integers 1, 2, 3, ... to the `enum` values in order, use this to encode the field (thereby changing its type from `string` to `integer`), and attach the value label - to the field. Provide option to skip automatically dropping field level - `missingValues` and instead add them in order to the end of the value - label, encoded using extended missing values if supported. + to the field. Provide option to skip automatically dropping values + specified in the `missingValues` property and instead add them in order to + the end of the value label, encoded using extended missing values if + supported. 2. In cases where the data are stored in encoded form (e.g., as integers) and a corresponding `enumLabels` property is present, and assuming that the keys in the `enumLabels` object are limited to integers and extended missing values (if supported), use the `enumLabels` object to generate a value label and attach it to the field. As with (1), provide option to skip - automatically dropping field level `missingValues` and instead add them in - order to the end of the value label, encoded using extended missing values - if supported. 
+ automatically dropping values specified in the `missingValues` property and + instead add them in order to the end of the value label, encoded using + extended missing values if supported. -3. Although none of Stata, SAS or SPSS currently permit designating a specific +3. Although none of Stata, SAS or SPSS currently permits designating a specific variable as ordered, Stata permits attaching arbitrary metadata to individual variables. Thus, in cases where the `enumOrdered` property is present, this information can be stored in Stata to inform the analyst and - to permit loss of information when generating Frictionless data packages + to prevent loss of information when generating Frictionless data packages from within Stata. #### Software that supports categoricals or factors (Pandas, R, Julia) @@ -1357,18 +1312,18 @@ other software listed above are straightforward. 1. In cases where a field has an `enum` constraint but no `enumLabels` property, automatically define a categorical or factor using the `enum` values in order, and convert the variable to categorical or factor type - using this definition. Provide option to skip automatically dropping field - level `missingValues` and instead add them in order to the end of the - `enum` values when defining the categorical or factor. + using this definition. Provide option to skip automatically dropping values + specified in the `missingValues` property and instead add them in order to + the end of the `enum` values when defining the categorical or factor. 2. In cases where the data are stored in encoded form (e.g., as integers) and a corresponding `enumLabels` property is present, translate the data using the `enumLabels` object, define a categorical or factor using the values of the `enumLabels` object in lexical order of the keys, and convert the variable to categorical or factor type using this definition. 
Provide - option to skip automatically dropping field level `missingValues` and - instead add them to the end of the `enumLabels` values when defining the - categorical or factor. + option to skip automatically dropping values specified in the + `missingValues` property and instead add them to the end of the + `enumLabels` values when defining the categorical or factor. 3. In cases where a field has an `enumOrdered` property, use that when defining the categorical or factor. @@ -1378,19 +1333,60 @@ other software listed above are straightforward. Although the extensions proposed here are intended primarily to support the use of value labels and categoricals in software that supports them, they also provide additional functionality when reading data into any software that can -handle tabular data. Specifically: - -1. Field-specific `missingValues`, especially when combined with - `missingValues` at the tabular resource level, provide considerably more - flexibility in specifying missing values that can benefit reading - Frictionless data into any software. - -2. The `enumLabels` property may be used to support the use of enums even - in cases where value labels or categoricals are not being used. For - example, it is standard practice in software for analyzing genetic data to - code sex as 0, 1 and 2 (corresponding to "Unknown", "Male" and "Female") - and affection status as 0, 1 and 2 (corresponding to "Unknown", - "Unaffected" and "Affected"). In such cases, the `enumLabels` property may - be used to confirm that the data follow the standard convention or to - indicate that they deviate from it; it may also be used to translate those - codes into human-readable values, if desired. +handle tabular data. Specifically, the `enumLabels` property may be used to +support the use of enums even in cases where value labels or categoricals are +not being used. 
For example, it is standard practice in software for analyzing +genetic data to code sex as 0, 1 and 2 (corresponding to "Unknown", "Male" and +"Female") and affection status as 0, 1 and 2 (corresponding to "Unknown", +"Unaffected" and "Affected"). In such cases, the `enumLabels` property may be +used to confirm that the data follow the standard convention or to indicate +that they deviate from it; it may also be used to translate those codes into +human-readable values, if desired. + +### Notes + +This pattern originally included a proposal to add an optional field-specific +`missingValues` property similar to that described in the pattern +"[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)" +appearing in this document above. The objective was to provide a mechnanism to +distinguish between so-called *system missing values* (i.e., values that +indicate only that the corresponding data are missing) and other values that +convey meaning but are typically excluded when fitting statistical models. The +latter may be represented by *extended missing values* (`.a`, `.b`, `.c`, +etc.) in Stata and SAS, or in SPSS by negative integers that are then +designated as missing by using the `MISSING VALUES` command. For example, +values such as "NA", "Not applicable", ".", etc. could be specified in the +resource level `missingValues` property, while values such as "Don't know" and +"Refused"—often used when generating tabular summaries and occasionally used +when fitting certain statistical models—could be specified in the +corresponding field level `missingValues` property. The former would still be +converted to `null` before type-specific string conversion (just as they are +now), while the latter could be used by capable software when creating value +labels or categoricals. 
+ +While this proposal was consistent with the principles outlined at the +beginning (in particular, existing software would still yield a usable result +when reading the data), we realized that it would conflict with what appears +to be an emerging consensus regarding field-specific `missingValues`; i.e., +that they should *replace* the less specific resource level `missingValues` +for the corresponding field rather than be combined with them (see the discussion +[here](https://github.com/frictionlessdata/specs/issues/551) as well as the +"[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)" +pattern above). While there is good reason for replacing rather than combining +here (e.g., it is more explicit), it would unfortunately conflict with the +idea of using the field-specific `missingValues` in conjunction with the +resource level `missingValues` as just described; namely, if the +field-specific property replaced the resource level property then the system +missing values would no longer be converted to `null`, as desired. + +For this reason, we have dropped the proposal to add a field-specific +`missingValues` property from this pattern, and assert that implementation of +this pattern by software should assume that if a field-specific `missingValues` +property is added to the +[table schema](https://specs.frictionlessdata.io//table-schema/) +it should, if present, replace the resource level `missingValues` property for +the corresponding field. We do not believe that this change represents a +substantial limitation when creating value labels or categoricals, since +system missing values can typically be easily distinguished from other missing +values when exported in CSV format (e.g., "." in Stata or SAS, "NA" in R, or +"" in Pandas). 
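The behaviour suggested in the patch above for software with value labels (a field with an `enum` constraint and no `enumLabels`) can be sketched roughly as follows. This is plain Python rather than Stata, SAS or SPSS code; the function names, the `keep_missing` option, and the example category values are all assumptions for illustration. Extended missing codes are modelled as the strings `.a`, `.b`, and so on.

```python
import string


def build_value_label(enum, missing_values=(), keep_missing=False):
    """Map the integers 1, 2, 3, ... to the enum values in order.

    If keep_missing is True, the missingValues entries are appended in
    order using Stata/SAS-style extended missing codes (.a, .b, ...);
    otherwise they are left out of the label.
    """
    label = {i: value for i, value in enumerate(enum, start=1)}
    if keep_missing:
        if len(missing_values) > 26:
            raise ValueError("only 26 extended missing values (.a-.z) exist")
        for letter, value in zip(string.ascii_lowercase, missing_values):
            label["." + letter] = value
    return label


def encode(values, label):
    """Encode string data with the value label; unlabeled values become None."""
    codes = {text: code for code, text in label.items()}
    return [codes.get(v) for v in values]


# Illustrative values only; not taken verbatim from the schema examples.
enum = ["Poor", "Fair", "Good", "Very good", "Excellent"]
missing = ["Don't know", "Refused", "Not applicable"]

label = build_value_label(enum, missing, keep_missing=True)
print(encode(["Good", "Refused", "Excellent", ""], label))
# [3, '.b', 5, None]
```

With `keep_missing=False` the same call would encode "Refused" as `None`, which corresponds to the default of dropping values listed in `missingValues`.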
From 1df21187679ba4a5fb0332e4a14861e7531bb656 Mon Sep 17 00:00:00 2001 From: Phil Schumm Date: Sat, 30 Sep 2023 06:00:31 -0500 Subject: [PATCH 06/16] Update patterns/README.md Co-authored-by: Peter Desmet --- patterns/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/patterns/README.md b/patterns/README.md index d578a9ee..758d5e9f 100644 --- a/patterns/README.md +++ b/patterns/README.md @@ -1031,7 +1031,7 @@ A field MAY have a `missingValues` property that MUST be an `array` where each e None known. -## Facilitate use of value labels (Stata, SAS and SPSS), categoricals (Python) and factors (R) in software that supports them +## Table Schema: Enum labels and ordering ### Overview From 04dc0704c4e2af9e3d39b43f1d44c3b51fbeb1c6 Mon Sep 17 00:00:00 2001 From: Phil Schumm Date: Sat, 30 Sep 2023 06:03:19 -0500 Subject: [PATCH 07/16] Update patterns/README.md Co-authored-by: Peter Desmet --- patterns/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/patterns/README.md b/patterns/README.md index 758d5e9f..4ddb1aff 100644 --- a/patterns/README.md +++ b/patterns/README.md @@ -1058,7 +1058,7 @@ heavily on them (e.g., social sciences, epidemiology, clinical research, etc.). Thus, before these disciplines can begin to use Frictionless in a meaningful way, both the standards and the software tools need to support these features. This pattern addresses necessary extensions to the -[table schema](https://specs.frictionlessdata.io//table-schema/). +[Table Schema](https://specs.frictionlessdata.io//table-schema/). 
### Principles

From 422c6b84dbe4756c0a47897b29f2925f1071f6e2 Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 30 Sep 2023 06:03:48 -0500
Subject: [PATCH 08/16] Update patterns/README.md

Co-authored-by: Peter Desmet
---
 patterns/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/patterns/README.md b/patterns/README.md
index 4ddb1aff..7f3ada3f 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1079,7 +1079,7 @@ they are based:
    workflows, etc. for working with Frictionless packages).
 
 It is worth emphasizing that the scope of the proposed extensions is strictly
-limited to the information necessaary to make use of the features for working
+limited to the information necessary to make use of the features for working
 with categorical data provided by the software packages listed above. Previous
 discussions of this issue have occasionally included references to additional
 variable-level metadata (e.g., multiple sets of category labels such as both

From 0510db338d47c23bbcd9e2cdfd37aad7684fe27e Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 30 Sep 2023 06:04:44 -0500
Subject: [PATCH 09/16] Update patterns/README.md

Co-authored-by: Peter Desmet
---
 patterns/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/patterns/README.md b/patterns/README.md
index 7f3ada3f..6f9649fa 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1115,7 +1115,7 @@ to support internal implementations (e.g., our group is doing so).
 
 ### Proposed extensions
 
-We propose two extensions:
+We propose two extensions to [Table Schema](https://specs.frictionlessdata.io/table-schema/):
 
 1. Add an optional field-specific `enumOrdered` property, which can be used
    when contructing a categorical (or factor) to indicate that the variable is

From 6dd857ac4d0fc22caae71074f4518310dc2c5a10 Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 30 Sep 2023 06:05:12 -0500
Subject: [PATCH 10/16] Update patterns/README.md

Co-authored-by: Peter Desmet
---
 patterns/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/patterns/README.md b/patterns/README.md
index 6f9649fa..b1ff84be 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1127,7 +1127,7 @@ We propose two extensions to [Table Schema](https://specs.frictionlessdata.io/ta
    corresponding value labels or categoricals (when supported) or to translate
    the values when reading the data.
 
-As neither of these proposed properties is part of the current
+These extensions are fully backward compatible, since they are optional and
 [table schema](https://specs.frictionlessdata.io//table-schema/), the proposed
 extensions are fully backward compatible.
 

From 5291b6d48c0cbb0039d7fe4b4aa0094612ec2686 Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 30 Sep 2023 06:05:35 -0500
Subject: [PATCH 11/16] Update patterns/README.md

Co-authored-by: Peter Desmet
---
 patterns/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/patterns/README.md b/patterns/README.md
index b1ff84be..a2ec0b56 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1128,7 +1128,7 @@ We propose two extensions to [Table Schema](https://specs.frictionlessdata.io/ta
    the values when reading the data.
 
 These extensions are fully backward compatible, since they are optional and
-[table schema](https://specs.frictionlessdata.io//table-schema/), the proposed
+not providing them is valid.
 extensions are fully backward compatible.
 
 Here is an example of a categorical variable using extension (1):

From a185de40266b6521ee1c9717699449c09bedf7d1 Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 30 Sep 2023 06:05:51 -0500
Subject: [PATCH 12/16] Update patterns/README.md

Co-authored-by: Peter Desmet
---
 patterns/README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/patterns/README.md b/patterns/README.md
index a2ec0b56..4309a8f8 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1129,7 +1129,6 @@ We propose two extensions to [Table Schema](https://specs.frictionlessdata.io/ta
 
 These extensions are fully backward compatible, since they are optional and
 not providing them is valid.
-extensions are fully backward compatible.
 
 Here is an example of a categorical variable using extension (1):
 

From e38433dbe4fefe25498401e871231f83ac4c03c6 Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 30 Sep 2023 06:06:37 -0500
Subject: [PATCH 13/16] Update patterns/README.md

Co-authored-by: Peter Desmet
---
 patterns/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/patterns/README.md b/patterns/README.md
index 4309a8f8..015edaf9 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1344,6 +1344,8 @@ human-readable values, if desired.
 
 ### Notes
 
+While this pattern is designed as an extension to [Table Schema](https://specs.frictionlessdata.io/table-schema/) fields, it could also be used to document `enum` values of properties in [profiles](https://specs.frictionlessdata.io/profiles/), such as contributor roles.
+
 This pattern originally included a proposal to add an optional field-specific
 `missingValues` property similar to that described in the pattern
 "[missing values per field](https://specs.frictionlessdata.io/patterns/#missing-values-per-field)"

From 5d82cf7177d1a043b9b38dd85a2a775a0c4ecc43 Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 30 Sep 2023 06:20:32 -0500
Subject: [PATCH 14/16] Minor edit to eliminate ambiguity

---
 patterns/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/patterns/README.md b/patterns/README.md
index 015edaf9..84b66a9a 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1111,7 +1111,7 @@ Finally, while we are unaware of any existing implementations intended for
 general use, it is likely that many users are already exploiting the fact that
 arbitrary fields may be added to the
 [table schema](https://specs.frictionlessdata.io//table-schema/)
-to support internal implementations (e.g., our group is doing so).
+to support internal implementations.
 
 ### Proposed extensions
 

From de004ba9e9729932f7cba91ef46c263b8aeae2b4 Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 14 Oct 2023 07:29:19 -0500
Subject: [PATCH 15/16] Fix syntax (thanks @khusmann!)

---
 patterns/README.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/patterns/README.md b/patterns/README.md
index 84b66a9a..0287fb4f 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1179,7 +1179,9 @@ extension (2) to do so:
 {
   "name": "physical_health",
   "type": "integer",
-  "enum": [1,2,3,4,5]
+  "constraints": {
+    "enum": [1,2,3,4,5]
+  }
   "enumOrdered": true
   "enumLabels": {
     "1": "Poor",
@@ -1206,7 +1208,9 @@ A second variant of the example above is the following:
 {
   "name": "physical_health",
   "type": "integer",
-  "enum": [1,2,3,4,5]
+  "constraints": {
+    "enum": [1,2,3,4,5]
+  }
   "enumOrdered": true
   "enumLabels": {
     "1": "Poor",

From bb8f09368ea8de075c08294a3d1c6e875f8d578e Mon Sep 17 00:00:00 2001
From: Phil Schumm
Date: Sat, 14 Oct 2023 07:53:27 -0500
Subject: [PATCH 16/16] Add missing comma

---
 patterns/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/patterns/README.md b/patterns/README.md
index 0287fb4f..23c88987 100644
--- a/patterns/README.md
+++ b/patterns/README.md
@@ -1182,7 +1182,7 @@ extension (2) to do so:
   "constraints": {
     "enum": [1,2,3,4,5]
   }
-  "enumOrdered": true
+  "enumOrdered": true,
   "enumLabels": {
     "1": "Poor",
     "2": "Fair",
@@ -1211,7 +1211,7 @@ A second variant of the example above is the following:
   "constraints": {
     "enum": [1,2,3,4,5]
   }
-  "enumOrdered": true
+  "enumOrdered": true,
   "enumLabels": {
     "1": "Poor",
     "2": "Fair",