Support multiselect survey item types via categorical properties on list field types #940

Open
khusmann opened this issue Jun 19, 2024 · 3 comments

Comments

@khusmann
Contributor

"Multiple select" items are an extremely common type of survey question in the social / medical / bio-behavioral / etc. sciences. For example:

Which fruits do you like? (Select all that apply)

a. Apple
b. Orange
c. Banana
d. Kiwi

Data from such items are often exported from survey software as a delimited list in a single field. Qualtrics will export data like this (in fact, it uses this delimited list form by default), and I believe REDCap has an option for it as well (@pschumm please correct me if I'm wrong!). For example, an exported CSV from the above item might look something like this:

id,multiselectField
0,"Apple"
1,"Apple,Orange"
2,"Apple,Banana,Kiwi"

For representing these item types in frictionless, I'd like to propose we allow categorical properties to be defined on list item types (where itemType is either integer or string). This way, the above multiple select item field could be represented as follows:

{
  "name": "multiselectField",
  "type": "list",
  "itemType": "string",
  "categories": ["Apple", "Orange", "Banana", "Kiwi"]
}

Or in a coded representation:

{
  "name": "multiselectField",
  "type": "list",
  "itemType": "integer",
  "categories": [
    { "value": 0, "label": "Apple"},
    { "value": 1, "label": "Orange"},
    { "value": 2, "label": "Banana"},
    { "value": 3, "label": "Kiwi"}
  ]
}
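As a rough illustration of how an implementation might read such a field, here is a minimal Python sketch. It assumes a comma delimiter and treats an empty cell as "nothing selected"; both are assumptions for illustration, and the helper names `parse_list_cell` and `category_labels` are hypothetical, not part of any Frictionless library.

```python
import csv
import io

# Example export copied from the issue text.
CSV_TEXT = 'id,multiselectField\n0,"Apple"\n1,"Apple,Orange"\n2,"Apple,Banana,Kiwi"\n'

# Proposed field descriptor (string form) copied from the issue text.
FIELD = {
    "name": "multiselectField",
    "type": "list",
    "itemType": "string",
    "categories": ["Apple", "Orange", "Banana", "Kiwi"],
}

# Hypothetical cast table: only the two item types the proposal covers.
CASTS = {"string": str, "integer": int}


def parse_list_cell(cell, field, delimiter=","):
    """Split one delimited cell and cast each element to the field's itemType."""
    if cell == "":
        # Assumption: an empty cell means an empty selection, not a missing value.
        return []
    cast = CASTS[field["itemType"]]
    return [cast(part) for part in cell.split(delimiter)]


def category_labels(field):
    """Map each category value to its label; bare values label themselves."""
    labels = {}
    for cat in field["categories"]:
        if isinstance(cat, dict):
            labels[cat["value"]] = cat.get("label", cat["value"])
        else:
            labels[cat] = cat
    return labels


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
parsed = [parse_list_cell(row[FIELD["name"]], FIELD) for row in rows]
```

With the coded descriptor, the same two helpers would turn a cell such as `"0,2"` into `[0, 2]` and then look up the labels for display.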

Thoughts from other folks that frequently use categorical items? @pschumm @fomcl @djvanderlaan

@djvanderlaan

To be honest, I don't have that much experience with these. From what I have seen, these are usually stored either as separate 'dummy' columns (so a separate column for apple, orange, etc., with 1 indicating that this category is selected) or in 'long format', where there are multiple rows, one for each category selected. I believe the latter is, for example, used in some of the hospital data we are working with, where patients can have 0 or more subdiagnoses (a dummy variable for each possible ICD-10 code would be a bit unwieldy 😆 ).

But I suspect this is also because most tools don't have direct/easy support for list type fields. In plain JS/Python this is quite easy (don't know about pandas); R also supports these, although working with list types is not easy in base-R. So it is about support and not necessarily about this being or not being a natural way to store this data.

What you suggest is only a small deviation from the current spec. Actually, reading back the v3 spec: it depends a little bit on what the 'logical representation' of a list type is and how to interpret 'The logical representation of data in the field MUST exactly match one of the values in categories.' I would suspect that most humans reading your example would understand what is meant.

@khusmann
Contributor Author

khusmann commented Jul 9, 2024

> From what I have seen these are usually stored either as separate 'dummy' columns (so separate column for apple, orange etc with 1 indicating that this category is selected) or in 'long format' where there are multiple rows; one for each category selected.

Exactly. To summarize, multiselect items are represented in tabular formats as:

  1. Delimited lists in a single column (as I'm proposing here; Qualtrics and REDCap support this)
  2. Exploded columns (one boolean column for each option; Qualtrics and REDCap also support this)
  3. Exploded rows (like 2, but transformed into 'long' format; not supported by Qualtrics / REDCap, but used elsewhere, as you say)

Representations (2) and (3) can already be captured by the current frictionless spec via boolean columns. In the current v2 spec, representation (1) is only partially supported: we can make lists of integer or string types, but we cannot define categories for them. This proposal would allow us to define the categories property on integer and string lists, thereby giving us full support for representation (1).

> But I suspect this is also because most tools don't have direct/easy support for list type fields. In plain JS/Python this is quite easy (don't know about pandas); R also supports these, although working with list types is not easy in base-R. So it is about support and not necessarily about this being or not being a natural way to store this data.

Historically, I think that's true, but in my experience list-columns have more recently become ubiquitous across the current open software landscape. For example, Pandas has a function to explode a list-column of form (1) into exploded rows of form (3). There's also a similar function for list-columns in Polars. List-columns in base R are a little unwieldy, as you say, but now have excellent support in the tidyverse: tibbles of list-columns work seamlessly with purrr maps, and tidyr now provides unnest_wider and unnest_longer to explode list-columns into forms (2) and (3), respectively (here's the relevant vignette).
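For concreteness, a small pandas sketch of moving from form (1) to forms (3) and (2), using the example data from the top of this issue. It relies on `Series.str.split`, `DataFrame.explode`, and `Series.str.get_dummies`; the variable names are only illustrative.

```python
import pandas as pd

# Example data from the issue, stored as delimited strings (form 1 on disk).
df = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "multiselectField": ["Apple", "Apple,Orange", "Apple,Banana,Kiwi"],
    }
)

# Form (1) in memory: turn each delimited string into a Python list (a list-column).
as_lists = df.assign(multiselectField=df["multiselectField"].str.split(","))

# Form (3): one row per selected option.
exploded_rows = as_lists.explode("multiselectField", ignore_index=True)

# Form (2): one 0/1 indicator column per option, built from the delimited strings.
exploded_cols = df[["id"]].join(df["multiselectField"].str.get_dummies(sep=","))
```

One caveat with this sketch: `str.get_dummies` only creates columns for options that actually occur in the data, so an option nobody selected would be missing from form (2). Declared categories on the list field are one way an implementation could recover the full set of options.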

Implementations that don't support list-columns could also easily include an option to load these fields by transforming them into an exploded form, or leave them as delimited strings.

The larger point, I think, is that it's not uncommon to see delimited lists of categoricals for multiselect items (e.g. REDCap & Qualtrics), and so it'd be nice to directly represent this format in a frictionless schema rather than requiring pre-transformation of the data to get it into frictionless. Plus, we're 90% of the way there already via the existing list field type...

> What you suggest is only a small deviation from the current spec. Actually reading back the v3 spec: it depends a little bit what the 'logical representation' of a list type is and how to interpret 'The logical representation of data in the field MUST exactly match one of the values in categories.'

Exactly. We would rephrase this part of the categorical definition to include a different provision for lists of categoricals. Something like:

When the categorical property is applied to a `list` field type, the logical representation of each element in the list `MUST` exactly match one of the values in categories.
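A minimal Python sketch of how a validator might apply this per-element rule. The function names `allowed_values` and `invalid_elements` are hypothetical, and the sketch assumes categories are given either as bare values or as objects with a `value` key, as in the two examples at the top of this issue.

```python
def allowed_values(categories):
    """Collect category values from bare values or {"value", "label"} objects."""
    return {c["value"] if isinstance(c, dict) else c for c in categories}


def invalid_elements(values, categories):
    """Return the elements of a list value that match no category (empty if valid)."""
    allowed = allowed_values(categories)
    return [v for v in values if v not in allowed]
```

Under this reading, a list value is valid when `invalid_elements` returns an empty list; the whole list is never compared against the categories as a single unit.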

There are a few other places like this where we'd need to update the definition to allow for lists, but in general it would only be a minor deviation from the current spec, as you say.

@pschumm
Contributor

pschumm commented Jul 10, 2024

Here are some very quick reactions—nothing I would feel strongly about without further thought. Multiselect items are typically pretty bad from a measurement perspective; people often analyze them as though they were independent responses to each of several questions (one for each possible item), but of course the task when you present someone with a list of items and ask them to "Check all that apply" is actually quite different. If you want to know a participant's response to each item, then you really need to ask about each individually. That said, I recognize that there can be legitimate use cases, and even in cases where individual items would have been better, researchers still use multiselect fields (e.g., REDCap includes them as an option), so we need a way to represent them in the data.

While systems such as Pandas allow list fields, they are not very helpful when fitting statistical models, as it is difficult to create a model matrix from a list field. Hence, Stata doesn't have list fields—don't know about SAS or SPSS. Stata does have a function to split a list field into Representation (2) above, and it's not difficult to translate to Representation (3) either. But if you are always going to do that before fitting a model, then you don't gain much by using Representation (1).

I would point out that, if the items are ["Apple","Orange","Banana","Kiwi"], then there is a big difference between the value "Apple,Banana" and the value "Checked,Not checked,Checked,Not checked"; specifically (1) the Stata function split will work on the latter but not on the former, (2) the ordering of the categories is ambiguous in the former but not in the latter, and (3) the latter has categories at two levels (i.e., the items and the response options for each) while the former has only one level of categories (i.e., the items). Also, missing values can work quite differently (and mean quite different things) in the two cases. Thus, I think we should think a bit about whether we want to support both of these, and if not, which one we prefer.
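To make the contrast concrete, here is a small Python sketch that converts between the two encodings. The helper names are hypothetical, and the "Checked" / "Not checked" labels are taken from the example above; the sketch does not attempt to model missing values.

```python
ITEMS = ["Apple", "Orange", "Banana", "Kiwi"]


def to_positional(selected, items=ITEMS):
    """Encode a selection as one response per item, in the fixed item order."""
    return ["Checked" if item in selected else "Not checked" for item in items]


def from_positional(flags, items=ITEMS):
    """Recover the selected items from a positional response list."""
    return [item for item, flag in zip(items, flags) if flag == "Checked"]
```

Note that `["Apple", "Banana"]` and `["Banana", "Apple"]` map to the same positional value, which illustrates point (2): the selected-items form leaves element order unconstrained, while the positional form fixes it by the item order.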

Finally, a very similar issue is how to handle rank choices, including responses with tied ranks. In one sense a multiselect field may be thought of as a special case of a rank choice field (i.e., one with only two ranks, both of which may have ties). Perhaps that's too great a level of abstraction for this purpose.

In sum, I don't have any specific objection to what Kyle is proposing above, and I can even think of cases where it would work very nicely (e.g., storing diagnosis codes in medical record data). My only point is that there are instances of multiselect fields (and rank choice fields) that it doesn't cover, and that are important for full support of social, behavioral, and biomedical research. For example, in the case of a multiselect field in REDCap, it is unlikely that I would want to store the resulting data in this way.
