Support multiselect survey item types via `categorical` properties on `list` field types #940
Comments
To be honest, I don't have that much experience with these. From what I have seen, these are usually stored either as separate 'dummy' columns (so a separate column for apple, orange, etc., with 1 indicating that this category is selected) or in 'long format', where there are multiple rows, one for each category selected. I believe the latter is, for example, used in some of the hospital data we are working with, where patients can have 0 or more subdiagnoses (a dummy variable for each possible icd10 code would be a bit unwieldy 😆).

But I suspect this is also because most tools don't have direct/easy support for list type fields. In plain JS/Python this is quite easy (don't know about pandas); R also supports these, although working with list types is not easy in base R. So it is about support and not necessarily about this being or not being a natural way to store this data.

What you suggest is only a small deviation from the current spec. Actually, reading back the v3 spec: it depends a little bit on what the 'logical representation' of a list type is and how to interpret 'The logical representation of data in the field MUST exactly match one of the values in categories.' I would suspect that most humans reading your example would understand what is meant.
Exactly. To summarize, multiselect items are represented in tabular formats as:
Representations (2) & (3) can be presently captured by the current frictionless spec via boolean columns. In the current v2 spec, representation (1) is only partially supported: we can make lists of
Historically, I think that's true, but in my experience list-columns have more recently become ubiquitous across the current open software landscape. For example, Pandas has a function to explode a list-column of form (1) into exploded rows of form (3). There's also a similar function for list-columns in Polars. List-columns in base R are a little unwieldy, as you say, but now have excellent support in the tidyverse: tibbles of list-columns work seamlessly with purrr maps, and tidyr now provides Implementations that don't support list-columns could also easily include an option to load these fields by transforming them into an exploded form, or leave them as delimited strings.

The larger point, I think, is that it's not uncommon to see delimited lists of categoricals for multiselect items (e.g. REDCap & Qualtrics), and so it'd be nice to directly represent this format in a frictionless schema rather than requiring pre-transformation of the data to get it into frictionless. Plus, we're 90% of the way there already via the existing
Exactly. We would rephrase this part of the categorical definition to include a different provision for lists of categoricals. Something like:
There are a few other places like this where we'd need to update the definition to allow for the lists, but in general it would only be a minor deviation from the current spec, as you say.
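As a rough illustration of the representations discussed in this comment, here is a minimal pure-Python sketch. The numbered list from the original comment is not reproduced above, so the mapping used here is an assumption based on the surrounding discussion: form (1) is a list-valued field, form (2) is one dummy column per category, and form (3) is long format with one row per selection. The data, category names, and function names are hypothetical. In pandas, `DataFrame.explode` and `Series.str.get_dummies` cover similar ground.

```python
# Hypothetical example data: respondent id -> delimited multiselect response.
CATEGORIES = ["apple", "orange", "banana", "kiwi"]
raw = {1: "apple,banana", 2: "", 3: "kiwi"}


def to_lists(raw, delimiter=","):
    """Assumed form (1): parse each delimited string into a list of selections."""
    return {
        rid: [value for value in cell.split(delimiter) if value]
        for rid, cell in raw.items()
    }


def to_dummies(lists, categories):
    """Assumed form (2): one boolean column per category."""
    return {
        rid: {category: category in chosen for category in categories}
        for rid, chosen in lists.items()
    }


def to_long(lists):
    """Assumed form (3): one (id, selection) row per selected category.

    Respondents who selected nothing produce no rows in this sketch.
    """
    return [(rid, choice) for rid, chosen in lists.items() for choice in chosen]


lists = to_lists(raw)
dummies = to_dummies(lists, CATEGORIES)
long_rows = to_long(lists)
```

Note that the long form in this sketch loses respondents with no selections, which is one reason an implementation might prefer to keep the list-valued form as the stored representation.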
Here are some very quick reactions, nothing I would feel strongly about without further thought.

Multiselect items are typically pretty bad from a measurement perspective; people often analyze them as though they were independent responses to each of several questions (one for each possible item), but of course the task when you present someone with a list of items and ask them to "Check all that apply" is actually quite different. If you want to know a participant's response to each item, then you really need to ask about each individually.

That said, I recognize that there can be legitimate use cases, and even in cases where individual items would have been better, researchers still use multiselect fields (e.g., REDCap includes them as an option), so we need a way to represent them in the data.

While systems such as Pandas allow list fields, they are not very helpful when fitting statistical models, as it is difficult to create a model matrix from a list field. Hence, Stata doesn't have list fields (don't know about SAS or SPSS). Stata does have a function to split a list field into Representation (2) above, and it's not difficult to translate to Representation (3) either. But if you are always going to do that before fitting a model, then you don't gain much by using Representation (1).

I would point out that, if the items are ["Apple","Orange","Banana","Kiwi"], then there is a big difference between the value "Apple,Banana" and the value "Checked,Not checked,Checked,Not checked"; specifically (1) the Stata function

Finally, a very similar issue is how to handle rank choices, including responses with tied ranks. In one sense a multiselect field may be thought of as a special case of a rank choice field (i.e., one with only two ranks, both of which may have ties). Perhaps that's too great a level of abstraction for this purpose.
In sum, I don't have any specific objection to what Kyle is proposing above, and I can even think of cases where it would work very nicely (e.g., storing diagnosis codes in medical record data). My only point is that there are instances of multiselect fields (and rank choice fields) that it doesn't cover, and that are important for full support of social, behavioral, and biomedical research. For example, in the case of a multiselect field in REDCap, it is unlikely that I would want to store the resulting data in this way.
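To make the contrast between the value "Apple,Banana" and the value "Checked,Not checked,Checked,Not checked" concrete, here is a small Python sketch that converts between the two for the item list given in this comment. The function names are hypothetical, and the sketch assumes a comma delimiter with no escaping. It shows only that the two encodings carry the same selections when the item order is known; it does not address the measurement concerns raised above.

```python
ITEMS = ["Apple", "Orange", "Banana", "Kiwi"]


def selections_to_status(value, items, delimiter=","):
    """Expand a list of selected items into one status per item, in item order."""
    selected = {part for part in value.split(delimiter) if part}
    unknown = selected - set(items)
    if unknown:
        raise ValueError(f"selections not in item list: {sorted(unknown)}")
    return delimiter.join(
        "Checked" if item in selected else "Not checked" for item in items
    )


def status_to_selections(status, items, delimiter=","):
    """Collapse one status per item back into the list of selected items."""
    flags = status.split(delimiter)
    if len(flags) != len(items):
        raise ValueError("expected exactly one status per item")
    return delimiter.join(
        item for item, flag in zip(items, flags) if flag == "Checked"
    )


status = selections_to_status("Apple,Banana", ITEMS)
round_trip = status_to_selections(status, ITEMS)
```

The per-item form records an explicit status for every item, while the selection list only records what was chosen, so the second function needs the full item list to interpret its input.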
"Multiple select" items are an extremely common survey question type in the social / medical / bio-behavioral / etc. sciences. For example:
Which fruits do you like? (Select all that apply)
Data from such items are often exported from survey software as a delimited list in a field. Qualtrics will export data like this (in fact, it uses this delimited list form by default), and I believe REDCap has an option for it as well (@pschumm please correct me if I'm wrong!). For example, an exported CSV from the above item might look something like this:
For representing these item types in frictionless, I'd like to propose we allow `categorical` properties to be defined on `list` item types (where `itemType` is either `integer` or `string`). This way, the above multiple select item field could be represented as follows:

Or in a coded representation:
Thoughts from other folks that frequently use categorical items? @pschumm @fomcl @djvanderlaan