Integration with labelled package set_value_labels() + haven_labelled class #488

Closed
calebasaraba opened this issue May 4, 2020 · 8 comments · Fixed by #794

@calebasaraba

Amazing package -- really love this project. I am trying to use it alongside the labelled package, and when I use the set_value_labels() function I get an error:

library(tidyverse)
library(gtsummary)
library(labelled)

mtcars %>%
  select(cyl, mpg) %>%
  set_variable_labels(cyl = "Cylinders",
                      mpg = "Miles per gallon") %>%
  set_value_labels(cyl = c("Four" = 4, "Six" = 6, "Eight" = 8)) %>%
  tbl_summary(by = cyl)
Column(s) ‘cyl’ omitted from output.
Accepted classes are ‘character’, ‘factor’, ‘numeric’, ‘logical’, ‘integer’, or ‘difftime’.
Error in class(data[[by]]) <- setdiff(class(data[[by]]), "labelled") : 
  attempt to set an attribute on NULL

Not sure if this is intentional behavior for the package, or if it would be an easy fix. Using factor labels for the variable levels (like below) works, but it would be great if gtsummary would also accept the haven_labelled class, as I'm seeing it used more and more.

mtcars %>%
  select(cyl, mpg) %>%
  set_variable_labels(cyl = "Cylinders",
                      mpg = "Miles per gallon") %>%
  mutate(cyl = factor(cyl, labels = c("Four","Six","Eight"))) %>%
  tbl_summary(by = cyl)

@ddsjoberg
Owner

Hello @calebasaraba ! Thank you for the note!!

I will need to put more thought into whether or not to extend gtsummary to accept other classes. In the case of haven labelled, it was never meant to be a class that was used in analysis or data exploration. Rather, it was created as an in-between when importing data from other languages where the data types don't have a one-to-one relationship with R. This is from a tidyverse article about the haven labelled class of variables (https://haven.tidyverse.org/articles/semantics.html):

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate datastructure that you can convert into a regular R data frame.

For the time being, I recommend converting the haven labelled variables to factors with as_factor(), which can be run on the entire data frame.
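As a minimal sketch of that suggestion, the snippet below builds a small data frame with one haven_labelled column and converts it with haven::as_factor(). The object names `df` and `df_factor` are illustrative only, and the example assumes the haven package is installed.

```r
library(haven)

# Illustrative data: cyl carries value labels, mpg is a plain numeric column
df <- data.frame(mpg = mtcars$mpg)
df$cyl <- labelled(mtcars$cyl, labels = c(Four = 4, Six = 6, Eight = 8))

# as_factor() has a data frame method; by default it converts only the
# labelled columns and leaves the other columns untouched
df_factor <- as_factor(df)

class(df_factor$cyl)
class(df_factor$mpg)
```

The converted data frame can then be passed to tbl_summary() like any data frame with factor columns.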

Happy Coding!

@calebasaraba
Author

Got it, thanks for the clarification about intended use of the haven_labelled class @ddsjoberg!

I have been enjoying the way set_variable_labels() and set_value_labels() from labelled fit into my workflow (I receive a lot of original data files from SPSS), but it makes a lot of sense to return to factors using as_factor(). I'll close this issue up.

Thanks for your quick response and all your awesome work :)

@karissawhiting
Contributor

For now, we are going to add more specific messaging around the haven_labelled class to indicate that it is not an accepted class and that users can use as_factor() to convert.
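To illustrate what such a check might look like, here is a rough sketch. The helper `warn_haven_labelled()` and its message wording are hypothetical and are not gtsummary's actual implementation; the example assumes the haven package is installed.

```r
library(haven)

# Hypothetical helper (not part of gtsummary): report haven_labelled columns
warn_haven_labelled <- function(data) {
  is_hl <- vapply(data, inherits, logical(1), what = "haven_labelled")
  cols <- names(data)[is_hl]
  if (length(cols) > 0) {
    message(
      "Column(s) ", paste(sQuote(cols), collapse = ", "),
      " have class 'haven_labelled', which is not an accepted class. ",
      "Consider converting with haven::as_factor()."
    )
  }
  # Return the flagged column names invisibly so callers can act on them
  invisible(cols)
}

df <- data.frame(mpg = c(21, 22.8, 18.7))
df$cyl <- labelled(c(6, 4, 8), labels = c(Four = 4, Six = 6, Eight = 8))
flagged <- warn_haven_labelled(df)
```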

@larmarange
Collaborator

Just a quick comment: labelled vectors are not always intended to be converted into factors. For example, you could have an age variable and add a label to the value 99 to say that 99 represents "99 or more".

This is why it is the responsibility of the user to unclass or to convert into a factor, depending on whether the variable should be treated as continuous or categorical.

A quick tip is to use labelled::unlabelled(), which performs a conditional conversion. By default, unlabelled() works as follows:

  • if a column doesn’t inherit the haven_labelled class, it will not be affected;
  • if all observed values have a corresponding value label, the column will be converted into a factor;
  • otherwise, the column will be unclassed (and converted back to a numeric or character vector).

But these assumptions hold only if users have properly documented their vectors.

More details on https://larmarange.github.io/labelled/articles/intro_labelled.html#conditionnal-conversion-to-factors-1
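The rules above can be sketched with a small example. The data frame and its values are made up for illustration: `cyl` has a label for every observed value, while `age` only labels the value 99. The example assumes the labelled package is installed.

```r
library(labelled)

# Illustrative data: cyl is fully labelled, age is only partially labelled
df <- data.frame(
  cyl = c(4, 6, 8, 6),
  age = c(34, 99, 51, 99)
)
df <- set_value_labels(
  df,
  cyl = c(Four = 4, Six = 6, Eight = 8),
  age = c("99 or more" = 99)
)

out <- unlabelled(df)

is.factor(out$cyl)   # every observed value has a label, so cyl becomes a factor
is.numeric(out$age)  # not every value has a label, so age is unclassed
```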

@ddsjoberg ddsjoberg added this to the 1.3.5 milestone Sep 5, 2020
@ddsjoberg ddsjoberg modified the milestones: 1.3.5, 1.3.6 Oct 1, 2020
@ddsjoberg ddsjoberg modified the milestones: 1.3.6, 1.4.0 Jan 8, 2021
@muminbayoumi

I'm having a similar issue.
I use expss::apply_labels(v1=label1,...) to set labels on my variables (not their values). It seems tbl_summary() is unable to pick up the labels for factor variables, i.e. the ones I explicitly set to factor using as_factor(). All other variable types and their labels are being picked up very nicely.

@ddsjoberg
Owner

@muminbayoumi can you post an example I can run on my machine (a.k.a. a reprex)?

@muminbayoumi

I'll have to apologise - it seems the underlying issue is with base R and the droplevels() function. However, this reprex illustrates how factor levels that are empty are still printed. Are you planning on adding an option to exclude those?

library(expss)
library(tidyverse)
library(forcats)
library(gtsummary)
library(sjmisc)



data <- tibble(.rows = 200)
data$CatColumAsFactor <- as_factor(sample(c('Apple', 'Banana', 'Cherry'), 200, replace = TRUE))
data$CatColumAsCharacter <- sample(c('Apple', 'Banana', 'Cherry'), 200, replace = TRUE)


data <- apply_labels(data,
                     CatColumAsCharacter='Character Column',
                     CatColumAsFactor= 'Factor Column')

## Without dropping the filtered levels, all levels are printed for the factor column;
## only levels with observed values are printed for the character column
data %>%
  filter(CatColumAsFactor != 'Cherry', CatColumAsCharacter != 'Apple') %>%
  tbl_summary()

## After dropping levels, the label attribute is lost and therefore not picked up by gtsummary
data %>%
  filter(CatColumAsFactor != 'Cherry', CatColumAsCharacter != 'Apple') %>%
  droplevels() %>%
  tbl_summary()

## to_label() preserves the labels
data %>%
  filter(CatColumAsFactor != 'Cherry') %>%
  sjmisc::to_label(drop.levels = TRUE) %>%
  tbl_summary()

The tables aren't rendering very well with the reprex() function, so I took them out.
Again, I am sorry; it isn't an issue with gtsummary.
And thanks for this wonderful package.

@ddsjoberg
Owner

Thank you for showing me this package! I hadn't heard of expss before, and it's such a popular package! @muminbayoumi

Showing the unobserved factors is a feature I think is useful. If you want unobserved factors removed, you can remove the levels before passing the data frame to tbl_summary(). There is likely a nice function in forcats to do this, but it can also be done with factor().

data %>%
  filter(CatColumAsFactor != 'Cherry', CatColumAsCharacter != 'Apple') %>%
  mutate_if(is.factor, factor) %>% # removes unobserved levels
  tbl_summary()
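Since the reprex earlier in the thread reports that the variable label is lost when levels are dropped, one possible workaround is to save the "label" attribute and restore it afterwards. The sketch below uses base R only; the helper name `drop_levels_keep_label()` is hypothetical and not part of gtsummary or forcats.

```r
# Hypothetical helper: drop unused factor levels but keep the "label" attribute
drop_levels_keep_label <- function(x) {
  if (!is.factor(x)) {
    return(x)
  }
  lbl <- attr(x, "label", exact = TRUE)
  x <- droplevels(x)
  attr(x, "label") <- lbl
  x
}

# Illustrative factor with an unobserved level ("Cherry") and a variable label
fruit <- factor(
  c("Apple", "Banana", "Apple"),
  levels = c("Apple", "Banana", "Cherry")
)
attr(fruit, "label") <- "Factor Column"

cleaned <- drop_levels_keep_label(fruit)
levels(cleaned)
attr(cleaned, "label")
```

Assuming a recent dplyr, the helper could be applied to every factor column with mutate(across(where(is.factor), drop_levels_keep_label)) before calling tbl_summary().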
