vec_unique_count slow for factors #1560

pgramme · 2022-05-16T17:36:35Z

Hi

I am trying to do the following with dplyr:

my_df %>% group_by(grp) %>% summarise(nd = n_distinct(my_var))

This is very slow when there are many small groups and my_var is a factor with many levels. Converting my_var to integer or character makes it almost instantaneous.

The reprex below shows that the slowness comes from vec_unique_count and not from the wrapper code in dplyr's n_distinct. Possible explanation (not sure): on every call for a small group, the compiled code has to handle all the factor levels, which adds a large per-call overhead.
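To illustrate the hypothesis above, here is a minimal base R sketch (the names f, small and codes are illustrative and not part of the reprex). It only shows that a tiny subset of a factor still carries the full set of levels, while its integer codes carry no attributes. It does not prove that this is what makes vec_unique_count slow.

```r
# Illustrative names only; base R, no packages needed.
set.seed(1)
f <- factor(paste0("v_", sample.int(1e5, replace = TRUE)))

# A tiny "group" of three values taken from the factor
small <- f[1:3]
length(small)                  # number of elements in the subset
nlevels(small)                 # number of levels carried by the subset
nlevels(small) == nlevels(f)   # subsetting keeps every level of f

# The integer codes of the same subset carry no levels attribute
codes <- as.integer(small)
is.null(attributes(codes))
```

If the per-call cost scales with the number of levels rather than with the group size, this would be consistent with the timings in the reprex, but that remains an assumption.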

As mentioned above, a workaround is to convert the variable to integer beforehand. But wouldn't it make sense to do this factor-to-integer conversion inside vec_unique_count?
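A minimal sketch of that workaround, using base R only so it runs without extra packages. The data is a smaller version of the simulated data in the reprex, and length(unique(x)) stands in for vec_unique_count here (an assumption made to keep the example self-contained). The key point is that the conversion happens once, before the per-group loop.

```r
# Smaller simulated data, same shape as in the reprex
set.seed(1)
n <- 1e4
grp <- floor(rexp(n, 10 / n))
val_fct <- factor(paste0("v_", sample.int(n, replace = TRUE)))

# Convert the factor to its integer codes once, outside the per-group calls
val_int <- as.integer(val_fct)
nd_int <- tapply(val_int, grp, function(x) length(unique(x)))

# Reference result computed on the character values
nd_chr <- tapply(as.character(val_fct), grp, function(x) length(unique(x)))

# Both approaches give the same per-group distinct counts
stopifnot(all(nd_int == nd_chr))
```

In the dplyr pipeline from the top of this post, the equivalent would presumably be to add mutate(my_var = as.integer(my_var)) before group_by(grp), rather than converting inside summarise.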

Reprex (R 4.1.2 on Windows 10, all packages at their latest CRAN version):

library(vctrs)

# Build simulated data ####
set.seed(1)
n <- 1e5
# Many groups, exponentially sized
grp <- floor(rexp(n, 10/n))
# val_chr has many distinct values altogether, but only a few per group
val_chr <- paste0("v_", sample.int(n, replace = TRUE))
# Convert val_chr to other data types
val_fct <- factor(val_chr)
val_int <- as.integer(val_fct)


# Per-group count distinct ####
system.time(res_chr <- tapply(val_chr, grp, vec_unique_count))
#>    user  system elapsed 
#>    0.33    0.00    0.34
system.time(res_int <- tapply(val_int, grp, vec_unique_count))
#>    user  system elapsed 
#>    0.31    0.00    0.33
system.time(res_fct <- tapply(val_fct, grp, vec_unique_count))
#>    user  system elapsed 
#>   27.86    0.05   28.66
# --> Perf problem here: roughly 85x slower for factor than for character or int (28.66 s vs 0.34 s elapsed)

# The difference is even worse with length(unique(x))
system.time(res_chr_base <- tapply(val_chr, grp, function(x) length(unique(x))))
#>    user  system elapsed 
#>    0.40    0.00    0.41
system.time(res_int_base <- tapply(val_int, grp, function(x) length(unique(x))))
#>    user  system elapsed 
#>    0.36    0.00    0.39
# system.time(res_fct_base <- tapply(val_fct, grp, function(x) length(unique(x))))
# --> killed after 5min

# Some explorations for workarounds: not everything is efficient!
system.time(tapply(val_fct, grp, function(x) vec_unique_count(c(integer(0),x))))
#>    user  system elapsed 
#>    0.67    0.00    0.67
system.time(tapply(val_fct, grp, function(x) vec_unique_count(as.integer(x))))
#>    user  system elapsed 
#>    5.26    4.19    9.70
system.time(tapply(val_fct, grp, function(x) vec_unique_count(unclass(x))))
#>    user  system elapsed 
#>   27.56    0.02   28.28

Created on 2022-05-16 by the reprex package (v2.0.1)


pgramme · 2022-05-16T17:42:55Z

Related, but probably different issue: tidyverse/dplyr#5017 (comment)
