Hi,

I am trying to compute a per-group distinct count with dplyr's n_distinct() (a sketch of the kind of call is shown just below). I noticed that it is very slow with many small groups and when my_var is a factor with many levels. Converting it to an integer or a character makes things almost instantaneous.
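A minimal sketch of that kind of call, assuming the data is grouped by grp and the factor column is my_var; my_data and n_val are made-up names, not from the original report:

library(dplyr)

# Simulated stand-in for the real data; grp and my_var mirror the reprex
# further down, my_data and n_val are made-up names.
set.seed(1)
n <- 1e5
my_data <- tibble(
  grp    = floor(rexp(n, 10 / n)),
  my_var = factor(paste0("v_", sample.int(n, replace = TRUE)))
)

# The slow call: one n_distinct() per (small) group
my_data %>%
  group_by(grp) %>%
  summarise(n_val = n_distinct(my_var))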
The reprex below shows that the slowness comes from vec_unique_count itself, not from the wrapper code in dplyr's n_distinct. A possible explanation (I am not sure): for every call on a small group, all of the factor's levels have to be handled by the C++ code, which is a big overhead when there are many levels and many small groups.
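If that explanation holds, a single short factor with many unused levels should already be expensive to process. A quick check of the hypothesis (not part of the original timings):

library(vctrs)

# One value either way, but 1e5 declared levels vs. a single level
x_many_levels <- factor("v_1", levels = paste0("v_", seq_len(1e5)))
x_few_levels  <- factor("v_1")

# If the per-call cost scales with the number of levels, the first loop
# should be far slower than the second, mirroring the per-group overhead.
system.time(for (i in seq_len(1000)) vec_unique_count(x_many_levels))
system.time(for (i in seq_len(1000)) vec_unique_count(x_few_levels))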
As mentioned above, a workaround is to pre-convert the variable to an integer. But wouldn't it make sense to do this factor -> integer conversion within vec_unique_count itself? (A user-level sketch of the idea follows.)
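Purely as an illustration of the suggestion, a user-level shim could special-case factors before counting; n_distinct_fct is a made-up name, and the real change would presumably live in the C implementation of vec_unique_count:

library(vctrs)

# Hypothetical wrapper (not part of vctrs): count on the integer codes,
# which map one-to-one to the levels, so the distinct count is unchanged.
n_distinct_fct <- function(x) {
  if (is.factor(x)) x <- as.integer(x)
  vec_unique_count(x)
}

Note that, per the timings in the reprex, calling as.integer() separately on every small group still adds noticeable overhead, so doing the conversion once up front, or inside the C code, looks preferable.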
Reprex (R 4.1.2 on Windows 10, all packages at their latest CRAN version):
library(vctrs)
# Build simulated data ####
set.seed(1)
n <- 1e5
# Many groups, exponentially sized
grp <- floor(rexp(n, 10/n))
# val_chr has many distinct values altogether, but only a few per group
val_chr <- paste0("v_", sample.int(n, replace = TRUE))
# Convert val_chr to other data types
val_fct <- factor(val_chr)
val_int <- as.integer(val_fct)
# Per-group count distinct ####
system.time(res_chr <- tapply(val_chr, grp, vec_unique_count))
#>    user  system elapsed 
#>    0.33    0.00    0.34
system.time(res_int <- tapply(val_int, grp, vec_unique_count))
#>    user  system elapsed 
#>    0.31    0.00    0.33
system.time(res_fct <- tapply(val_fct, grp, vec_unique_count))
#>    user  system elapsed 
#>   27.86    0.05   28.66
# --> Perf problem here: about 100x slower for factor than character or int
# Difference is even worse with length(unique(v))
system.time(res_chr_base <- tapply(val_chr, grp, function(x) length(unique(x))))
#>    user  system elapsed 
#>    0.40    0.00    0.41
system.time(res_int_base <- tapply(val_int, grp, function(x) length(unique(x))))
#>    user  system elapsed 
#>    0.36    0.00    0.39
# system.time(res_fct_base <- tapply(val_fct, grp, function(x) length(unique(x))))
# --> killed after 5 min

# Some explorations for workarounds: not everything is efficient!
system.time(tapply(val_fct, grp, function(x) vec_unique_count(c(integer(0), x))))
#>    user  system elapsed 
#>    0.67    0.00    0.67
system.time(tapply(val_fct, grp, function(x) vec_unique_count(as.integer(x))))
#>    user  system elapsed 
#>    5.26    4.19    9.70
system.time(tapply(val_fct, grp, function(x) vec_unique_count(unclass(x))))
#>    user  system elapsed 
#>   27.56    0.02   28.28
Created on 2022-05-16 by the reprex package (v2.0.1)