calculate_distance() possible improvement... #9

coforfe · 2020-06-07T10:08:40Z

Hi Przemek,

For data.frames with more than 200k rows, there is a significant opportunity to speed up this function, which is the core of the calculate_covariate_fit() function.

Instead of using rank() here:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(rank(c(variable_old, variable_new)), bins)
  }

It would be much faster to use frank() from the data.table package:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(frank(c(variable_old, variable_new)), bins)
  }
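
As a sanity check on this swap, here is a minimal sketch that compares rank() and frank() on simulated data (the vectors below are made up for illustration, not taken from the package). Both functions default to ties.method = "average", so the ranks and the resulting bins should agree. One caveat: the defaults for missing values differ (rank() uses na.last = "keep", frank() uses na.last = TRUE), so inputs containing NA would need that argument set explicitly.

```r
# Sketch only: compares base rank() with data.table::frank() on simulated data.
library(data.table)

set.seed(1)
variable_old <- rnorm(1e5)              # simulated stand-in for the reference sample
variable_new <- rnorm(1e5, mean = 0.2)  # simulated stand-in for the new sample
x <- c(variable_old, variable_new)

# Both functions default to ties.method = "average", so the ranks should agree.
r_base <- rank(x)
r_fast <- frank(x)
ranks_agree <- isTRUE(all.equal(r_base, r_fast))

# Identical ranks give identical bins after cut().
bins_agree <- identical(cut(r_base, 20), cut(r_fast, 20))

# Timings depend on the machine and the data, so none are quoted here.
system.time(rank(x))
system.time(frank(x))
```

If ranks_agree and bins_agree are both TRUE, the replacement does not change the output of this step for NA-free numeric input.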

After that step there is another calculation, based on table(), that can also be sped up significantly with a data.table equivalent. If you are willing to add a data.table dependency to shifter, I will open a PR with these changes. With them, the calculation on my comparison data.frame went from several minutes to barely 30 seconds.
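
To illustrate the kind of table() replacement I have in mind, here is a rough sketch on simulated data. The column names (value, origin, bin) and the overall layout are hypothetical and are not the package's actual code. It counts rows per bin and per sample with a grouped .N and checks the result against table().

```r
# Sketch only: per-bin counts with data.table instead of table().
library(data.table)

set.seed(1)
variable_old <- rnorm(1e4)              # simulated stand-in for the reference sample
variable_new <- rnorm(1e4, mean = 0.2)  # simulated stand-in for the new sample
bins <- 20

# Hypothetical long layout: one row per observation, tagged with its origin.
dt <- data.table(
  value  = c(variable_old, variable_new),
  origin = rep(c("old", "new"), c(length(variable_old), length(variable_new)))
)
dt[, bin := cut(frank(value), bins)]

# Count rows per bin and origin in a single grouped pass.
counts_dt <- dt[, .N, by = .(bin, origin)]

# Reference result from base table() for comparison.
counts_base <- table(dt$bin, dt$origin)

# Look up each (bin, origin) cell of the table and compare it with N.
idx <- cbind(as.character(counts_dt$bin), counts_dt$origin)
counts_match <- all(counts_base[idx] == counts_dt$N)
```

Note that table() keeps empty cells as zeros, while the grouped .N drops them, so any downstream code that expects one entry per bin would have to fill those in.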

Thanks,
Carlos.
