calculate_distance() possible improvement... #9

Open
coforfe opened this issue Jun 7, 2020 · 0 comments

Hi Przemek,

For data.frames with more than 200k rows, there is a significant opportunity to speed up this function, which is the core of calculate_covariate_fit().

Instead of using rank() here:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(rank(c(variable_old, variable_new)), bins)
  }

It would be much faster to use frank() from the data.table package:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(frank(c(variable_old, variable_new)), bins)
  }
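
To make the swap easier to evaluate, here is a minimal sketch that checks frank() against rank() on simulated data and times both. The vectors and their sizes are made up for illustration. It relies on both functions defaulting to ties.method = "average", so the resulting ranks (and therefore the bins from cut()) should match.

```r
library(data.table)

# Simulated stand-ins for the two variables being compared (illustrative only).
set.seed(1)
variable_old <- rnorm(2e5)
variable_new <- rnorm(2e5, mean = 0.1)
x <- c(variable_old, variable_new)

# Both functions default to ties.method = "average", so the ranks should agree.
stopifnot(isTRUE(all.equal(rank(x), frank(x))))

# Rough timing comparison; actual numbers depend on the machine and the data.
print(system.time(rank(x)))
print(system.time(frank(x)))
```

If the check passes, the binned result of cut() is unchanged and only the running time differs.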

After that step there is another calculation based on table() that can also be sped up significantly with data.table. If you are willing to add a data.table dependency to shifter, I will open a PR with these changes. With them, I was able to go from several minutes of computation on my comparison data.frame to barely 30 seconds.
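
As a rough sketch of the second idea: the exact table() call is not shown above, so this assumes it counts observations per bin for the old and new variables. The names `bin` and `group` are hypothetical placeholders, not names from the package.

```r
library(data.table)

# Hypothetical inputs: `bin` stands in for the output of cut(), and `group`
# marks whether an observation came from the old or the new variable.
set.seed(1)
n <- 1e5
bin <- sample(1:20, 2 * n, replace = TRUE)
group <- rep(c("old", "new"), each = n)

# Base R: two-way contingency table.
counts_base <- table(group, bin)

# data.table: grouped counts with .N, one row per (group, bin) pair.
dt <- data.table(group = group, bin = bin)
counts_dt <- dt[, .N, by = .(group, bin)]

# Same counts, in long format instead of wide.
stopifnot(counts_dt[group == "old" & bin == 1L, N] == counts_base["old", "1"])
```

The grouped result is in long format, so any downstream code that indexes the table() output by position would need a small adjustment.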

Thanks,
Carlos.
