Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace vroom with data.table::fread #318

Merged
merged 25 commits into from
Nov 12, 2020
Merged

Replace vroom with data.table::fread #318

merged 25 commits into from
Nov 12, 2020

Conversation

rok-cesnovar
Copy link
Member

@rok-cesnovar rok-cesnovar commented Oct 16, 2020

Summary

Replaces vroom with data.table

Closes #299
closes #198
closes #262

Copyright and Licensing

Please list the copyright holder for the work you are submitting
(this will be you or your assignee, such as a university or company):
Rok Češnovar, Univ. of Ljubljana

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

@rok-cesnovar rok-cesnovar changed the title Fread Replace vroom with data.table::fread Oct 16, 2020
@rok-cesnovar rok-cesnovar force-pushed the fread branch 2 times, most recently from f664109 to 760e68c Compare October 17, 2020 17:05
@codecov-io
Copy link

codecov-io commented Oct 17, 2020

Codecov Report

Merging #318 (e1d1c60) into master (962e7c1) will increase coverage by 0.24%.
The diff coverage is 98.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #318      +/-   ##
==========================================
+ Coverage   89.52%   89.77%   +0.24%     
==========================================
  Files          12       12              
  Lines        2588     2563      -25     
==========================================
- Hits         2317     2301      -16     
+ Misses        271      262       -9     
Impacted Files Coverage Δ
R/read_csv.R 98.77% <98.46%> (+2.45%) ⬆️
R/data.R 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 962e7c1...e1d1c60. Read the comment docs.

Copy link
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, looks good! I made a few tiny comments, but otherwise the main thing is that it's a bit difficult for me to parse (pun intended) the changes to the read_csv_metadata() function from the diff. It's a bit hard to follow. Are there particular lines that I should focus on checking?

Also do you know why the windows unit tests are failing but not mac or linux?

R/data.R Outdated
chain_draws <- posterior::as_draws_df(posterior::subset_draws(draws, chain = chain))
colnames(chain_draws) <- unrepair_variable_names(variables)
chain_draws <- posterior::subset_draws(draws, chain = chain)
unname(chain_draws)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. Why do we need to remove the names here? That's fine if necessary, just curious.

  2. Right now this isn't assigned to anything. I think you need

chain_draws <- unname(chain_draws)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to remove the names here? That's fine if necessary, just curious.

Because it otherwise writes the iteration ids in the CSV.

Right now this isn't assigned to anything. I think you need

Hm, this did help with that, will double-check.

Comment on lines 71 to 89
# test_that("inv_metric method works after mcmc", {
# skip_on_cran()
# x <- fit_mcmc_1$inv_metric()
# expect_length(x, fit_mcmc_1$num_chains())
# checkmate::expect_matrix(x[[1]])
# checkmate::expect_matrix(x[[2]])
# expect_equal(x[[1]], diag(diag(x[[1]])))
#
# x <- fit_mcmc_1$inv_metric(matrix=FALSE)
# expect_length(x, fit_mcmc_1$num_chains())
# expect_null(dim(x[[1]]))
# checkmate::expect_numeric(x[[1]])
# checkmate::expect_numeric(x[[2]])
#
# x <- fit_mcmc_2$inv_metric()
# expect_length(x, fit_mcmc_2$num_chains())
# checkmate::expect_matrix(x[[1]])
# expect_false(x[[1]][1,2] == 0) # dense
# })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all these lines commented out on purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails on macOS machine on the CI. I am currently unable to debug further. If you have a few minutes, mind running this test if it fails for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I can check in a few min and let you know

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test passes on my mac

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! Thats good news and bad news at the same time :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good and bad. maybe we should let it run on CI again and I can see if I can debug it

@rok-cesnovar
Copy link
Member Author

Are there particular lines that I should focus on checking?

Will explain a bit more once I get all tests passing. The main idea is that we read in all "metadata" lines and parse them then. Previously we read metadata lines line-by-line.

Also do you know why the windows unit tests are failing but not mac or linux?

Its having problems finding grep. grep is a part of the Rtools installation so this is just a CI configuration thing I believe.

@rok-cesnovar
Copy link
Member Author

Status update: still issues with tests. I can use cmdstanr with fread on Windows locally (run a model and read in draws works fine) but cant get tests to pass both locally or in CI. Non-windows side of thing is ready, however.

@jgabry
Copy link
Member

jgabry commented Nov 1, 2020

Thanks for the update. Is the problem on windows still related to failing to find "grep"?

@rok-cesnovar
Copy link
Member Author

Seems so. Though it seems to find it in “normal use” but not when running tests. Should have time look into this a bit more this week and finish this up.

@jgabry
Copy link
Member

jgabry commented Nov 1, 2020

That's strange. Sorry for the hassle, that sounds annoying, but thanks for working on this!

@mike-lawrence
Copy link
Collaborator

You don't have Cygwin installed on your local Windows machine, do you? Here's someone similarly having trouble with the rtools-installed grep, and I see that the data.table maintainer seems to suggest installing grep via Cygwin as the primary way to get it working.

# Conflicts:
#	.github/workflows/Test-coverage.yaml
#	tests/testthat/test-fit-mle.R
#	tests/testthat/test-fit-shared.R
#	tests/testthat/test-model-compile.R
#	tests/testthat/test-model-sample.R
Copy link
Member Author

@rok-cesnovar rok-cesnovar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eureka!

I think I solved the Windows CI issue and this is ready for review and merge finally. That took way too long.

Posting a few comments to make the review easier. Apart from replacing vroom the biggest change is the refactored metadata read.

tests/testthat/test-fit-gq.R Show resolved Hide resolved
R/read_csv.R Show resolved Hide resolved
R/read_csv.R Show resolved Hide resolved
R/read_csv.R Show resolved Hide resolved
R/read_csv.R Show resolved Hide resolved
@rok-cesnovar
Copy link
Member Author

Here's someone similarly having trouble with the rtools-installed grep, and I see that the data.table maintainer seems to suggest installing grep via Cygwin as the primary way to get it working.

Thanks, @mike-lawrence for the suggestion. But even if this turned out to work, that would not be doable for cmdstanr as it would require users to also have Cygwin installed. I think I managed to find a workaround that works for Windows locally and in CI.

@rok-cesnovar
Copy link
Member Author

I will run some benchmarks tomorrow to double check everything.

@jgabry
Copy link
Member

jgabry commented Nov 11, 2020

Awesome! I'll try to review this soon.

Copy link
Member

@jgabry jgabry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious to see what the benchmarks say but aside from that this is great and I think it's ready to merge! I'll approve it now and then assuming the benchmarks look good we can go ahead and merge.

I think we also said we would tag version 0.2.0 after merging this, right? If that still sounds good to you then after this is merged we should update the DESCRIPTION file, the website, and upload the source package to stan-dev/r-packages. I'm happy to take care of that.

@rok-cesnovar
Copy link
Member Author

Benchmark results of :

  • this branch
  • rstan::read_stan_csv
  • fread + as_draws_array (this is our floor that we can't reach because we need to handle metadata)

I did not compare with our current implementation as we know it's slower by a lot for large cases. And speedup is not the only reason we want to replace vroom with data.table.

All runs were done with 500 iterations and this model:

data {
  int N;
}
parameters {
  real k;
}
transformed parameters {
  real x[N] = rep_array(k, N);
}
model {
  k ~ normal(0, 1);
}

freadvsutilsnonlog

log scale:
freadvsutils

So this all looks good. One thing I am not exactly sure and can't explain is that in the fread + as_draws_array case, half of the time is spent in the as_draws_array.

f <- data.table::fread(cmd= paste0("grep -v '^#' ", fit$output_files()[1]))
d <- posterior::as_draws_array(f)

So for the case of N=150000, its 5 seconds for fread and 4 seconds for as_draws_array. The latter seems a lot for that, but maybe I am missing something. Anyhow that is definitely something to figure out separately, not directly connected to this PR.

@rok-cesnovar
Copy link
Member Author

For 0.2.0 I agree to do it after this is merged. I will also open another simple PR today to clean up some other issues, but none of them are critical.

@jgabry
Copy link
Member

jgabry commented Nov 12, 2020

So this all looks good. One thing I am not exactly sure and can't explain is that in the fread + as_draws_array case, half of the time is spent in the as_draws_array.

Ok yeah that looks good! I'm also surprised that half the time goes to creating the draws array. That does seem strange but I agree that's something to sort out separately so we can go ahead and merge this.

@jgabry
Copy link
Member

jgabry commented Nov 12, 2020

Ok I'm going to merge this now and then do 0.2.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants