-
-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace vroom with data.table::fread #318
Conversation
f664109
to
760e68c
Compare
Codecov Report
@@ Coverage Diff @@
## master #318 +/- ##
==========================================
+ Coverage 89.52% 89.77% +0.24%
==========================================
Files 12 12
Lines 2588 2563 -25
==========================================
- Hits 2317 2301 -16
+ Misses 271 262 -9
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, looks good! I made a few tiny comments, but otherwise the main thing is that it's a bit difficult for me to parse (pun intended) the changes to the read_csv_metadata()
function from the diff. It's a bit hard to follow. Are there particular lines that I should focus on checking?
Also do you know why the windows unit tests are failing but not mac or linux?
R/data.R
Outdated
chain_draws <- posterior::as_draws_df(posterior::subset_draws(draws, chain = chain)) | ||
colnames(chain_draws) <- unrepair_variable_names(variables) | ||
chain_draws <- posterior::subset_draws(draws, chain = chain) | ||
unname(chain_draws) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
-
Why do we need to remove the names here? That's fine if necessary, just curious.
-
Right now this isn't assigned to anything. I think you need
chain_draws <- unname(chain_draws)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to remove the names here? That's fine if necessary, just curious.
Because it otherwise writes the iteration ids in the CSV.
Right now this isn't assigned to anything. I think you need
Hm, this did help with that, will double-check.
tests/testthat/test-fit-mcmc.R
Outdated
# test_that("inv_metric method works after mcmc", { | ||
# skip_on_cran() | ||
# x <- fit_mcmc_1$inv_metric() | ||
# expect_length(x, fit_mcmc_1$num_chains()) | ||
# checkmate::expect_matrix(x[[1]]) | ||
# checkmate::expect_matrix(x[[2]]) | ||
# expect_equal(x[[1]], diag(diag(x[[1]]))) | ||
# | ||
# x <- fit_mcmc_1$inv_metric(matrix=FALSE) | ||
# expect_length(x, fit_mcmc_1$num_chains()) | ||
# expect_null(dim(x[[1]])) | ||
# checkmate::expect_numeric(x[[1]]) | ||
# checkmate::expect_numeric(x[[2]]) | ||
# | ||
# x <- fit_mcmc_2$inv_metric() | ||
# expect_length(x, fit_mcmc_2$num_chains()) | ||
# checkmate::expect_matrix(x[[1]]) | ||
# expect_false(x[[1]][1,2] == 0) # dense | ||
# }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all these lines commented out on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails on macOS machine on the CI. I am currently unable to debug further. If you have a few minutes, mind running this test if it fails for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I can check in a few min and let you know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test passes on my mac
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! Thats good news and bad news at the same time :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah good and bad. maybe we should let it run on CI again and I can see if I can debug it
Will explain a bit more once I get all tests passing. The main idea is that we read in all "metadata" lines and parse them then. Previously we read metadata lines line-by-line.
Its having problems finding grep. grep is a part of the Rtools installation so this is just a CI configuration thing I believe. |
Status update: still issues with tests. I can use cmdstanr with fread on Windows locally (run a model and read in draws works fine) but cant get tests to pass both locally or in CI. Non-windows side of thing is ready, however. |
Thanks for the update. Is the problem on windows still related to failing to find "grep"? |
Seems so. Though it seems to find it in “normal use” but not when running tests. Should have time look into this a bit more this week and finish this up. |
That's strange. Sorry for the hassle, that sounds annoying, but thanks for working on this! |
You don't have Cygwin installed on your local Windows machine, do you? Here's someone similarly having trouble with the rtools-installed grep, and I see that the data.table maintainer seems to suggest installing grep via Cygwin as the primary way to get it working. |
# Conflicts: # tests/testthat/test-csv.R
# Conflicts: # .github/workflows/Test-coverage.yaml # tests/testthat/test-fit-mle.R # tests/testthat/test-fit-shared.R # tests/testthat/test-model-compile.R # tests/testthat/test-model-sample.R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eureka!
I think I solved the Windows CI issue and this is ready for review and merge finally. That took way too long.
Posting a few comments to make the review easier. Apart from replacing vroom the biggest change is the refactored metadata read.
Thanks, @mike-lawrence for the suggestion. But even if this turned out to work, that would not be doable for cmdstanr as it would require users to also have Cygwin installed. I think I managed to find a workaround that works for Windows locally and in CI. |
I will run some benchmarks tomorrow to double check everything. |
Awesome! I'll try to review this soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious to see what the benchmarks say but aside from that this is great and I think it's ready to merge! I'll approve it now and then assuming the benchmarks look good we can go ahead and merge.
I think we also said we would tag version 0.2.0 after merging this, right? If that still sounds good to you then after this is merged we should update the DESCRIPTION file, the website, and upload the source package to stan-dev/r-packages. I'm happy to take care of that.
Benchmark results of :
I did not compare with our current implementation as we know it's slower by a lot for large cases. And speedup is not the only reason we want to replace vroom with data.table. All runs were done with 500 iterations and this model: data {
int N;
}
parameters {
real k;
}
transformed parameters {
real x[N] = rep_array(k, N);
}
model {
k ~ normal(0, 1);
} So this all looks good. One thing I am not exactly sure and can't explain is that in the f <- data.table::fread(cmd= paste0("grep -v '^#' ", fit$output_files()[1]))
d <- posterior::as_draws_array(f) So for the case of N=150000, its 5 seconds for fread and 4 seconds for as_draws_array. The latter seems a lot for that, but maybe I am missing something. Anyhow that is definitely something to figure out separately, not directly connected to this PR. |
For 0.2.0 I agree to do it after this is merged. I will also open another simple PR today to clean up some other issues, but none of them are critical. |
Ok yeah that looks good! I'm also surprised that half the time goes to creating the draws array. That does seem strange but I agree that's something to sort out separately so we can go ahead and merge this. |
Ok I'm going to merge this now and then do 0.2.0! |
Summary
Replaces vroom with data.table
Closes #299
closes #198
closes #262
Copyright and Licensing
Please list the copyright holder for the work you are submitting
(this will be you or your assignee, such as a university or company):
Rok Češnovar, Univ. of Ljubljana
By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses: