Replace vroom with data.table::fread #318

rok-cesnovar · 2020-10-16T13:37:25Z

Summary

Replaces vroom with data.table

Closes #299
closes #198
closes #262

Copyright and Licensing

Please list the copyright holder for the work you are submitting
(this will be you or your assignee, such as a university or company):
Rok Češnovar, Univ. of Ljubljana

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)

codecov-io · 2020-10-17T17:36:31Z

Codecov Report

Merging #318 (e1d1c60) into master (962e7c1) will increase coverage by 0.24%.
The diff coverage is 98.66%.

@@            Coverage Diff             @@
##           master     #318      +/-   ##
==========================================
+ Coverage   89.52%   89.77%   +0.24%     
==========================================
  Files          12       12              
  Lines        2588     2563      -25     
==========================================
- Hits         2317     2301      -16     
+ Misses        271      262       -9

| Impacted Files | Coverage Δ | |
|---|---|---|
| R/read_csv.R | 98.77% <98.46%> (+2.45%) | ⬆️ |
| R/data.R | 100.00% <100.00%> (ø) | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jgabry

Thanks for this, looks good! I made a few tiny comments, but otherwise the main thing is that it's a bit difficult for me to parse (pun intended) the changes to the read_csv_metadata() function from the diff. It's a bit hard to follow. Are there particular lines that I should focus on checking?

Also do you know why the windows unit tests are failing but not mac or linux?

jgabry · 2020-10-20T23:24:38Z

R/data.R

-          chain_draws <- posterior::as_draws_df(posterior::subset_draws(draws, chain = chain))
-          colnames(chain_draws) <- unrepair_variable_names(variables)
+          chain_draws <- posterior::subset_draws(draws, chain = chain)
+          unname(chain_draws)         


Two things:

1. Why do we need to remove the names here? That's fine if necessary, just curious.

2. Right now this isn't assigned to anything. I think you need

chain_draws <- unname(chain_draws)

rok-cesnovar · 2020-10-23

> Why do we need to remove the names here? That's fine if necessary, just curious.

Because otherwise it writes the iteration ids to the CSV.

> Right now this isn't assigned to anything. I think you need

Hm, this did help with that, will double-check.

jgabry · 2020-10-20T23:27:58Z

tests/testthat/test-fit-mcmc.R

+# test_that("inv_metric method works after mcmc", {
+#   skip_on_cran()
+#   x <- fit_mcmc_1$inv_metric()
+#   expect_length(x, fit_mcmc_1$num_chains())
+#   checkmate::expect_matrix(x[[1]])
+#   checkmate::expect_matrix(x[[2]])
+#   expect_equal(x[[1]], diag(diag(x[[1]])))
+#
+#   x <- fit_mcmc_1$inv_metric(matrix=FALSE)
+#   expect_length(x, fit_mcmc_1$num_chains())
+#   expect_null(dim(x[[1]]))
+#   checkmate::expect_numeric(x[[1]])
+#   checkmate::expect_numeric(x[[2]])
+#
+#   x <- fit_mcmc_2$inv_metric()
+#   expect_length(x, fit_mcmc_2$num_chains())
+#   checkmate::expect_matrix(x[[1]])
+#   expect_false(x[[1]][1,2] == 0) # dense
+# })


Are all these lines commented out on purpose?

rok-cesnovar · 2020-10-23

This test fails on the macOS machine in CI. I am currently unable to debug it further. If you have a few minutes, would you mind running this test to see whether it fails for you?

jgabry · 2020-10-23

Yeah I can check in a few min and let you know

jgabry · 2020-10-23

This test passes on my Mac

rok-cesnovar · 2020-10-23

Thanks! That's good news and bad news at the same time :)

jgabry · 2020-10-23

Yeah, good and bad. Maybe we should let it run on CI again and I can see if I can debug it.

rok-cesnovar · 2020-10-23T18:11:31Z

> Are there particular lines that I should focus on checking?

Will explain a bit more once I get all tests passing. The main idea is that we now read in all "metadata" lines at once and then parse them together. Previously we read and parsed the metadata line by line.
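
The batch approach described above can be sketched roughly as follows. This is a minimal base R illustration, not the code in R/read_csv.R: `parse_metadata_lines()` is a hypothetical helper, and it assumes that metadata lines start with `#` and look like `# key = value`.

```r
# Hypothetical helper, not part of cmdstanr: gather every comment line in
# one pass, then parse all "# key = value" pairs with vectorized calls.
parse_metadata_lines <- function(path) {
  lines <- readLines(path)
  meta <- lines[startsWith(lines, "#")]
  # Assumption: only comment lines containing "=" carry key/value metadata.
  meta <- meta[grepl("=", meta, fixed = TRUE)]
  keys <- trimws(sub("^#([^=]*)=.*$", "\\1", meta))
  values <- trimws(sub("^#[^=]*=(.*)$", "\\1", meta))
  as.list(stats::setNames(values, keys))
}

# Small made-up file standing in for a CmdStan CSV.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "# model = example_model",
  "# num_samples = 500",
  "lp__,k",
  "-0.5,0.1",
  "# Elapsed Time: 0.01 seconds"
), tmp)
meta <- parse_metadata_lines(tmp)
str(meta)
```

Collecting the comment lines first means the parsing step can use vectorized string functions instead of a per-line loop. Values are kept as character strings here; type conversion is left out of the sketch.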

> Also do you know why the windows unit tests are failing but not mac or linux?

It's having problems finding grep. grep is part of the Rtools installation, so I believe this is just a CI configuration issue.
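
A quick way to check whether a given R session can see grep is to look it up on the PATH. This is only a diagnostic sketch in base R, not the fix applied in this PR; running it both interactively and from inside a test would show whether the two environments differ.

```r
# Sys.which() returns "" for programs that cannot be found on the PATH
# of the current R session.
grep_path <- unname(Sys.which("grep"))
has_grep <- nzchar(grep_path)

if (has_grep) {
  message("grep found at: ", grep_path)
} else {
  message("grep not found; PATH is: ", Sys.getenv("PATH"))
}
```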

rok-cesnovar · 2020-11-01T12:18:58Z

Status update: still issues with tests. I can use cmdstanr with fread on Windows locally (running a model and reading in draws works fine), but I can't get the tests to pass either locally or in CI. The non-Windows side of things is ready, however.

jgabry · 2020-11-01T20:10:17Z

Thanks for the update. Is the problem on windows still related to failing to find "grep"?

rok-cesnovar · 2020-11-01T20:12:04Z

Seems so. Though it seems to find it in “normal use” but not when running tests. I should have time to look into this a bit more this week and finish this up.

jgabry · 2020-11-01T20:15:41Z

That's strange. Sorry for the hassle, that sounds annoying, but thanks for working on this!

mike-lawrence · 2020-11-04T21:43:10Z

You don't have Cygwin installed on your local Windows machine, do you? Here's someone similarly having trouble with the rtools-installed grep, and I see that the data.table maintainer seems to suggest installing grep via Cygwin as the primary way to get it working.

rok-cesnovar

Eureka!

I think I solved the Windows CI issue and this is ready for review and merge finally. That took way too long.

Posting a few comments to make the review easier. Apart from replacing vroom the biggest change is the refactored metadata read.

tests/testthat/test-fit-gq.R

R/read_csv.R

rok-cesnovar · 2020-11-11T19:54:45Z

> Here's someone similarly having trouble with the rtools-installed grep, and I see that the data.table maintainer seems to suggest installing grep via Cygwin as the primary way to get it working.

Thanks, @mike-lawrence, for the suggestion. But even if this turned out to work, it would not be feasible for cmdstanr, since it would require users to have Cygwin installed as well. I think I managed to find a workaround that works for Windows locally and in CI.

rok-cesnovar · 2020-11-11T20:51:28Z

I will run some benchmarks tomorrow to double check everything.

jgabry · 2020-11-11T20:56:47Z

Awesome! I'll try to review this soon.

jgabry

Curious to see what the benchmarks say but aside from that this is great and I think it's ready to merge! I'll approve it now and then assuming the benchmarks look good we can go ahead and merge.

I think we also said we would tag version 0.2.0 after merging this, right? If that still sounds good to you then after this is merged we should update the DESCRIPTION file, the website, and upload the source package to stan-dev/r-packages. I'm happy to take care of that.

rok-cesnovar · 2020-11-12T11:53:53Z

Benchmark results for:

- this branch
- rstan::read_stan_csv
- fread + as_draws_array (this is our floor, which we can't reach because we need to handle metadata)

I did not compare with our current implementation as we know it's slower by a lot for large cases. And speedup is not the only reason we want to replace vroom with data.table.

All runs were done with 500 iterations and this model:

data {
  int N;
}
parameters {
  real k;
}
transformed parameters {
  real x[N] = rep_array(k, N);
}
model {
  k ~ normal(0, 1);
}

[Plot: benchmark results, log scale]

So this all looks good. One thing I am not sure about and can't explain is that in the fread + as_draws_array case, half of the time is spent in as_draws_array.

f <- data.table::fread(cmd= paste0("grep -v '^#' ", fit$output_files()[1]))
d <- posterior::as_draws_array(f)

So for the case of N=150000, it's 5 seconds for fread and 4 seconds for as_draws_array. The latter seems like a lot, but maybe I am missing something. Anyhow, that is definitely something to figure out separately and is not directly connected to this PR.
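
To see how the time splits between the two stages, each call can be wrapped in `system.time()`. The sketch below assumes that data.table and posterior are installed and that grep is on the PATH; it uses a tiny made-up file in place of `fit$output_files()[1]`, so the timings it prints say nothing about the large-N case discussed here.

```r
# Tiny made-up stand-in for a CmdStan output file.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("# a comment line", "lp__,k", "-0.5,0.1", "-0.7,0.3"), csv_file)

# Stage 1: drop comment lines with grep and read the rest with fread.
t_read <- system.time(
  f <- data.table::fread(cmd = paste("grep -v '^#'", shQuote(csv_file)))
)

# Stage 2: convert the data.table to a draws_array.
t_convert <- system.time(
  d <- posterior::as_draws_array(f)
)

# Elapsed wall-clock seconds per stage.
print(c(fread = t_read[["elapsed"]], as_draws_array = t_convert[["elapsed"]]))
```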

rok-cesnovar · 2020-11-12T17:07:57Z

For 0.2.0 I agree to do it after this is merged. I will also open another simple PR today to clean up some other issues, but none of them are critical.

jgabry · 2020-11-12T17:40:09Z

> So this all looks good. One thing I am not sure about and can't explain is that in the fread + as_draws_array case, half of the time is spent in as_draws_array.

Ok yeah that looks good! I'm also surprised that half the time goes to creating the draws array. That does seem strange but I agree that's something to sort out separately so we can go ahead and merge this.

jgabry · 2020-11-12T17:42:01Z

Ok I'm going to merge this now and then do 0.2.0!

rok-cesnovar added 3 commits October 16, 2020 13:10

replace vroom in read_csv

cc32b93

replace vroom for write and replace vroom in DESCRIPTION

5a13c0d

fix tests

b33de6f

rok-cesnovar force-pushed the fread branch from 51ae844 to d504fa3 on October 16, 2020 14:24

test

980d82f

rok-cesnovar force-pushed the fread branch from d504fa3 to 980d82f on October 16, 2020 18:20

rok-cesnovar added 6 commits October 16, 2020 20:33

cleanup prints

d75f277

test on non-windows only

d0e65cd

debug

b8b9010

debug2

6c117c3

debug3

11da7c4

debug3

29654e1

rok-cesnovar force-pushed the fread branch from 29654e1 to 546453c on October 16, 2020 20:08

rok-cesnovar changed the title from "Fread" to "Replace vroom with data.table::fread" on Oct 16, 2020

rok-cesnovar force-pushed the fread branch 2 times, most recently from f664109 to 760e68c on October 17, 2020 17:05

Merge branch 'master' into fread

34fcbe1

rok-cesnovar force-pushed the fread branch from c7f99f0 to 34fcbe1 on October 17, 2020 18:21

jgabry reviewed Oct 20, 2020

mike-lawrence mentioned this pull request Nov 9, 2020

Consider more efficient serialization of CmdStanFit objects #340

Closed

rok-cesnovar added 3 commits November 11, 2020 14:05

Merge branch 'fread' of https://github.com/stan-dev/cmdstanr into fread

dee701f

# Conflicts: # tests/testthat/test-csv.R

Merge branch 'master' into fread

6fe57ce

# Conflicts: # .github/workflows/Test-coverage.yaml # tests/testthat/test-fit-mle.R # tests/testthat/test-fit-shared.R # tests/testthat/test-model-compile.R # tests/testthat/test-model-sample.R

cleanup

05fa40e

rok-cesnovar added 8 commits November 11, 2020 14:40

more cleanup

8e5ccc5

fix test

ca0d5b4

.exe for windows

38fa381

try full path

cddf1c7

explicitly add path to grep

bd31b44

restore pre-debug state

a6347f5

fix indent

46b3a46

fix another indent

d37a2dc

rok-cesnovar commented Nov 11, 2020

tests/testthat/test-fit-gq.R

R/read_csv.R

R/read_csv.R

R/read_csv.R

R/read_csv.R

jgabry approved these changes Nov 12, 2020

jgabry and others added 3 commits November 11, 2020 18:57

Update NEWS.md

c7182bb

speedup parsing parameters

eee2fb6

fix for fixed_param

e1d1c60

jgabry merged commit baed002 into master Nov 12, 2020

jgabry deleted the fread branch November 12, 2020 17:55

rok-cesnovar mentioned this pull request Nov 12, 2020

Minor fixes and version 0.2.0 #343

Merged

2 tasks

helske mentioned this pull request Apr 4, 2024

Faster extraction of posterior samples from Stan output ropensci/dynamite#78

Closed
