
MPI sampling #350

Merged: 30 commits from the mpi branch into master on Dec 4, 2020
Conversation

@rok-cesnovar (Member) commented Nov 15, 2020

Summary

A draft for MPI sampling. There are still details to hash out; I just wanted to get a version out.

library(cmdstanr)
setwd("~/Desktop/testing/mpi/")
cmdstan_make_local(cpp_options = list("CXX"="mpicxx", "TBB_CXX_TYPE"="gcc"))
rebuild_cmdstan(cores = 4)

mod_mpi <- cmdstan_model("logistic1.stan", cpp_options = list(stan_mpi = TRUE))
f <- mod_mpi$mpi_sample(data = "redcard_input.R", chains = 1, iter_warmup = 1000, iter_sampling = 1000, n = 4)
# you can use mpirun 
# f <- mod_mpi$mpi_sample(data = "redcard_input.R", chains = 1, iter_warmup = 1000, iter_sampling = 1000, n = 4, mpicmd = "mpirun")

Files from: https://github.com/rmcelreath/cmdstan_map_rect_tutorial

> f <- mod_mpi$mpi_sample(data = "redcard_input.R", chains = 1, iter_warmup = 1000, iter_sampling = 1000, n= 4)
Running MCMC with 1 chain...

Running mpiexec -n 4 /home/rok/Desktop/testing/mpi/logistic1_mpi 'id=1' random 'seed=1071310193' data \
  'file=/home/rok/Desktop/testing/mpi/redcard_input.R' output 'file=/tmp/Rtmph90xbK/logistic1_mpi-202011152151-1-5ec763.csv' \
  'method=sample' 'num_samples=1000' 'num_warmup=1000' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 62.3 seconds.

@yizhang-yiz sorry this took so long. Whenever you have time and if you are still interested, would you try it out and give your thoughts?

Use

remotes::install_github("stan-dev/cmdstanr@mpi")

to install this version.

Copyright and Licensing

Please list the copyright holder for the work you are submitting
(this will be you or your assignee, such as a university or company):
Rok Češnovar, Uni. of Ljubljana
Yi Zhang (initial version for testing)

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

@rok-cesnovar (Member Author) left a comment


Question/TODO for now:

  • do we want a separate $mpi_sample()? I now think we could also work with $sample()
  • names for the number of MPI processes and mpi command arguments? Ideas welcome.
  • CI, need to set up a separate test for this
  • figure out why mod_mpi <- cmdstan_model("logistic1.stan", cpp_options = list("CXX"="mpicxx", stan_mpi = TRUE, "TBB_CXX_TYPE"="gcc")) fails. It should not.

@mitzimorris (Member)

is this something folks want in Python as well?

@rok-cesnovar (Member Author)

No idea honestly. MPI is a part of cmdstan and I figure we should support it.

Probably no one will use cmdstanx on a cluster, but I think it's worth it to support stuff like the cross-chain warmup Yi et al. are working on, especially given that it really does not seem to be that big of a maintenance burden (famous last words).

@jgabry (Member) commented Nov 16, 2020

No idea honestly. MPI is a part of cmdstan and I figure we should support it.

Yeah I don't really know either. But yeah fine by me to support it if it's in CmdStan (and isn't getting deprecated anytime soon).

Probably no one will use cmdstanx on a cluster

Curious, what makes you think that?

@rok-cesnovar (Member Author)

and isn't getting deprecated anytime soon

I think there is a big enough user base and use case that this is not going to happen.

Curious, what makes you think that?

Mostly due to the way jobs are submitted to the typical cluster via job submission scripts.

@yizhang-yiz commented Nov 16, 2020

@yizhang-yiz sorry this took so long. Whenever you have time and if you are still interested, would you try it out and give your thoughts?

mmm, on the develop branch of cmdstan I'm getting

/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build/toolset.jam:44: in toolset.using
ERROR: rule "other.init" unknown in module "toolset".
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:543: in process-explicit-toolset-requests
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:610: in load
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/modules.jam:295: in import
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/bootstrap.jam:139: in boost-build
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/boost-build.jam:17: in module scope
other.jam: No such file or directory

when running

library(cmdstanr)
set_cmdstan_path("cmdstan")
cmdstan_make_local(cpp_options = list("CXX"="mpicxx", "TBB_CXX_TYPE"="gcc"))
rebuild_cmdstan(cores = 4)

@rok-cesnovar (Member Author)

You had TBB_CXX_TYPE=clang for your system, I believe. Might that be it?

@yizhang-yiz

You're right. gcc on macOS is just an alias of clang, but Boost got misled by the name.

@mitzimorris (Member)

Mostly due to the way jobs are submitted to the typical cluster via job submission scripts.

Why do you say this? People who are used to working in R or Python will use those languages to set up job submission accordingly - see the Discourse discussion https://discourse.mc-stan.org/t/correct-way-to-use-mpi-with-cmdstanpy/17667, where I wrote an example job submission script: https://discourse.mc-stan.org/t/correct-way-to-use-mpi-with-cmdstanpy/17667/2?u=mitzimorris

@yizhang-yiz

Funny, I'm still getting the same error with cmdstan master:

bash-3.2$ git branch
WARNING: terminal is not fully functional
-  (press RETURN)
  develop
* master
bash-3.2$ cat make/local
STAN_MPI=1
CXX=mpicxx
TBB_CXX_TYPE=clang
bash-3.2$ make clean-all;make -j4 build
rm -f -r test
rm -f 
rm -f 
rm -f 
rm -f 
  removing dependency files
rm -f    
rm -f   
rm -f   
  cleaning sundials targets
rm -f 
  cleaning mpi targets
rm -f 
rm -f -r stan/lib/stan_math/lib/boost_1.72.0/stage/lib stan/lib/stan_math/lib/boost_1.72.0/project-config.jam stan/lib/stan_math/lib/boost_1.72.0/b2 stan/lib/stan_math/lib/boost_1.72.0/bootstrap.log
  cleaning Intel TBB targets
rm -f -rf stan/lib/stan_math/lib/tbb
rm -f bin/stanc bin/stanc2 bin/stansummary bin/print bin/diagnose
rm -f -r src/cmdstan/main*.o bin/cmdstan
rm -f 
rm -f examples/bernoulli/bernoulli examples/bernoulli/bernoulli.o examples/bernoulli/bernoulli.d examples/bernoulli/bernoulli.hpp
rm -f -r stan/lib/stan_math/lib/boost_1.72.0/stage/lib stan/lib/stan_math/lib/boost_1.72.0/project-config.jam stan/lib/stan_math/lib/boost_1.72.0/b2 stan/lib/stan_math/lib/boost_1.72.0/bootstrap.log
curl -L https://github.com/stan-dev/stanc3/releases/download/nightly/mac-stanc -o bin/stanc --retry 5 --retry-delay 10
mpicxx -std=c++1y -D_REENTRANT -Wno-ignored-attributes    -Wno-delete-non-virtual-dtor  -I stan/lib/stan_math/lib/tbb_2019_U8/include   -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.7 -I stan/lib/stan_math/lib/boost_1.72.0 -I stan/lib/stan_math/lib/sundials_5.2.0/include    -DBOOST_DISABLE_ASSERTS        -c -fvisibility=hidden -o bin/cmdstan/stansummary.o src/cmdstan/stansummary.cpp
cd stan/lib/stan_math/lib/boost_1.72.0; ./bootstrap.sh
mpicxx -std=c++1y -D_REENTRANT -Wno-ignored-attributes    -Wno-delete-non-virtual-dtor  -I stan/lib/stan_math/lib/tbb_2019_U8/include   -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.7 -I stan/lib/stan_math/lib/boost_1.72.0 -I stan/lib/stan_math/lib/sundials_5.2.0/include    -DBOOST_DISABLE_ASSERTS        -c -fvisibility=hidden -o bin/cmdstan/print.o src/cmdstan/print.cpp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
         Building Boost.Build engine with toolset clang...                         Dload  Upload   Total   Spent    Left  Speed
100   635  100   635    0     0   1951      0 --:--:-- --:--:-- --:--:--  1947
100 10.4M  100 10.4M    0     0  4810k      0  0:00:02  0:00:02 --:--:-- 8840k
chmod +x bin/stanc
mpicxx -std=c++1y -D_REENTRANT -Wno-ignored-attributes    -Wno-delete-non-virtual-dtor  -I stan/lib/stan_math/lib/tbb_2019_U8/include   -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.7 -I stan/lib/stan_math/lib/boost_1.72.0 -I stan/lib/stan_math/lib/sundials_5.2.0/include    -DBOOST_DISABLE_ASSERTS        -c -fvisibility=hidden -o bin/cmdstan/diagnose.o src/cmdstan/diagnose.cpp
tools/build/src/engine/b2
Detecting Python version... 2.7
Detecting Python root... /usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7
Unicode/ICU support for Boost.Regex?... not found.
Generating Boost.Build configuration in project-config.jam for clang...

Bootstrapping is done. To build, run:

    ./b2
    
To generate header files, run:

    ./b2 headers

To adjust configuration, edit 'project-config.jam'.
Further information:

   - Command line help:
     ./b2 --help
     
   - Getting started guide: 
     http://www.boost.org/more/getting_started/unix-variants.html
     
   - Boost.Build documentation:
     http://www.boost.org/build/

cd stan/lib/stan_math/lib/boost_1.72.0; ./b2  toolset=other --visibility=hidden --with-program_options cxxstd=11 variant=release link=static
other.jam: No such file or directory
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build/toolset.jam:44: in toolset.using
ERROR: rule "other.init" unknown in module "toolset".
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:543: in process-explicit-toolset-requests
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:610: in load
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/modules.jam:295: in import
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/bootstrap.jam:139: in boost-build
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/boost-build.jam:17: in module scope
make: *** [stan/lib/stan_math/lib/boost_1.72.0/stage/lib/libboost_program_options.a] Error 1
make: *** Waiting for unfinished jobs....

@rok-cesnovar (Member Author)

You don't need to set stan_mpi before rebuilding. You can just use it when compiling the model. Will look into why it's failing on rebuild though.

@rok-cesnovar (Member Author)

As of 2.24 it's no longer required to rebuild cmdstan upon setting the MPI/threads/OpenCL flags. If you set them for a model, the main.o is rebuilt automatically.
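
For example, setting the flag per model (this mirrors the cmdstan_model() call in the summary above):

library(cmdstanr)
# compile this model with MPI support; main.o is rebuilt automatically
mod_mpi <- cmdstan_model("logistic1.stan", cpp_options = list(stan_mpi = TRUE))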

@yizhang-yiz

I don't think it's caused by rebuilding or STAN_MPI=1. With the following make/local the build fails.

CXX=mpicxx
TBB_CXX_TYPE=clang

@rok-cesnovar (Member Author)

Ok, let me check that. This error on make build doesn't actually prevent building and using models, it's just a problem for stansummary. We do need to check it out.

Hopefully I can replicate it on one of the machines I have access to and it's not a macOS-specific problem.

@yizhang-yiz commented Nov 17, 2020

Let me see if I can reproduce the issue on Ubuntu, one sec.

@rok-cesnovar marked this pull request as ready for review November 17, 2020 06:18
@yizhang-yiz

I can confirm that Linux (Ubuntu) builds fine.

@rok-cesnovar (Member Author)

why do you say this?

I take that back. That obviously works fine.

I can confirm that Linux (Ubuntu) builds fine.

Thanks!

Apart from figuring out the build issue (which is a cmdstan issue anyway), the other question is how to set up the arguments. There are at least the following options:
a)

mod$sample(..., mpi_cmd = "mpiexec", mpi_nprocess = 5, mpi_args = c(...))

b)

mod$mpi_sample(..., mpi_cmd = "mpiexec", mpi_nprocess = 5, mpi_args = c(...))

c)

mod$sample(..., mpi_cmd = "mpiexec", mpi_args = c("-n", 4, ...))

d)

mod$mpi_sample(..., mpi_cmd = "mpiexec", mpi_args = c("-n", 4, ...))

mpi_cmd could also be an R option (set with options("cmdstanr_mpi_cmd") or something), since it's most likely going to be the standardized mpiexec and used in the majority of cases, though some prefer mpirun. Or it can be a regular argument with the default "mpiexec" that most will not touch.
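
A minimal sketch of the R-option variant (the option name cmdstanr_mpi_cmd is only a placeholder here, not a decided API):

# resolve the MPI launcher from a global option, falling back to the standard mpiexec
mpi_cmd <- getOption("cmdstanr_mpi_cmd", default = "mpiexec")
# a user who prefers mpirun would set it once per session:
# options(cmdstanr_mpi_cmd = "mpirun")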

The other thing is whether we should separate the -n/-np argument (mpi_nprocess in the above example, name still TBD) from the rest of the optional MPI arguments, or just have those be one argument (mpi_args for example).

I do not think we actually need to separate MPI sampling into a separate function, so I would go with either a) or c), slightly preferring c) because we do not have to deal with someone defining both mpi_nprocess and -n in the args.

@jgabry (Member) commented Nov 17, 2020

slightly preferring c) because we do not have to deal with someone defining both mpi_nprocess and -n in the args.

If we don't have a separate method mpi_sample (or sample_mpi) then wouldn't we need an additional argument indicating whether to even try mpi? Or are you thinking that we would automatically try mpi if mpi_cmd and mpi_args are specified?

@mitzimorris What do you think about these options? If this is going to be implemented in CmdStanPy too then we should make sure to coordinate on this.

@yizhang-yiz

I'd pick a) because frequently the # of procs is all one supplies for MPI runs. It'll make the args slightly cleaner.

@rok-cesnovar (Member Author)

If we don't have a separate method mpi_sample (or sample_mpi) then wouldn't we need an additional argument indicating whether to even try mpi?

There are the following options:

  • the model was compiled with MPI and the user specifies MPI args -> fine

  • the model was compiled with MPI and the user does not specify MPI args -> runs without MPI (singleton MPI), works fine (and should per MPI standard)

  • the model was not compiled with MPI and the user specifies MPI args -> also works, but this is just running the same chain N times, so it's inefficient but not problematic.
    We can catch this and stop almost immediately or just leave it to complete. But we have this same problem even with a separate function. There would be a cleaner way if "Add ./model compile_info" cmdstan#887 were fixed (still waiting for review in "Add virtual function model_compile_info to the model_base.hpp" stan#2932).

  • the model was not compiled with MPI and the user does not specify MPI args -> fine

I'd pick a) because frequently the # of procs is all one supplies for MPI runs. It'll make the args slightly cleaner.

Cool. You definitely have waaay more experience running these, so I am going to trust your opinion here. Count me in for a) as well.

@jgabry (Member) commented Nov 17, 2020

Ok I'm also inclined to trust @yizhang-yiz's opinion since he has the most experience with MPI (I've never even tried using it!).

@mitzimorris What do you think about the proposed function signature?

@rok-cesnovar (Member Author) commented Nov 17, 2020

And the argument names would then be mpi_cmd (rarely used), mpi_n/mpi_np/mpi_nprocess (n/np isn't the most descriptive but is commonly known in the MPI world), and mpi_args (rarely used).

@yizhang-yiz

What we need to take into consideration is how this would interact with HPC task schedulers. Things like SLURM ask for the # of procs when a job is submitted, and this input should not conflict with what we put in a) (or c)). So maybe we can allow mpi_nprocess to be eclipsed. But then we'll need to come up with something informative to alert the user.
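
A hypothetical sketch of that interaction from the user's side, assuming SLURM exposes the allocation via the SLURM_NTASKS environment variable (the mpi_args = list("n" = 4) form in the comments is the one used later in this thread):

# inside a scheduler-managed job (e.g. sbatch --ntasks=4) the launcher can usually
# pick up the process count from the allocation, so the user could omit it there
# and only pass it explicitly for local runs
slurm_ntasks <- Sys.getenv("SLURM_NTASKS", unset = NA)
if (is.na(slurm_ntasks)) {
  message("No scheduler allocation detected; pass the process count explicitly, e.g. mpi_args = list(\"n\" = 4).")
} else {
  message("Using the ", slurm_ntasks, " processes allocated by SLURM.")
}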

@jgabry (Member) commented Nov 17, 2020

One reason I might slightly prefer a separate mpi_sample (or sample_mpi) method is because there seems to be a decent amount of documentation that we'll need just for the mpi stuff. So having a separate method could be much cleaner from a doc perspective (there's already a ton of doc for the existing sample method). That said, I suppose even if we just add arguments to the original sample method we could still have a separate doc page for mpi stuff. Not sure, but that's also something to consider.

@yizhang-yiz commented Nov 17, 2020

@jgabry has a point. An additional benefit of mpi_sample is that we'll be able to put complexities such as interaction with the cluster scheduler in a controlled environment, because I'm still not sure how map_rect or cross-chain warmup will evolve and how users would be using them. If we do this, maybe a catch-all mpi_args = c("-n", 4, ...) makes more sense. We can always add wrappers later on top of this bare-bones call.

@rok-cesnovar (Member Author)

Things like SLURM ask for the # of procs when a job is submitted

You mean like the user sets N but SLURM overrides that with M? Not sure if we can catch that in cmdstan(r). Apart from maybe running external commands?

seems to be a decent amount of documentation that we'll need just for the mpi stuff

Good point, yeah. We can start with mod$mpi_sample() then and see how things evolve. threads_per_chain is obviously useless for mpi_sample(). How about parallel_chains? Would anyone start 4 chains with N processes?

Thanks for the input @yizhang-yiz !

@yizhang-yiz

You mean like the user sets N but SLURM overrides that with M? Not sure if we can catch that in cmdstan(r).

I don't think we can, which is why we'll need to hand the decision to the user so they can choose which arg to provide to cmdstanr; if mpi_nprocess is not provided, we give a message but don't intervene.

@jgabry (Member) commented Dec 2, 2020

I just made a few small edits. This seems ready (thanks @rok-cesnovar) so approving now. But @yizhang-yiz if you have time can you try it out one more time and see if the doc is missing anything important?

@yizhang-yiz

Sorry guys I missed the thread yesterday. I can play with it later today.

@jgabry (Member) commented Dec 3, 2020

Thanks Yi!

@yizhang-yiz

Works like a charm! Thank you! @rok-cesnovar @jgabry

@rok-cesnovar (Member Author)

Thank you for the insight, discussion, and testing!

Will then go ahead and merge. For now, this will be available in the GitHub version, but we will most likely do a 0.2.3 or 0.3.0 release soon-ish.

@rok-cesnovar merged commit 291ceb3 into master Dec 4, 2020
@rok-cesnovar deleted the mpi branch December 4, 2020 10:37
@yizhang-yiz

So I decided to try this on the cross-chain warmup I was working on.

library("cmdstanr")
cmdstan_make_local(cpp_options = list("MPI_ADAPTED_WARMUP" = "1", "TBB_CXX_TYPE" = "clang"))
rebuild_cmdstan()
mod <- cmdstan_model("cmdstan/examples/eight_schools.stan", quiet = FALSE, force_recompile = TRUE)
f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200, output_dir = "cmdstan/examples/eight_schools")

Output:

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 iteration: 100 window: 1 / 1 Rhat: 1.0289 ESS: 149.4547 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 iteration: 200 window: 1 / 2 Rhat: 1.0006 ESS: 373.0270 
Chain 1 iteration: 200 window: 2 / 2 Rhat: 1.0002 ESS: 233.4405 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
Error: Supplied CSV file is corrupt!

The warmup algorithm works well, but it looks like cmdstan's I/O is taken over by R: a cmdstan run would return 4 CSV files from the 4 communicating chains, but in the working directory there's only one empty CSV. Any idea what's going on? @rok-cesnovar

@rok-cesnovar (Member Author)

Can you specify validate_csv = FALSE and see if the CSVs remain without the error? The CSVs might get deleted because of the error.

Will take a look; one of the reasons I was so keen on getting MPI into cmdstanr was testing this cross-chain warmup.

@yizhang-yiz

validate_csv = FALSE doesn't help. The output remains a single empty CSV file, while I expect 4 files, one for each chain. The easiest way to access this experimental feature is to use the Torsten repo:
https://github.com/metrumresearchgroup/Torsten

and set cmdstan path to Torsten/cmdstan.
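
For example (the local clone path below is just a placeholder):

library(cmdstanr)
# point cmdstanr at the CmdStan that ships inside the Torsten clone
set_cmdstan_path("~/Work/Torsten/cmdstan")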

@rok-cesnovar (Member Author)

Ok, thanks, will take a look now. How do the files differ? By suffix? Say I give output file = test.csv

@yizhang-yiz

How do the files differ?

I don't see anything different:

> f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200,output_dir="cmdstan/examples/eight_schools",validate_csv=FALSE)
Running MCMC with 1 chain...
...
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
> f$output_files()
[1] "/Users/yiz/Work/Torsten/cmdstan/examples/eight_schools/eight_schools-202012080911-1-818735.csv"
> f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200,output_dir="cmdstan/examples/eight_schools")
Running MCMC with 1 chain...
...
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.1 seconds.
Error: Supplied CSV file is corrupt!
> f$output_files()
[1] "/Users/yiz/Work/Torsten/cmdstan/examples/eight_schools/eight_schools-202012080911-1-818735.csv"

While in cmdstan I'm getting

bash-3.2$ make -j4 examples/eight_schools/eight_schools && cd examples/eight_schools/
bash-3.2$ mpiexec -n 4 -l ./eight_schools sample data file=eight_schools.data.R
...
[0]  Elapsed Time: 0.016 seconds (Warm-up)
[0]                0.07 seconds (Sampling)
[0]                0.086 seconds (Total)
[0] 
bash-3.2$ ls *.csv
mpi.0.output.csv	mpi.1.output.csv	mpi.2.output.csv	mpi.3.output.csv

@rok-cesnovar (Member Author)

Thanks, I was asking about mpi.0.output.csv, mpi.1.output.csv, mpi.2.output.csv, mpi.3.output.csv, but worded it weirdly.

@rok-cesnovar (Member Author) commented Dec 8, 2020

This is the command that is used:

mpiexec -n 4 /home/rok/Desktop/test_models/eight_schools/eight_schools 'id=1' random 'seed=208921277' data 'file=/home/rok/Desktop/test_models/eight_schools/schools.data.json' output 'file=/home/rok/Desktop/test_models/eight_schools/eight_schools-202012081831-1-6f6627.csv' 'refresh=200' 'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'

Running this in the command line produces the same thing (1 CSV).

@rok-cesnovar (Member Author)

You can install remotes::install_github("stan-dev/cmdstanr@echomd"), which will print the command.

@yizhang-yiz

Great! Thanks. Then that's likely caused by bugs in my code.

@rok-cesnovar (Member Author)

I am not so sure just yet.

@yizhang-yiz

I can confirm that replacing the output file in the above command with output 'file=eight_schools-202012081831-1-6f6627.csv' makes it work:

bash-3.2$ mpiexec -n 4 ./eight_schools 'id=1' random 'seed=208921277' data 'file=eight_schools.data.R' output 'file=eight_schools-202012081831-1-6f6627.csv' 'refresh=200' 'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1' &> out.log
bash-3.2$ ls *.csv
mpi.0.eight_schools-202012081831-1-6f6627.csv	mpi.2.eight_schools-202012081831-1-6f6627.csv
mpi.1.eight_schools-202012081831-1-6f6627.csv	mpi.3.eight_schools-202012081831-1-6f6627.csv

so I must have messed up the ostream path.

@rok-cesnovar (Member Author)

I would say this is the culprit, yes, if the file is specified with an absolute path:

data
  file = /home/rok/Desktop/test_models/eight_schools/schools.data.json
init = 2 (Default)
random
  seed = 106570872
output
  file = mpi.0./home/rok/Desktop/test_models/eight_schools/eight_schools-202012081852-1-13cb4f.csv
  diagnostic_file =  (Default)
  refresh = 200
  sig_figs = -1 (Default)
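
In other words, the per-rank prefix ends up prepended to the whole absolute path instead of to the file name. A minimal R illustration of the difference (this snippet is purely illustrative, not code from this PR or from Torsten):

path <- "/home/rok/Desktop/test_models/eight_schools/eight_schools-202012081852-1-13cb4f.csv"
rank <- 0
# broken: prefixing the whole path gives "mpi.0./home/rok/..."
paste0("mpi.", rank, ".", path)
# working: keep the directory and tag only the file name
file.path(dirname(path), paste0("mpi.", rank, ".", basename(path)))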

@rok-cesnovar mentioned this pull request Dec 8, 2020
@yizhang-yiz commented Dec 8, 2020

Just fixed it, now it works

f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200,output_dir="cmdstan/examples/eight_schools",validate_csv=FALSE)
Running MCMC with 1 chain...

Chain 1         stepsize_jitter = 0 (Default) 
Chain 1 id = 1 
Chain 1 data 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools.data.R 
Chain 1 init = 2 (Default) 
Chain 1 random 
Chain 1   seed = 604574839 
Chain 1 output 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.mpi.1.csv 
Chain 1   diagnostic_file =  (Default) 
Chain 1   refresh = 200 
Chain 1   sig_figs = -1 (Default) 
Chain 1       num_cross_chains = 4 (Default) 
Chain 1       cross_chain_window = 100 (Default) 
Chain 1       cross_chain_rhat = 1.05 (Default) 
Chain 1       cross_chain_ess = 200 (Default) 
Chain 1     algorithm = hmc (Default) 
Chain 1       hmc 
Chain 1         engine = nuts (Default) 
Chain 1           nuts 
Chain 1             max_depth = 10 (Default) 
Chain 1         metric = diag_e (Default) 
Chain 1         metric_file =  (Default) 
Chain 1         stepsize = 1 (Default) 
Chain 1         stepsize_jitter = 0 (Default) 
Chain 1 id = 2 
Chain 1 data 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools.data.R 
Chain 1 init = 2 (Default) 
Chain 1 random 
Chain 1   seed = 604574839 
Chain 1 output 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.mpi.2.csv 
Chain 1   diagnostic_file =  (Default) 
Chain 1   refresh = 200 
Chain 1   sig_figs = -1 (Default) 
Chain 1       t0 = 10 (Default) 
Chain 1       init_buffer = 75 (Default) 
Chain 1       term_buffer = 50 (Default) 
Chain 1       window = 25 (Default) 
Chain 1       num_cross_chains = 4 (Default) 
Chain 1       cross_chain_window = 100 (Default) 
Chain 1       cross_chain_rhat = 1.05 (Default) 
Chain 1       cross_chain_ess = 200 (Default) 
Chain 1     algorithm = hmc (Default) 
Chain 1       hmc 
Chain 1         engine = nuts (Default) 
Chain 1           nuts 
Chain 1             max_depth = 10 (Default) 
Chain 1         metric = diag_e (Default) 
Chain 1         metric_file =  (Default) 
Chain 1         stepsize = 1 (Default) 
Chain 1         stepsize_jitter = 0 (Default) 
Chain 1 id = 3 
Chain 1 data 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools.data.R 
Chain 1 init = 2 (Default) 
Chain 1 random 
Chain 1   seed = 604574839 
Chain 1 output 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.mpi.3.csv 
Chain 1   diagnostic_file =  (Default) 
Chain 1   refresh = 200 
Chain 1   sig_figs = -1 (Default) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 iteration: 100 window: 1 / 1 Rhat: 1.0212 ESS: 79.6405 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 iteration: 200 window: 1 / 2 Rhat: 1.0181 ESS: 170.9020 
Chain 1 iteration: 200 window: 2 / 2 Rhat: 1.0141 ESS: 135.0918 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 iteration: 300 window: 1 / 3 Rhat: 1.0104 ESS: 310.1229 
Chain 1 iteration: 300 window: 2 / 3 Rhat: 1.0052 ESS: 275.0363 
Chain 1 iteration: 300 window: 3 / 3 Rhat: 1.0006 ESS: 144.5106 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
...
Chain 1 finished in 0.1 seconds.

Though there's still a dummy CSV file generated and pointed to by output_files():

f$output_files()
[1] "/Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.csv"

In addition, what's the best way to add custom options to the sample_mpi call for the following?

mpiexec -n 4 ./eight_schools sample adapt cross_chain_ess=400 data file...

Here cross_chain_ess is an extra option under the adapt family.

@rok-cesnovar (Member Author) commented Dec 8, 2020

Just fixed it, now it works

Yay!

Though there's still a dummy CSV file generated and pointed to by

Hm, that would probably be because of https://github.com/stan-dev/cmdstanr/blob/master/R/args.R#L99
I am not actually sure we need that but would have to check.

In addition, what's the best way to add custom options to sample_mpi call for the following?

See commit dbee414 and just duplicate for other args :)
