
MPI sampling #350

Merged: 30 commits from the mpi branch into master on Dec 4, 2020
Conversation

@rok-cesnovar (Member) commented Nov 15, 2020

Summary

A draft for MPI sampling. There are still details to hash out; I just wanted to get a version out.

library(cmdstanr)
setwd("~/Desktop/testing/mpi/")
cmdstan_make_local(cpp_options = list("CXX"="mpicxx", "TBB_CXX_TYPE"="gcc"))
rebuild_cmdstan(cores = 4)

mod_mpi <- cmdstan_model("logistic1.stan", cpp_options = list(stan_mpi = TRUE))
f <- mod_mpi$mpi_sample(data = "redcard_input.R", chains = 1, iter_warmup = 1000, iter_sampling = 1000, n = 4)
# you can use mpirun 
# f <- mod_mpi$mpi_sample(data = "redcard_input.R", chains = 1, iter_warmup = 1000, iter_sampling = 1000, n = 4, mpicmd = "mpirun")

Files from: https://github.com/rmcelreath/cmdstan_map_rect_tutorial

> f <- mod_mpi$mpi_sample(data = "redcard_input.R", chains = 1, iter_warmup = 1000, iter_sampling = 1000, n= 4)
Running MCMC with 1 chain...

Running mpiexec -n 4 /home/rok/Desktop/testing/mpi/logistic1_mpi 'id=1' random 'seed=1071310193' data \
  'file=/home/rok/Desktop/testing/mpi/redcard_input.R' output 'file=/tmp/Rtmph90xbK/logistic1_mpi-202011152151-1-5ec763.csv' \
  'method=sample' 'num_samples=1000' 'num_warmup=1000' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 62.3 seconds.

@yizhang-yiz sorry this took so long. Whenever you have time and if you are still interested, would you try it out and give your thoughts?

Use

remotes::install_github("stan-dev/cmdstanr@mpi")

to install this version.

Copyright and Licensing

Please list the copyright holder for the work you are submitting
(this will be you or your assignee, such as a university or company):
Rok Češnovar, Uni. of Ljubljana
Yi Zhang (initial version for testing)

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

@rok-cesnovar (Member Author) left a comment


Question/TODO for now:

  • do we want a separate $mpi_sample()? I now think we could also work with $sample()
  • names for the number of MPI processes and mpi command arguments? Ideas welcome.
  • CI, need to set up a separate test for this
  • figure out why mod_mpi <- cmdstan_model("logistic1.stan", cpp_options = list("CXX"="mpicxx", stan_mpi = TRUE, "TBB_CXX_TYPE"="gcc")) fails. It should not.

@mitzimorris (Member)

is this something folks want in Python as well?

@rok-cesnovar (Member Author)

No idea honestly. MPI is a part of cmdstan and I figure we should support it.

Probably no one will use cmdstanx on a cluster, but I think it's worth it to support stuff like the cross-chain warmup Yi et al. are working on, especially given that it really does not seem to be that big of a maintenance burden (famous last words).

@jgabry (Member) commented Nov 16, 2020

No idea honestly. MPI is a part of cmdstan and I figure we should support it.

Yeah I don't really know either. But yeah fine by me to support it if it's in CmdStan (and isn't getting deprecated anytime soon).

Probably no one will use cmdstanx on a cluster

Curious, what makes you think that?

@rok-cesnovar (Member Author)

and isn't getting deprecated anytime soon

I think there is a big enough user base and use case that this is not going to happen.

Curious, what makes you think that?

Mostly due to the way jobs are submitted to the typical cluster via job submission scripts.

@yizhang-yiz commented Nov 16, 2020

@yizhang-yiz sorry this took so long. Whenever you have time and if you are still interested, would you try it out and give your thoughts?

mmm, on the develop branch of cmdstan I'm getting

/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build/toolset.jam:44: in toolset.using
ERROR: rule "other.init" unknown in module "toolset".
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:543: in process-explicit-toolset-requests
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:610: in load
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/modules.jam:295: in import
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/bootstrap.jam:139: in boost-build
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/boost-build.jam:17: in module scope
other.jam: No such file or directory

when running

library(cmdstanr)
set_cmdstan_path("cmdstan")
cmdstan_make_local(cpp_options = list("CXX"="mpicxx", "TBB_CXX_TYPE"="gcc"))
rebuild_cmdstan(cores = 4)

@rok-cesnovar (Member Author)

You had TBB_CXX_TYPE=clang for your system, I believe. Might that be it?

@yizhang-yiz

You're right. gcc on macOS is just an alias of clang, but Boost got misled by the name.

@mitzimorris (Member)

Mostly due to the way jobs are submitted to the typical cluster via job submission scripts.

Why do you say this? People who are used to working in R or Python will use those languages to set up job submission accordingly - see the Discourse discussion https://discourse.mc-stan.org/t/correct-way-to-use-mpi-with-cmdstanpy/17667, where I wrote an example job submission script: https://discourse.mc-stan.org/t/correct-way-to-use-mpi-with-cmdstanpy/17667/2?u=mitzimorris

@yizhang-yiz

Funny, I'm still getting the same error with cmdstan master:

bash-3.2$ git branch
WARNING: terminal is not fully functional
-  (press RETURN)
  develop
* master
bash-3.2$ cat make/local
STAN_MPI=1
CXX=mpicxx
TBB_CXX_TYPE=clang
bash-3.2$ make clean-all;make -j4 build
rm -f -r test
rm -f 
rm -f 
rm -f 
rm -f 
  removing dependency files
rm -f    
rm -f   
rm -f   
  cleaning sundials targets
rm -f 
  cleaning mpi targets
rm -f 
rm -f -r stan/lib/stan_math/lib/boost_1.72.0/stage/lib stan/lib/stan_math/lib/boost_1.72.0/project-config.jam stan/lib/stan_math/lib/boost_1.72.0/b2 stan/lib/stan_math/lib/boost_1.72.0/bootstrap.log
  cleaning Intel TBB targets
rm -f -rf stan/lib/stan_math/lib/tbb
rm -f bin/stanc bin/stanc2 bin/stansummary bin/print bin/diagnose
rm -f -r src/cmdstan/main*.o bin/cmdstan
rm -f 
rm -f examples/bernoulli/bernoulli examples/bernoulli/bernoulli.o examples/bernoulli/bernoulli.d examples/bernoulli/bernoulli.hpp
rm -f -r stan/lib/stan_math/lib/boost_1.72.0/stage/lib stan/lib/stan_math/lib/boost_1.72.0/project-config.jam stan/lib/stan_math/lib/boost_1.72.0/b2 stan/lib/stan_math/lib/boost_1.72.0/bootstrap.log
curl -L https://github.com/stan-dev/stanc3/releases/download/nightly/mac-stanc -o bin/stanc --retry 5 --retry-delay 10
mpicxx -std=c++1y -D_REENTRANT -Wno-ignored-attributes    -Wno-delete-non-virtual-dtor  -I stan/lib/stan_math/lib/tbb_2019_U8/include   -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.7 -I stan/lib/stan_math/lib/boost_1.72.0 -I stan/lib/stan_math/lib/sundials_5.2.0/include    -DBOOST_DISABLE_ASSERTS        -c -fvisibility=hidden -o bin/cmdstan/stansummary.o src/cmdstan/stansummary.cpp
cd stan/lib/stan_math/lib/boost_1.72.0; ./bootstrap.sh
mpicxx -std=c++1y -D_REENTRANT -Wno-ignored-attributes    -Wno-delete-non-virtual-dtor  -I stan/lib/stan_math/lib/tbb_2019_U8/include   -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.7 -I stan/lib/stan_math/lib/boost_1.72.0 -I stan/lib/stan_math/lib/sundials_5.2.0/include    -DBOOST_DISABLE_ASSERTS        -c -fvisibility=hidden -o bin/cmdstan/print.o src/cmdstan/print.cpp
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
         Building Boost.Build engine with toolset clang...                         Dload  Upload   Total   Spent    Left  Speed
100   635  100   635    0     0   1951      0 --:--:-- --:--:-- --:--:--  1947
100 10.4M  100 10.4M    0     0  4810k      0  0:00:02  0:00:02 --:--:-- 8840k
chmod +x bin/stanc
mpicxx -std=c++1y -D_REENTRANT -Wno-ignored-attributes    -Wno-delete-non-virtual-dtor  -I stan/lib/stan_math/lib/tbb_2019_U8/include   -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.7 -I stan/lib/stan_math/lib/boost_1.72.0 -I stan/lib/stan_math/lib/sundials_5.2.0/include    -DBOOST_DISABLE_ASSERTS        -c -fvisibility=hidden -o bin/cmdstan/diagnose.o src/cmdstan/diagnose.cpp
tools/build/src/engine/b2
Detecting Python version... 2.7
Detecting Python root... /usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7
Unicode/ICU support for Boost.Regex?... not found.
Generating Boost.Build configuration in project-config.jam for clang...

Bootstrapping is done. To build, run:

    ./b2
    
To generate header files, run:

    ./b2 headers

To adjust configuration, edit 'project-config.jam'.
Further information:

   - Command line help:
     ./b2 --help
     
   - Getting started guide: 
     http://www.boost.org/more/getting_started/unix-variants.html
     
   - Boost.Build documentation:
     http://www.boost.org/build/

cd stan/lib/stan_math/lib/boost_1.72.0; ./b2  toolset=other --visibility=hidden --with-program_options cxxstd=11 variant=release link=static
other.jam: No such file or directory
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build/toolset.jam:44: in toolset.using
ERROR: rule "other.init" unknown in module "toolset".
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:543: in process-explicit-toolset-requests
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/build-system.jam:610: in load
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/modules.jam:295: in import
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/tools/build/src/kernel/bootstrap.jam:139: in boost-build
/Users/yiz/Work/temp/cmdstan/stan/lib/stan_math/lib/boost_1.72.0/boost-build.jam:17: in module scope
make: *** [stan/lib/stan_math/lib/boost_1.72.0/stage/lib/libboost_program_options.a] Error 1
make: *** Waiting for unfinished jobs....

@rok-cesnovar (Member Author)

You don't need to set stan_mpi before rebuilding. You can just use it when compiling the model. Will look into why it's failing on rebuild though.

@rok-cesnovar (Member Author)

As of 2.24 it's no longer required to rebuild cmdstan upon setting the MPI/threads/OpenCL flags. If you set them for a model, the main.o is rebuilt automatically.
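
For example, setting the flag per model (this mirrors the cmdstan_model() call in the summary above):

library(cmdstanr)
# compile this model with MPI support; main.o is rebuilt automatically
mod_mpi <- cmdstan_model("logistic1.stan", cpp_options = list(stan_mpi = TRUE))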

@yizhang-yiz

I don't think it's caused by rebuilding or STAN_MPI=1. With the following make/local the build fails.

CXX=mpicxx
TBB_CXX_TYPE=clang

@rok-cesnovar (Member Author)

Ok, let me check that. This error on make build doesn't actually prevent building and using models, it's just a problem for stansummary. We do need to check it out.

Hopefully I can replicate it on one of the machines I have access to and it's not a macOS-specific problem.

@yizhang-yiz commented Nov 17, 2020

Let me see if I can reproduce the issue on Ubuntu, one sec.

@rok-cesnovar marked this pull request as ready for review November 17, 2020 06:18
@yizhang-yiz

I can confirm that Linux (Ubuntu) builds fine.

@rok-cesnovar (Member Author)

why do you say this?

I take that back. That obviously works fine.

I can confirm that Linux (Ubuntu) builds fine.

Thanks!

Apart from figuring out the build issue (which is a cmdstan issue anyway), the other question is how to set up the arguments. There are at least the following options:
a)

mod$sample(..., mpi_cmd = "mpiexec", mpi_nprocess = 5, mpi_args = c(...))

b)

mod$mpi_sample(..., mpi_cmd = "mpiexec", mpi_nprocess = 5, mpi_args = c(...))

c)

mod$sample(..., mpi_cmd = "mpiexec", mpi_args = c("-n", 4, ...))

d)

mod$mpi_sample(..., mpi_cmd = "mpiexec", mpi_args = c("-n", 4, ...))

mpi_cmd could also be an R option (set with options("cmdstanr_mpi_cmd") or something), since it's most likely going to be the standardized mpiexec and used in the majority of cases, though some prefer mpirun. Or it can be a regular argument with the default "mpiexec" that most will not touch.
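
A minimal sketch of the R-option variant (the option name cmdstanr_mpi_cmd is only a placeholder here, not a decided API):

# resolve the MPI launcher from a global option, falling back to the standard mpiexec
mpi_cmd <- getOption("cmdstanr_mpi_cmd", default = "mpiexec")
# a user who prefers mpirun would set it once per session:
# options(cmdstanr_mpi_cmd = "mpirun")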

The other thing is whether we should separate the -n/-np argument (mpi_nprocess in the above example, name still TBD) from the rest of the optional MPI arguments, or just have those be one argument (mpi_args for example).

I do not think we actually need to separate MPI sampling into a separate function, so I would go with either a) or c), slightly preferring c) because we do not have to deal with someone defining both mpi_nprocess and -n in the args.

@jgabry (Member) commented Nov 17, 2020

slightly preferring c) because we do not have to deal with someone defining both mpi_nprocess and -n in the args.

If we don't have a separate method mpi_sample (or sample_mpi) then wouldn't we need an additional argument indicating whether to even try mpi? Or are you thinking that we would automatically try mpi if mpi_cmd and mpi_args are specified?

@mitzimorris What do you think about these options? If this is going to be implemented in CmdStanPy too then we should make sure to coordinate on this.

@yizhang-yiz

I'd pick a) because frequently the # of procs is all one supplies for MPI runs. It'll make the args slightly cleaner.

@rok-cesnovar (Member Author)

If we don't have a separate method mpi_sample (or sample_mpi) then wouldn't we need an additional argument indicating whether to even try mpi?

There are the following options:

  • the model was compiled with MPI and the user specifies MPI args -> fine

  • the model was compiled with MPI and the user does not specify MPI args -> runs without MPI (singleton MPI), works fine (and should per MPI standard)

  • the model was not compiled with MPI and the user specifies MPI args -> also works, but this is just running the same chain N times, so it's inefficient but not problematic.
    We can catch this and stop almost immediately or just leave it to complete. But we have this same problem even with a separate function. There would be a cleaner way if "Add ./model compile_info" cmdstan#887 were fixed (still waiting for review in "Add virtual function model_compile_info to the model_base.hpp" stan#2932).

  • the model was not compiled with MPI and the user does not specify MPI args -> fine

I'd pick a) because frequently the # of procs is all one supplies for MPI runs. It'll make the args slightly cleaner.

Cool. You definitely have waaay more experience running these, so I am going to trust your opinion here. Count me in for a) as well.

@jgabry (Member) commented Nov 17, 2020

Ok I'm also inclined to trust @yizhang-yiz's opinion since he has the most experience with MPI (I've never even tried using it!).

@mitzimorris What do you think about the proposed function signature?

@rok-cesnovar (Member Author) commented Nov 17, 2020

And the argument names would then be mpi_cmd (rarely used), mpi_n/mpi_np/mpi_nprocess (n/np isn't the most descriptive but is commonly known in the MPI world), and mpi_args (rarely used).

@yizhang-yiz

What we need to take into consideration is how this would interact with HPC task schedulers. Things like SLURM ask for the # of procs when a job is submitted, and this input should not conflict with what we put in a) (or c)). So maybe we can allow mpi_nprocess to be eclipsed. But then we'll need to come up with something informative to alert the user.
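
A hypothetical sketch of that interaction from the user's side, assuming SLURM exposes the allocation via the SLURM_NTASKS environment variable (the mpi_args = list("n" = 4) form in the comments is the one used later in this thread):

# inside a scheduler-managed job (e.g. sbatch --ntasks=4) the launcher can usually
# pick up the process count from the allocation, so the user could omit it there
# and only pass it explicitly for local runs
slurm_ntasks <- Sys.getenv("SLURM_NTASKS", unset = NA)
if (is.na(slurm_ntasks)) {
  message("No scheduler allocation detected; pass the process count explicitly, e.g. mpi_args = list(\"n\" = 4).")
} else {
  message("Using the ", slurm_ntasks, " processes allocated by SLURM.")
}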

@jgabry (Member) commented Nov 17, 2020

One reason I might slightly prefer a separate mpi_sample (or sample_mpi) method is because there seems to be a decent amount of documentation that we'll need just for the mpi stuff. So having a separate method could be much cleaner from a doc perspective (there's already a ton of doc for the existing sample method). That said, I suppose even if we just add arguments to the original sample method we could still have a separate doc page for mpi stuff. Not sure, but that's also something to consider.

@yizhang-yiz commented Nov 17, 2020

@jgabry has a point. An additional benefit of mpi_sample is that we'll be able to put complexities such as interaction with the cluster scheduler in a controlled environment, because I'm still not sure how map_rect or cross-chain warmup will evolve and how users would be using them. If we do this, maybe a catch-all mpi_args = c("-n", 4, ...) makes more sense. We can always add wrappers later on top of this bare-bones call.

@rok-cesnovar (Member Author)

Things like SLURM ask for the # of procs when a job is submitted

You mean like the user sets N but SLURM overrides that with M? Not sure if we can catch that in cmdstan(r). Apart from maybe running external commands?

seems to be a decent amount of documentation that we'll need just for the mpi stuff

Good point, yeah. We can start with mod$mpi_sample() then and see how things evolve. threads_per_chain is obviously useless for mpi_sample(). How about parallel_chains? Would anyone start 4 chains with N processes?

Thanks for the input @yizhang-yiz !

@yizhang-yiz

You mean like the user sets N but SLURM overrides that with M? Not sure if we can catch that in cmdstan(r).

I don't think we can, which is why we'll need to hand the decision to the user so they can choose which arg to provide to cmdstanr; if mpi_nprocess is not provided, we give a message but don't intervene.

@jgabry (Member) commented Dec 2, 2020

I just made a few small edits. This seems ready (thanks @rok-cesnovar) so approving now. But @yizhang-yiz if you have time can you try it out one more time and see if the doc is missing anything important?

@yizhang-yiz

Sorry guys I missed the thread yesterday. I can play with it later today.

@jgabry (Member) commented Dec 3, 2020

Thanks Yi!

@yizhang-yiz

Works like a charm! Thank you! @rok-cesnovar @jgabry

@rok-cesnovar (Member Author)

Thank you for the insight, discussion, and testing!

Will then go ahead and merge. For now, this will be available in the GitHub version, but we will most likely do a 0.2.3 or 0.3.0 release soon-ish.

@rok-cesnovar merged commit 291ceb3 into master Dec 4, 2020
@rok-cesnovar deleted the mpi branch December 4, 2020 10:37
@yizhang-yiz

So I decided to try this on the cross-chain warmup I was working on.

library("cmdstanr")
cmdstan_make_local(cpp_options = list("MPI_ADAPTED_WARMUP" = "1", "TBB_CXX_TYPE" = "clang"))
rebuild_cmdstan()
mod <- cmdstan_model("cmdstan/examples/eight_schools.stan", quiet = FALSE, force_recompile = TRUE)
f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200, output_dir = "cmdstan/examples/eight_schools")

Output:

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 iteration: 100 window: 1 / 1 Rhat: 1.0289 ESS: 149.4547 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 iteration: 200 window: 1 / 2 Rhat: 1.0006 ESS: 373.0270 
Chain 1 iteration: 200 window: 2 / 2 Rhat: 1.0002 ESS: 233.4405 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
Error: Supplied CSV file is corrupt!

The warmup algorithm works well, but it looks like cmdstan's I/O is taken over by R: a cmdstan run would return 4 CSV files from the 4 communicating chains, but in the working directory there's only one empty CSV. Any idea what's going on? @rok-cesnovar

@rok-cesnovar (Member Author)

Can you specify validate_csv = FALSE and see if the CSVs remain without the error? The CSVs might get deleted because of the error.

Will take a look; one of the reasons I was so keen on getting MPI into cmdstanr was testing this cross-chain warmup.

@yizhang-yiz

validate_csv = FALSE doesn't help. The output remains a single empty CSV file, while I expect 4 files, one for each chain. The easiest way to access this experimental feature is to use the Torsten repo:
https://github.com/metrumresearchgroup/Torsten

and set cmdstan path to Torsten/cmdstan.
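
For example (the local clone path below is just a placeholder):

library(cmdstanr)
# point cmdstanr at the CmdStan that ships inside the Torsten clone
set_cmdstan_path("~/Work/Torsten/cmdstan")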

@rok-cesnovar (Member Author)

Ok, thanks, will take a look now. How do the files differ? By suffix? Say I give output file = test.csv

@yizhang-yiz

How do the files differ?

I don't see anything different:

> f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200,output_dir="cmdstan/examples/eight_schools",validate_csv=FALSE)
Running MCMC with 1 chain...
...
Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
> f$output_files()
[1] "/Users/yiz/Work/Torsten/cmdstan/examples/eight_schools/eight_schools-202012080911-1-818735.csv"
> f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200,output_dir="cmdstan/examples/eight_schools")
Running MCMC with 1 chain...
...
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.1 seconds.
Error: Supplied CSV file is corrupt!
> f$output_files()
[1] "/Users/yiz/Work/Torsten/cmdstan/examples/eight_schools/eight_schools-202012080911-1-818735.csv"

While in cmdstan I'm getting

bash-3.2$ make -j4 examples/eight_schools/eight_schools && cd examples/eight_schools/
bash-3.2$ mpiexec -n 4 -l ./eight_schools sample data file=eight_schools.data.R
...
[0]  Elapsed Time: 0.016 seconds (Warm-up)
[0]                0.07 seconds (Sampling)
[0]                0.086 seconds (Total)
[0] 
bash-3.2$ ls *.csv
mpi.0.output.csv	mpi.1.output.csv	mpi.2.output.csv	mpi.3.output.csv

@rok-cesnovar (Member Author)

Thanks, I was asking about mpi.0.output.csv, mpi.1.output.csv, mpi.2.output.csv, mpi.3.output.csv, but worded it weirdly.

@rok-cesnovar (Member Author) commented Dec 8, 2020

This is the command that is used:

mpiexec -n 4 /home/rok/Desktop/test_models/eight_schools/eight_schools 'id=1' random 'seed=208921277' data 'file=/home/rok/Desktop/test_models/eight_schools/schools.data.json' output 'file=/home/rok/Desktop/test_models/eight_schools/eight_schools-202012081831-1-6f6627.csv' 'refresh=200' 'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1'

Running this in the command line produces the same thing (1 CSV).

@rok-cesnovar (Member Author)

You can install remotes::install_github("stan-dev/cmdstanr@echomd"), which will print the command.

@yizhang-yiz

Great! Thanks. Then that's likely caused by bugs in my code.

@rok-cesnovar (Member Author)

I am not so sure just yet.

@yizhang-yiz

I can confirm that replacing the output file in the above command with output 'file=eight_schools-202012081831-1-6f6627.csv' makes it work:

bash-3.2$ mpiexec -n 4 ./eight_schools 'id=1' random 'seed=208921277' data 'file=eight_schools.data.R' output 'file=eight_schools-202012081831-1-6f6627.csv' 'refresh=200' 'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt 'engaged=1' &> out.log
bash-3.2$ ls *.csv
mpi.0.eight_schools-202012081831-1-6f6627.csv	mpi.2.eight_schools-202012081831-1-6f6627.csv
mpi.1.eight_schools-202012081831-1-6f6627.csv	mpi.3.eight_schools-202012081831-1-6f6627.csv

so I must have messed up the ostream path.

@rok-cesnovar (Member Author)

I would say this is the culprit, yes, if the file is specified with an absolute path:

data
  file = /home/rok/Desktop/test_models/eight_schools/schools.data.json
init = 2 (Default)
random
  seed = 106570872
output
  file = mpi.0./home/rok/Desktop/test_models/eight_schools/eight_schools-202012081852-1-13cb4f.csv
  diagnostic_file =  (Default)
  refresh = 200
  sig_figs = -1 (Default)
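
In other words, the per-rank prefix ends up prepended to the whole absolute path instead of to the file name. A minimal R illustration of the difference (this snippet is purely illustrative, not code from this PR or from Torsten):

path <- "/home/rok/Desktop/test_models/eight_schools/eight_schools-202012081852-1-13cb4f.csv"
rank <- 0
# broken: prefixing the whole path gives "mpi.0./home/rok/..."
paste0("mpi.", rank, ".", path)
# working: keep the directory and tag only the file name
file.path(dirname(path), paste0("mpi.", rank, ".", basename(path)))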

@rok-cesnovar mentioned this pull request Dec 8, 2020
@yizhang-yiz commented Dec 8, 2020

Just fixed it, now it works

f <- mod$sample_mpi(data = "cmdstan/examples/eight_schools/eight_schools.data.R", chains = 1, mpi_args = list("n" = 4), refresh = 200,output_dir="cmdstan/examples/eight_schools",validate_csv=FALSE)
Running MCMC with 1 chain...

Chain 1         stepsize_jitter = 0 (Default) 
Chain 1 id = 1 
Chain 1 data 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools.data.R 
Chain 1 init = 2 (Default) 
Chain 1 random 
Chain 1   seed = 604574839 
Chain 1 output 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.mpi.1.csv 
Chain 1   diagnostic_file =  (Default) 
Chain 1   refresh = 200 
Chain 1   sig_figs = -1 (Default) 
Chain 1       num_cross_chains = 4 (Default) 
Chain 1       cross_chain_window = 100 (Default) 
Chain 1       cross_chain_rhat = 1.05 (Default) 
Chain 1       cross_chain_ess = 200 (Default) 
Chain 1     algorithm = hmc (Default) 
Chain 1       hmc 
Chain 1         engine = nuts (Default) 
Chain 1           nuts 
Chain 1             max_depth = 10 (Default) 
Chain 1         metric = diag_e (Default) 
Chain 1         metric_file =  (Default) 
Chain 1         stepsize = 1 (Default) 
Chain 1         stepsize_jitter = 0 (Default) 
Chain 1 id = 2 
Chain 1 data 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools.data.R 
Chain 1 init = 2 (Default) 
Chain 1 random 
Chain 1   seed = 604574839 
Chain 1 output 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.mpi.2.csv 
Chain 1   diagnostic_file =  (Default) 
Chain 1   refresh = 200 
Chain 1   sig_figs = -1 (Default) 
Chain 1       t0 = 10 (Default) 
Chain 1       init_buffer = 75 (Default) 
Chain 1       term_buffer = 50 (Default) 
Chain 1       window = 25 (Default) 
Chain 1       num_cross_chains = 4 (Default) 
Chain 1       cross_chain_window = 100 (Default) 
Chain 1       cross_chain_rhat = 1.05 (Default) 
Chain 1       cross_chain_ess = 200 (Default) 
Chain 1     algorithm = hmc (Default) 
Chain 1       hmc 
Chain 1         engine = nuts (Default) 
Chain 1           nuts 
Chain 1             max_depth = 10 (Default) 
Chain 1         metric = diag_e (Default) 
Chain 1         metric_file =  (Default) 
Chain 1         stepsize = 1 (Default) 
Chain 1         stepsize_jitter = 0 (Default) 
Chain 1 id = 3 
Chain 1 data 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools.data.R 
Chain 1 init = 2 (Default) 
Chain 1 random 
Chain 1   seed = 604574839 
Chain 1 output 
Chain 1   file = /Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.mpi.3.csv 
Chain 1   diagnostic_file =  (Default) 
Chain 1   refresh = 200 
Chain 1   sig_figs = -1 (Default) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 iteration: 100 window: 1 / 1 Rhat: 1.0212 ESS: 79.6405 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
Chain 1 iteration: 200 window: 1 / 2 Rhat: 1.0181 ESS: 170.9020 
Chain 1 iteration: 200 window: 2 / 2 Rhat: 1.0141 ESS: 135.0918 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 iteration: 300 window: 1 / 3 Rhat: 1.0104 ESS: 310.1229 
Chain 1 iteration: 300 window: 2 / 3 Rhat: 1.0052 ESS: 275.0363 
Chain 1 iteration: 300 window: 3 / 3 Rhat: 1.0006 ESS: 144.5106 
Chain 1 cross-chain adaptation time: 0 seconds 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
...
Chain 1 finished in 0.1 seconds.

Though there's still a dummy CSV file generated and pointed to by output_files():

f$output_files()
[1] "/Users/yiz/Work/cmdstan/examples/eight_schools/eight_schools-202012081145-1-3326b5.csv"

In addition, what's the best way to add custom options to the sample_mpi call for the following?

mpiexec -n 4 ./eight_schools sample adapt cross_chain_ess=400 data file...

Here cross_chain_ess is an extra option under the adapt family.

@rok-cesnovar (Member Author) commented Dec 8, 2020

Just fixed it, now it works

Yay!

Though there's still a dummy CSV file generated and pointed to by

Hm, that would probably be because of https://github.com/stan-dev/cmdstanr/blob/master/R/args.R#L99
I am not actually sure we need that but would have to check.

In addition, what's the best way to add custom options to sample_mpi call for the following?

See commit dbee414 and just duplicate for other args :)
