parallel processing hangs up on run_swat2012 #38

Closed
seyounger opened this issue Jul 30, 2021 · 10 comments

@seyounger

I was previously running many SWAT2012 iterations on multiple cores without issue, but now my workers are getting hung up. This happens with the demo project as well as with my own projects. Sometimes I get errors like "forrtl: The operation cannot be performed on a file with a user-mapped section open." Other times there is no error; the run just gets stuck in an infinite loop, and the workers keep running even after I close R.

For example, the run below hangs about 30 percent of the time. It usually reports performing simulation 308 of 400 and never advances; sometimes it hangs on the last run. Hang-ups seem more frequent with more complicated parameter sets.

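library(tibble)
library(SWATplusR)

# demo_path2012 is the path to the SWAT2012 demo project, defined earlier in the session
par_single <- tibble("CN2.mgt|change = absval" = runif(400, 35, 98))

q_fast_2012 <- run_swat2012(project_path = demo_path2012,
                            output = define_output(file = "rch",
                                                   variable = "FLOW_OUT",
                                                   unit = 1),
                            parameter = par_single,
                            n_thread = 4)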

image

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] sensitivity_1.26.0 lubridate_1.7.10 fast_0.64 sf_1.0-1
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[9] readr_2.0.0 tidyr_1.1.3 tibble_3.1.3 ggplot2_3.3.5
[13] tidyverse_1.3.1 hydroGOF_0.4-0 zoo_1.8-9 SWATplusR_0.3.5.1

loaded via a namespace (and not attached):
[1] fs_1.5.0 xts_0.12.1 bit64_4.0.5 httr_1.4.2
[5] tools_4.1.0 backports_1.2.1 utf8_1.2.2 R6_2.5.0
[9] KernSmooth_2.23-20 DBI_1.1.1 colorspace_2.0-2 withr_2.4.2
[13] sp_1.4-5 tidyselect_1.1.1 processx_3.5.2 hydroTSM_0.6-0
[17] bit_4.0.4 compiler_4.1.0 automap_1.0-14 cli_3.0.1
[21] rvest_1.0.1 gstat_2.0-7 xml2_1.3.2 scales_1.1.1
[25] classInt_0.4-3 proxy_0.4-26 digest_0.6.27 foreign_0.8-81
[29] rmarkdown_2.9 pkgconfig_2.0.3 htmltools_0.5.1.1 dbplyr_2.1.1
[33] fastmap_1.1.0 rlang_0.4.11 readxl_1.3.1 numbers_0.8-2
[37] rstudioapi_0.13 RSQLite_2.2.7 FNN_1.1.3 generics_0.1.0
[41] jsonlite_1.7.2 magrittr_2.0.1 Rcpp_1.0.7 munsell_0.5.0
[45] fansi_0.5.0 lifecycle_1.0.0 stringi_1.7.3 yaml_2.2.1
[49] plyr_1.8.6 grid_4.1.0 maptools_1.1-1 blob_1.2.2
[53] parallel_4.1.0 crayon_1.4.1 doSNOW_1.0.19 lattice_0.20-44
[57] haven_2.4.1 hms_1.1.0 knitr_1.33 ps_1.6.0
[61] pillar_1.6.1 boot_1.3-28 spacetime_1.2-5 codetools_0.2-18
[65] reprex_2.0.0 glue_1.4.2 evaluate_0.14 modelr_0.1.8
[69] vctrs_0.3.8 tzdb_0.1.2 foreach_1.5.1 cellranger_1.1.0
[73] gtable_0.3.0 reshape_0.8.8 assertthat_0.2.1 cachem_1.0.5
[77] xfun_0.24 broom_0.7.9 e1071_1.7-7 class_7.3-19
[81] snow_0.4-3 intervals_0.15.2 iterators_1.0.13 memoise_2.0.0
[85] units_0.7-2 ellipsis_0.3.2

@seyounger changed the title from "parallel processing hangs up on run_swat_2012" to "parallel processing hangs up on run_swat2012" on Jul 30, 2021
@chrisschuerz
Owner

Hi @seyounger, this error is new to me. A few things that I can think of:

  • If you re-run the same thread folders too quickly after cancelling an R session, the old processes can still be active and block the new session (this was the case with older R versions, e.g. 2.x, but I have not encountered it for a while, and the messages were different then).
  • Your Windows 10 antivirus could be blocking the dll. At least that is what I saw in some Stack Overflow threads when looking through them; I cannot confirm it. What I usually do is run the SWAT projects from a drive or folder that is not monitored by the antivirus.
  • Newer versions of the R package dependencies could cause this behavior. Quickly skimming through, I saw that some of the R packages for parallel computing that you use are more up to date than mine. I can check that by updating all my packages.

You could check the first two points; I can try the last one and see whether I can reproduce your issue. Otherwise it will be difficult for me to find a solution.
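
For the first point, one way to look for processes left over from a cancelled session is sketched below. It is only an illustration: it assumes the ps package is installed, `find_processes()` is a hypothetical helper, and the pattern "swat" is a guess at the executable name that has to be adapted to the actual SWAT executable.

```r
library(ps)  # assumption: the ps package is installed

# Hypothetical helper: list running processes whose executable name
# matches `pattern` (case-insensitive regular expression).
find_processes <- function(pattern) {
  procs <- ps::ps()
  hits <- grepl(pattern, procs$name, ignore.case = TRUE)
  procs[hits, c("pid", "name")]
}

# "swat" is an assumed name pattern; replace it with the name of your
# SWAT executable.
leftover <- find_processes("swat")
print(leftover)
```

If this lists processes while no simulation is supposed to be running, they are likely leftovers that should be ended before the thread folders are reused.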

@seyounger
Author

Thanks, @chrisschuerz, this is helpful.

Can you tell me what versions of packages SWATplusR is built on? I tried to find that info but didn't know where to look.

@chrisschuerz
Owner

In the DESCRIPTION file of the R package I usually list the versions of the dependencies that I built and tested the package with. This is to prevent users from running much older versions of the dependencies. But of course newer versions could also cause a mess. I will update my packages and see if I run into trouble as well.
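
A minimal sketch of how the versions declared in a DESCRIPTION file could be compared with the installed ones, using only base R. `dependency_versions()` is a hypothetical helper, not part of SWATplusR, and it assumes the package being inspected is installed.

```r
# Hypothetical helper: compare the versions declared in a package's
# DESCRIPTION (Depends and Imports) with the versions currently installed.
dependency_versions <- function(pkg, fields = c("Depends", "Imports")) {
  desc <- utils::packageDescription(pkg, fields = fields)
  # Collect all entries of the form "name" or "name (>= x.y.z)"
  entries <- unlist(lapply(fields, function(f) {
    value <- desc[[f]]
    if (is.null(value) || is.na(value)) return(character(0))
    trimws(strsplit(value, ",")[[1]])
  }))
  entries <- entries[nzchar(entries)]
  # Package name is everything before the optional "(...)" part
  name <- sub("\\s*\\(.*$", "", entries)
  # Declared requirement is the text inside the parentheses, if any
  declared <- ifelse(grepl("\\(", entries),
                     gsub("^.*\\(\\s*|\\s*\\)\\s*$", "", entries),
                     NA_character_)
  # Installed version; NA if the dependency is not installed
  installed <- vapply(name, function(p) {
    if (p == "R") return(as.character(getRversion()))
    tryCatch(as.character(utils::packageVersion(p)),
             error = function(e) NA_character_)
  }, character(1))
  data.frame(package = name, declared = declared, installed = installed,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Demonstrated on a base package; use "SWATplusR" to inspect its dependencies.
deps <- dependency_versions("stats")
print(deps)
```

Rows where `installed` is much newer than `declared` are the candidates to look at when behavior changes after an update.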

@chrisschuerz
Owner

@seyounger A quick update: I just updated all my R package dependencies and ran the demo model with 8 parameter combinations on 4 cores. I did indeed encounter an issue with the updated R packages. Unfortunately it was a different one from what you described above, so I am not sure whether the fix I just made for the issue I found is of any help to you.

@seyounger
Author

Thanks for checking. I installed the updated version, but that didn't help. Then I set all my packages to the same versions as in the DESCRIPTION file, or as close to them as possible, but I'm still experiencing hang-ups with no message at all; the run just stops. I'm now having the same issue with single-core optimization runs as well. They go for several hundred runs but hang before completion. Let me know if there is anything else I can do to help find the problem. Here is my current session info.

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] hydromad_0.9-26     reshape_0.8.8       polynom_1.4-0       latticeExtra_0.6-29
 [5] lattice_0.20-41     sensitivity_1.26.0  purrr_0.3.4         forcats_0.5.1      
 [9] fast_0.64           plotly_4.9.4.1      beepr_1.3           lubridate_1.7.10   
[13] ggrepel_0.9.1       sf_1.0-2            ggplot2_3.3.5       dplyr_1.0.7        
[17] tidyr_1.1.3         hydroGOF_0.4-0      zoo_1.8-9           SWATplusR_0.3.5.1  

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2   ellipsis_0.3.2     class_7.3-18       rio_0.5.27        
  [5] htmlTable_2.2.1    numbers_0.8-2      base64enc_0.1-3    rstudioapi_0.13   
  [9] proxy_0.4-26       audio_0.1-7        bit64_4.0.5        fansi_0.5.0       
 [13] codetools_0.2-18   splines_4.0.5      cachem_1.0.6       knitr_1.33        
 [17] Formula_1.2-4      jsonlite_1.7.2     cluster_2.1.1      dbplyr_2.1.1      
 [21] png_0.1-7          readr_2.0.1        compiler_4.0.5     httr_1.4.2        
 [25] backports_1.2.1    assertthat_0.2.1   Matrix_1.3-2       fastmap_1.1.0     
 [29] lazyeval_0.2.2     cli_3.0.1          htmltools_0.5.1.1  tools_4.0.5       
 [33] gtable_0.3.0       glue_1.4.2         Rcpp_1.0.7         carData_3.0-4     
 [37] cellranger_1.1.0   vctrs_0.3.8        iterators_1.0.13   xfun_0.25         
 [41] stringr_1.4.0      ps_1.6.0           openxlsx_4.2.4     lifecycle_1.0.0   
 [45] scales_1.1.1       gstat_2.0-7        hms_1.1.0          doSNOW_1.0.19     
 [49] parallel_4.0.5     RColorBrewer_1.1-2 curl_4.3.2         gridExtra_2.3     
 [53] memoise_2.0.0      rpart_4.1-15       stringi_1.7.3      RSQLite_2.2.8     
 [57] maptools_1.1-1     foreach_1.5.1      checkmate_2.0.0    e1071_1.7-8       
 [61] boot_1.3-27        zip_2.2.0          intervals_0.15.2   rlang_0.4.11      
 [65] pkgconfig_2.0.3    htmlwidgets_1.5.3  bit_4.0.4          processx_3.5.2    
 [69] tidyselect_1.1.1   hydroTSM_0.6-0     plyr_1.8.6         magrittr_2.0.1    
 [73] R6_2.5.1           snow_0.4-3         generics_0.1.0     Hmisc_4.5-0       
 [77] automap_1.0-14     DBI_1.1.1          pillar_1.6.2       haven_2.4.3       
 [81] foreign_0.8-81     withr_2.4.2        units_0.7-2        xts_0.12.1        
 [85] nnet_7.3-15        survival_3.2-10    abind_1.4-5        sp_1.4-5          
 [89] tibble_3.1.3       spacetime_1.2-5    crayon_1.4.1       car_3.0-11        
 [93] KernSmooth_2.23-18 utf8_1.2.2         tzdb_0.1.2         jpeg_0.1-9        
 [97] grid_4.0.5         readxl_1.3.1       data.table_1.14.0  blob_1.2.2        
[101] FNN_1.1.3          digest_0.6.27      classInt_0.4-3     munsell_0.5.0     
[105] viridisLite_0.4.0 

@EdbertoLima
Contributor

EdbertoLima commented Aug 31, 2021

Hello @seyounger,

I am having the same issue. I believe the problem is one of the packages that handle the parallelization of the process. Mine just stopped working after a few updates, and even after downgrading the package I could not find a solution.

As a temporary workaround I switched to Microsoft R Open 4.0.2, and so far so good. I cannot explain the technical details, but I hope you can also find a solution for your problem.

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 tibble_3.1.4 ggplot2_3.3.5
[6] tidyverse_1.3.0 SWATplusR_0.3.5.1 readr_1.3.1 dplyr_1.0.7 tidyr_1.1.3
[11] lubridate_1.7.10 lhs_1.1.1 hydroGOF_0.4-0 zoo_1.8-9 pasta_0.1.0
[16] here_1.0.1 RevoUtils_11.0.2 RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] fs_1.5.0 xts_0.12.1 bit64_4.0.5 RColorBrewer_1.1-2 httr_1.4.2 rprojroot_2.0.2
[7] tools_4.0.2 backports_1.2.1 utf8_1.2.2 R6_2.5.1 DBI_1.1.1 colorspace_2.0-2
[13] withr_2.4.2 sp_1.4-5 tidyselect_1.1.1 processx_3.4.5 hydroTSM_0.6-0 bit_4.0.4
[19] compiler_4.0.2 automap_1.0-14 cli_3.0.1 rvest_1.0.1 gstat_2.0-7 xml2_1.3.2
[25] scales_1.1.1 proxy_0.4-26 digest_0.6.27 foreign_0.8-80 rmarkdown_2.10 pkgconfig_2.0.3
[31] htmltools_0.5.2 dbplyr_1.4.4 fastmap_1.1.0 rlang_0.4.11 readxl_1.3.1 rstudioapi_0.13
[37] RSQLite_2.2.8 FNN_1.1.3 generics_0.1.0 jsonlite_1.7.2 magrittr_2.0.1 Rcpp_1.0.7
[43] munsell_0.5.0 fansi_0.5.0 lifecycle_1.0.0 stringi_1.7.4 yaml_2.2.1 plyr_1.8.6
[49] grid_4.0.2 maptools_1.1-1 blob_1.2.2 parallel_4.0.2 crayon_1.4.1 doSNOW_1.0.16
[55] lattice_0.20-41 haven_2.4.3 hms_1.1.0 knitr_1.33 ps_1.6.0 pillar_1.6.2
[61] spacetime_1.2-5 codetools_0.2-16 reprex_2.0.1 glue_1.4.2 evaluate_0.14 modelr_0.1.8
[67] vctrs_0.3.8 foreach_1.4.4 cellranger_1.1.0 gtable_0.3.0 reshape_0.8.8 assertthat_0.2.1
[73] cachem_1.0.6 xfun_0.25 broom_0.7.9 e1071_1.7-8 class_7.3-17 snow_0.4-3
[79] intervals_0.15.2 iterators_1.0.13 tinytex_0.33 memoise_2.0.0 ellipsis_0.3.2

@seyounger
Author

seyounger commented Sep 2, 2021

Thanks @EdbertoLima

I tried your method but was unable to install SWATplusR under that version of MRO due to a dependency incompatibility: processx 3.4.5 is required, but only 3.4.3 is available.

Screenshot 2021-09-02 110003

@seyounger seyounger reopened this Sep 2, 2021
@seyounger
Author

Okay, I got processx 3.4.5 installed by downloading the tar.gz from the archive and am running under MRO 4.0.2. I'll report back on whether this works for me.

@seyounger
Author

The workaround from @EdbertoLima to use MRO 4.0.2 seems to work. I've tested it successfully on 3 Windows machines. Now that I have a working library, I have archived it in case anything changes, and I would recommend that anyone with a working library back it up, because it may not stay that way. I'm leaving this issue open because this is a workaround, not a fix. I wish I understood the code structure better so that I could help troubleshoot.
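
Alongside archiving the library folder itself, the package versions of a working setup can be recorded so that it can be rebuilt later. The following is a base R sketch of that complementary step, not the archiving method used above; `snapshot_library()` is a hypothetical helper and the output file is a temporary file chosen for illustration.

```r
# Hypothetical helper: write the names and versions of all installed
# packages to a CSV file and return them invisibly.
snapshot_library <- function(file) {
  pkgs <- utils::installed.packages()[, c("Package", "Version"), drop = FALSE]
  pkgs <- as.data.frame(pkgs, stringsAsFactors = FALSE)
  rownames(pkgs) <- NULL
  utils::write.csv(pkgs, file, row.names = FALSE)
  invisible(pkgs)
}

# A temporary file is used here; in practice, pick a path next to the project.
snapshot_file <- tempfile(fileext = ".csv")
snapshot <- snapshot_library(snapshot_file)
head(snapshot)
```

The CSV only documents which versions worked; it does not replace a copy of the library folder when the exact binaries matter.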

@chrisschuerz
Owner

chrisschuerz commented Nov 15, 2021

@seyounger and @EdbertoLima, sorry that it took me so long to figure out what is going on with parallel processing. It took me a while just to trigger the issue systematically so that I could work on it. But now that I have updated R to version 4.1.2, and with it all R packages, I can trigger the issue very regularly.
The issue is caused by read and write commands from the readr package (in my case version 2.0.1). In particular, the functions read_fwf(), read_lines(), and write_lines() locked the files that they read or wrote. For output files this meant that a SWAT run could not overwrite e.g. output.rch, because the file was still locked from the previous iteration. Locked input files could not be rewritten, which caused one or more of the parallel threads to get stuck. There is some info on that here and here.
The readr::read_*() functions have an option to disable lazy reading (lazy = FALSE), which solves the problem on Windows. readr::write_lines() is now replaced by base::writeLines(), which is a bit slower but solves the issue.
You can now test the fix by installing from the development branch fix_#38 with the command remotes::install_github('chrisschuerz/SWATplusR', ref = 'fix_#38'). I will soon integrate this fix into the main branch as the new SWATplusR version 0.5.

Edit: The new SWATplusR 0.5 on the main branch now includes the changes of fix_#38.
I hope this really is the complete fix for that issue. Please keep me posted on what worked and what didn't.
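
The read-then-overwrite pattern described in this comment can be sketched as follows. This is a simplified illustration, not the SWATplusR code: it assumes readr >= 2.0 (where the `lazy` argument exists), and the file name, the column layout, and the values are made up.

```r
library(readr)  # assumption: readr >= 2.0, which provides the `lazy` argument

# Hypothetical stand-in for an output file such as output.rch
out_file <- tempfile(fileext = ".rch")

# Iteration 1: write the file with base R and read it eagerly.
# With lazy = FALSE the file is read completely right away.
base::writeLines(c("REACH    1   12.5", "REACH    2   30.1"), out_file)
run_1 <- read_lines(out_file, lazy = FALSE)

# Iteration 2: overwrite the same file, as the next simulation would,
# then parse it as a fixed-width table (widths are made up for this example).
base::writeLines("REACH    1   99.9", out_file)
run_2 <- read_fwf(out_file,
                  col_positions = fwf_widths(c(5, 5, 7),
                                             c("type", "id", "flow")),
                  lazy = FALSE,
                  show_col_types = FALSE)
print(run_2)
```

The point of the sketch is the order of operations: every read is eager, so nothing from iteration 1 still refers to the file when iteration 2 overwrites it.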

chrisschuerz added a commit that referenced this issue Nov 15, 2021
Issue #38 was introduced with an update of `readr` to
version 2.0. `readr` 2.0 introduced lazy reading, which
locked input and output files on Windows. This resulted
in the simulation hanging.
lazy = FALSE was set in readr::read_lines and
readr::read_fwf, readr::write_lines was replaced with
base::writeLines.