[R-package] R package crashes on windows when loaded together with {fansi} or anything that depends on it #4464

Closed
dfalbel opened this issue Jul 10, 2021 · 25 comments · Fixed by #4494 or #4496
@dfalbel

dfalbel commented Jul 10, 2021

This is probably related to:

Description

Using lightgbm while parsnip is loaded crashes the R session with: Exited with status -1073741819.
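For context, the exit status can be read as an unsigned 32-bit Windows status code. The sketch below uses a hypothetical helper, `as_ntstatus_hex` (not part of any package), to do the conversion. The result, 0xC0000005, is the code Windows uses for an access violation, which matches the segmentation fault reported later in this thread.

```r
# Hypothetical helper: render a signed 32-bit exit status as the
# unsigned hexadecimal form used for Windows status codes.
as_ntstatus_hex <- function(status) {
  # Map negative values into the unsigned 32-bit range.
  unsigned <- if (status < 0) status + 2^32 else status
  # Split into 16-bit halves so that each half fits in an R integer.
  hi <- as.integer(unsigned %/% 65536)
  lo <- as.integer(unsigned %% 65536)
  sprintf("0x%04X%04X", hi, lo)
}

as_ntstatus_hex(-1073741819)
```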

Reproducible example

Calling:

library(parsnip)
library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression",
    metric = "l2"
  ),
  data = dtrain
)

Environment info

I am using the dev version of LightGBM as suggested in #4007 (comment)
The error only occurs on Windows.

Here's a GitHub actions run that shows the behavior.
This shows that it works fine if parsnip is not loaded: https://github.com/curso-r/treesnip/runs/3037580458?check_suite_focus=true#step:9:1
And this one shows the error message: https://github.com/curso-r/treesnip/runs/3037580458?check_suite_focus=true#step:10:21

I could also reproduce it locally on a Windows machine, but I am not sure what's the best way to get a stack trace.
Let me know if I can help with further debugging.

@jameslamb
Collaborator

Thanks for the report and for using LightGBM @dfalbel !

I'll look into this as soon as possible.

@jameslamb jameslamb changed the title R package crashes on windows when loaded together with parsnip [R-package] R package crashes on windows when loaded together with parsnip Jul 11, 2021
@shiyu1994
Collaborator

Thanks for reporting that!
I tested the example on my Windows 10 machine, but could not reproduce the error with the latest master of LightGBM. The script runs successfully and gives the correct output.

D:\Projects\Test-LightGBM\issues\4464>Rscript test.R
Loading required package: R6
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000203 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 107
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000235 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 107
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000233 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 214
[LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 107
[LightGBM] [Info] Start training from score 0.479503
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Start training from score 0.486872
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] Start training from score 0.479963
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[1]:  valid's l2:0.20319+0.000285721"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[2]:  valid's l2:0.165525+0.000574973"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[3]:  valid's l2:0.134908+0.000749561"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[4]:  valid's l2:0.110093+0.000922039"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[5]:  valid's l2:0.0899506+0.00102425"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[6]:  valid's l2:0.0736391+0.00109707"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[7]:  valid's l2:0.0603161+0.00110421"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[8]:  valid's l2:0.0495564+0.0011122"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[9]:  valid's l2:0.0407735+0.00109775"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[10]:  valid's l2:0.0334856+0.000970481"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[11]:  valid's l2:0.0275363+0.000849414"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[12]:  valid's l2:0.0226964+0.000810414"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[13]:  valid's l2:0.0187499+0.000778224"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[14]:  valid's l2:0.0155632+0.000752949"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[15]:  valid's l2:0.0129092+0.000674334"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[16]:  valid's l2:0.0107217+0.00059569"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[17]:  valid's l2:0.00896862+0.000529726"
[1] "[18]:  valid's l2:0.00752793+0.000477273"
[1] "[19]:  valid's l2:0.00635069+0.000426635"
[1] "[20]:  valid's l2:0.00538788+0.000374957"
[1] "[21]:  valid's l2:0.00457729+0.000358718"
[1] "[22]:  valid's l2:0.00392614+0.000342183"
[1] "[23]:  valid's l2:0.00336796+0.000321502"
[1] "[24]:  valid's l2:0.00293595+0.000296263"
[1] "[25]:  valid's l2:0.00256147+0.000286571"
[1] "[26]:  valid's l2:0.00226154+0.000270801"
[1] "[27]:  valid's l2:0.00200893+0.000267164"
[1] "[28]:  valid's l2:0.00180272+0.000258289"
[1] "[29]:  valid's l2:0.00162937+0.000243161"
[1] "[30]:  valid's l2:0.00147082+0.000247366"
[1] "[31]:  valid's l2:0.00135177+0.000243783"
[1] "[32]:  valid's l2:0.00123631+0.000236365"
[1] "[33]:  valid's l2:0.00115095+0.000232943"
[1] "[34]:  valid's l2:0.00108048+0.000230003"
[1] "[35]:  valid's l2:0.00100614+0.000221995"
[1] "[36]:  valid's l2:0.000946318+0.000225411"
[1] "[37]:  valid's l2:0.000897226+0.000226746"
[1] "[38]:  valid's l2:0.00083996+0.000221873"
[1] "[39]:  valid's l2:0.00080131+0.000214173"
[1] "[40]:  valid's l2:0.000766308+0.000204982"
[1] "[41]:  valid's l2:0.000739083+0.000206319"
[1] "[42]:  valid's l2:0.000703267+0.000217342"
[1] "[43]:  valid's l2:0.000665415+0.000219365"
[1] "[44]:  valid's l2:0.000628061+0.000211541"
[1] "[45]:  valid's l2:0.000592348+0.000206586"
[1] "[46]:  valid's l2:0.000559744+0.000202899"
[1] "[47]:  valid's l2:0.000523506+0.000197195"
[1] "[48]:  valid's l2:0.00049955+0.000193268"
[1] "[49]:  valid's l2:0.000474157+0.00019095"
[1] "[50]:  valid's l2:0.000456001+0.000187329"
[1] "[51]:  valid's l2:0.000435425+0.000185016"
[1] "[52]:  valid's l2:0.000418417+0.000177465"
[1] "[53]:  valid's l2:0.000406039+0.000170052"
[1] "[54]:  valid's l2:0.000389491+0.000166786"
[1] "[55]:  valid's l2:0.00037612+0.000163033"
[1] "[56]:  valid's l2:0.000366619+0.000158277"
[1] "[57]:  valid's l2:0.000352018+0.000152857"
[1] "[58]:  valid's l2:0.00034011+0.000147394"
[1] "[59]:  valid's l2:0.000328484+0.000141576"
[1] "[60]:  valid's l2:0.000315826+0.000136087"
[1] "[61]:  valid's l2:0.000306009+0.000131264"
[1] "[62]:  valid's l2:0.000295355+0.000126489"
[1] "[63]:  valid's l2:0.000285643+0.000121594"
[1] "[64]:  valid's l2:0.000274675+0.000114519"
[1] "[65]:  valid's l2:0.00026645+0.000109754"
[1] "[66]:  valid's l2:0.000257835+0.000106091"
[1] "[67]:  valid's l2:0.000248925+0.000100773"
[1] "[68]:  valid's l2:0.00024091+9.73169e-05"
[1] "[69]:  valid's l2:0.00023334+9.3604e-05"
[1] "[70]:  valid's l2:0.00022485+8.88266e-05"
[1] "[71]:  valid's l2:0.000218256+8.58562e-05"
[1] "[72]:  valid's l2:0.000210262+8.06131e-05"
[1] "[73]:  valid's l2:0.000204809+7.71541e-05"
[1] "[74]:  valid's l2:0.000198144+7.26759e-05"
[1] "[75]:  valid's l2:0.000192143+7.10996e-05"
[1] "[76]:  valid's l2:0.000185914+6.73203e-05"
[1] "[77]:  valid's l2:0.000180159+6.46153e-05"
[1] "[78]:  valid's l2:0.000175122+6.20043e-05"
[1] "[79]:  valid's l2:0.000169991+5.93545e-05"
[1] "[80]:  valid's l2:0.000165344+5.86973e-05"
[1] "[81]:  valid's l2:0.000160885+5.60808e-05"
[1] "[82]:  valid's l2:0.00015688+5.39479e-05"
[1] "[83]:  valid's l2:0.000152405+5.12417e-05"
[1] "[84]:  valid's l2:0.000148674+4.95601e-05"
[1] "[85]:  valid's l2:0.000144452+4.74966e-05"
[1] "[86]:  valid's l2:0.000140023+4.55681e-05"
[1] "[87]:  valid's l2:0.000135932+4.28883e-05"
[1] "[88]:  valid's l2:0.000131253+4.12862e-05"
[1] "[89]:  valid's l2:0.000127097+3.83581e-05"
[1] "[90]:  valid's l2:0.000123491+3.69218e-05"
[1] "[91]:  valid's l2:0.000119873+3.54353e-05"
[1] "[92]:  valid's l2:0.000116105+3.4529e-05"
[1] "[93]:  valid's l2:0.000113005+3.29312e-05"
[1] "[94]:  valid's l2:0.000110071+3.14197e-05"
[1] "[95]:  valid's l2:0.000107318+3.01238e-05"
[1] "[96]:  valid's l2:0.00010479+2.94182e-05"
[1] "[97]:  valid's l2:0.000102076+2.87784e-05"
[1] "[98]:  valid's l2:0.000100284+2.76207e-05"
[1] "[99]:  valid's l2:9.81008e-05+2.6275e-05"
[1] "[100]:  valid's l2:9.56005e-05+2.55846e-05"

So I think more details about your versions of R and Rtools would help identify the cause.
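A minimal sketch of how those details could be collected with base R only. The helper name `toolchain_info` is hypothetical, and `RTOOLS40_HOME` is assumed to be the environment variable set by the Rtools40 installer; it is reported as empty if it is not defined.

```r
# Hypothetical helper: gather basic toolchain details for a bug report.
toolchain_info <- function() {
  list(
    r_version = R.version.string,
    platform = R.version$platform,
    # Assumption: the Rtools40 installer defines this variable.
    rtools40_home = Sys.getenv("RTOOLS40_HOME", unset = ""),
    # Empty strings here mean the tool was not found on the PATH.
    tools_on_path = Sys.which(c("make", "g++", "sh"))
  )
}

str(toolchain_info())
```

Pasting this output alongside sessionInfo() shows both the R version and which compiler, if any, R would find.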

@dfalbel
Author

dfalbel commented Jul 20, 2021

Hi @shiyu1994 thanks for taking a look at this.

Here's the sessionInfo() of the system where I can reproduce the error:

Loading required package: R6
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0          parsnip_0.1.6    

loaded via a namespace (and not attached):
 [1] lattice_0.20-44   tidyr_1.1.3       fansi_0.5.0       utf8_1.2.1       
 [5] crayon_1.4.1      dplyr_1.0.7       grid_4.1.0        jsonlite_1.7.2   
 [9] lifecycle_1.0.0   magrittr_2.0.1    pillar_1.6.1      rlang_0.4.11     
[13] data.table_1.14.0 Matrix_1.3-3      vctrs_0.3.8       generics_0.1.0   
[17] ellipsis_0.3.2    tools_4.1.0       glue_1.4.2        purrr_0.3.4      
[21] compiler_4.1.0    pkgconfig_2.0.3   tidyselect_1.1.1  tibble_3.1.2 

This is using master lightgbm too.
Here's a link to the GHA run that reproduces the failure: https://github.com/curso-r/treesnip/runs/3116229035?check_suite_focus=true

@dfsnow

dfsnow commented Jul 22, 2021

I'm seeing the same issue. I'm guessing this may be related to #4007 and #4259. Some further details:

No crash

Running the script below from a clean install in a new project with renv enabled works for 3.2.1.99. See sessionInfo() below.

library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)  
Session Info
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0      

loaded via a namespace (and not attached):
[1] compiler_4.1.0    Matrix_1.3-3      tools_4.1.0       grid_4.1.0        data.table_1.14.0
[6] jsonlite_1.7.2    renv_0.13.2       lattice_0.20-44  

Installing parsnip and loading it after lightgbm likewise does not result in a crash.

library(lightgbm)
library(parsnip)

data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)  
Session Info
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] parsnip_0.1.7  lightgbm_3.2.1.99 R6_2.5.0      

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1    tidyselect_1.1.1  lattice_0.20-44   rlang_0.4.11      fansi_0.5.0      
 [6] dplyr_1.0.7       tools_4.1.0       hardhat_0.1.6     grid_4.1.0        data.table_1.14.0
[11] utf8_1.2.1        ellipsis_0.3.2    tibble_3.1.2      lifecycle_1.0.0   crayon_1.4.1     
[16] Matrix_1.3-3      purrr_0.3.4       tidyr_1.1.3       vctrs_0.3.8       glue_1.4.2       
[21] compiler_4.1.0    pillar_1.6.1      generics_0.1.0    jsonlite_1.7.2    renv_0.13.2      
[26] pkgconfig_2.0.3 

Crash

However, loading parsnip before lightgbm results in a crash at the lgb.cv step.

library(parsnip)
library(lightgbm)

data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)  
Session Info
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0       parsnip_0.1.7 

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1    tidyselect_1.1.1  lattice_0.20-44   rlang_0.4.11      fansi_0.5.0      
 [6] dplyr_1.0.7       tools_4.1.0       parallel_4.1.0    hardhat_0.1.6     grid_4.1.0       
[11] data.table_1.14.0 utf8_1.2.1        ellipsis_0.3.2    tibble_3.1.2      lifecycle_1.0.0  
[16] crayon_1.4.1      Matrix_1.3-3      purrr_0.3.4       tidyr_1.1.3       vctrs_0.3.8      
[21] glue_1.4.2        compiler_4.1.0    pillar_1.6.1      generics_0.1.0    jsonlite_1.7.2   
[26] renv_0.13.2       pkgconfig_2.0.3  

Notes

  • Once lightgbm has crashed once due to parsnip, it crashes permanently for me regardless of whether or not parsnip is loaded again (even the first script does not work again after a crash).
  • Reinstalling lightgbm via renv::install("lightgbm", rebuild = TRUE) seems to fix this problem for both the CRAN and GitHub versions.

Edit

Did a quick trip through the Imports of parsnip, loading each library before lightgbm 1-by-1. The following libraries cause crashes:

dplyr (1.0.7)
hardhat (0.1.6)
tibble (3.1.2)
tidyr (1.1.3)

While the following cause no issues:

generics (0.1.0)
globals (0.14.0)
glue (1.4.2)
lifecycle (1.0.0)
magrittr (2.0.1)
prettyunits (1.1.1)
purrr (0.3.4)
rlang (0.4.11)
stats
utils
vctrs (0.3.8)
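The one-by-one check described above could be automated roughly as follows. This is only a sketch: the helper name `status_when_loaded_first` is hypothetical, and the example uses base packages with a trivial payload so that it runs anywhere. For the actual investigation, `code` would be the lgb.Dataset() / lgb.cv() reproduction and the candidates would be the Imports of parsnip.

```r
# Hypothetical helper: attach `pkg` first in a fresh R process, then run
# `code`, and return the exit status of that process. A non-zero status
# suggests a failure or crash in the child process.
status_when_loaded_first <- function(pkg, code) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script), add = TRUE)
  writeLines(c(sprintf("library(%s)", pkg), code), script)
  rscript <- file.path(R.home("bin"), "Rscript")
  as.integer(system2(rscript, c("--vanilla", shQuote(script)),
                     stdout = FALSE, stderr = FALSE))
}

# Placeholder candidates and payload, used only for illustration.
candidates <- c("stats", "utils")
vapply(candidates, status_when_loaded_first, integer(1),
       code = "invisible(TRUE)")
```

Running each candidate in its own process matters here, because a crashed session cannot report on the next candidate.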

I then traveled through the dependencies of tibble and dplyr to find the lowest level library call that will cause a crash. Seems like fansi may be the actual culprit. The script below causes a crash for me in a fresh environment with lightgbm 3.2.1 (from CRAN) and 3.2.1.99 (from GitHub)

library(fansi)
library(lightgbm)

data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression"
    , metric = "l2"
  )
  , data = dtrain
)  
Session Info
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0       fansi_0.5.0   

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1    tidyselect_1.1.1  lattice_0.20-44   rlang_0.4.11      stringr_1.4.0    
 [6] dplyr_1.0.6       tools_4.1.0       grid_4.1.0        parallel_4.1.0    data.table_1.14.0
[11] audio_0.1-7       utf8_1.2.1        DBI_1.1.1         ellipsis_0.3.2    assertthat_0.2.1 
[16] tibble_3.1.2      lifecycle_1.0.0   crayon_1.4.1      Matrix_1.3-3      beepr_1.3        
[21] purrr_0.3.4       vctrs_0.3.8       glue_1.4.2        ccao_0.5.1        stringi_1.6.2    
[26] compiler_4.1.0    pillar_1.6.1      generics_0.1.0    jsonlite_1.7.2    pkgconfig_2.0.3 

@jameslamb
Collaborator

Thanks to everyone participating for your help and investigation!

I am planning to test some theories about this tomorrow when I have some time and easy access to a Windows environment.


"Here's a link to the GHA run that reproduces the failure" (quoting @dfalbel)

@dfalbel , I looked at the definition of that GHA job (https://github.com/curso-r/treesnip/actions/runs/1049626089/workflow). I noticed that there's a call of remotes::install_deps() in a stage earlier than the Install dev lightgbm step. Since {lightgbm} is a dependency of {treesnip}, that step is going to install {lightgbm} from CRAN.

Seen in the logs for that step: https://github.com/curso-r/treesnip/runs/3116229035?check_suite_focus=true.

trying URL 'https://cloud.r-project.org/bin/windows/contrib/4.1/lightgbm_3.2.1.zip'
Content type 'application/zip' length 3335183 bytes (3.2 MB)
==================================================
downloaded 3.2 MB

If possible, could you try adding remove.packages("lightgbm") to the beginning of the install dev lightgbm step in that GHA job, and let me know if the issue still persists? I'm wondering if there's something being left behind from the CRAN install that is conflicting with the build from source.
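One way to check for a leftover copy is to list every installed copy of the package across all library paths. A minimal sketch, assuming a hypothetical helper named `installed_copies`; "stats" is used in the example only because it is always installed, and in this situation the argument would be "lightgbm".

```r
# Hypothetical helper: list every installed copy of a package across all
# library paths, to spot a CRAN build sitting next to a source build.
installed_copies <- function(pkg) {
  ip <- installed.packages()
  rows <- ip[ip[, "Package"] == pkg, c("Package", "LibPath", "Version"),
             drop = FALSE]
  # Drop row names so that duplicate package names cannot clash.
  rownames(rows) <- NULL
  as.data.frame(rows, stringsAsFactors = FALSE)
}

installed_copies("stats")
```

More than one row for the same package would mean that R could pick up a different build depending on the order of .libPaths().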


@dfalbel and @dfsnow if you have time, could you also confirm whether you are using RStudio, and if so, whether your examples also produce this issue when that code is stored in a script and run with Rscript --vanilla test-code.R?

I suspect that both of you are using RStudio but @shiyu1994 did not in #4464 (comment), so I'd like to see if that is relevant.

@dfalbel
Author

dfalbel commented Jul 25, 2021

Hi @jameslamb, thanks for looking at this!

I have added the remove.packages("lightgbm") call and the error still persists: https://github.com/curso-r/treesnip/runs/3154997458?check_suite_focus=true#step:11:50
I think install.packages ultimately always removes the existing package folder before installing the package again.

For the second question, I can confirm that the error happens both in RStudio and in a vanilla R session:

$ Rscript --vanilla R/test.R
Loading required package: R6
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0          parsnip_0.1.6

loaded via a namespace (and not attached):
 [1] lattice_0.20-44   tidyr_1.1.3       fansi_0.5.0       utf8_1.2.1
 [5] crayon_1.4.1      dplyr_1.0.7       grid_4.1.0        jsonlite_1.7.2
 [9] lifecycle_1.0.0   magrittr_2.0.1    pillar_1.6.1      rlang_0.4.11
[13] data.table_1.14.0 Matrix_1.3-3      vctrs_0.3.8       generics_0.1.0
[17] ellipsis_0.3.2    tools_4.1.0       glue_1.4.2        purrr_0.3.4
[21] compiler_4.1.0    pkgconfig_2.0.3   tidyselect_1.1.1  tibble_3.1.2
Segmentation fault 

@jameslamb
Collaborator

jameslamb commented Jul 25, 2021

I was able to reproduce this issue today using the latest master of LightGBM.

Environment info and install instructions

I installed {lightgbm} from source on Windows 10 like so:

git clone --recursive git@github.com:microsoft/LightGBM.git
cd LightGBM
Rscript --vanilla -e "remove.packages('lightgbm')"
Rscript --vanilla -e "install.packages(c('R6', 'data.table', 'jsonlite'), repos = 'https://cran.r-project.org')"
Rscript --vanilla -e "install.packages(c('fansi'), repos = 'https://cran.r-project.org')"
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

I'm using Rtools40 downloaded on May 9, 2020, so not the newest one. As far as I can tell from https://cran.r-project.org/bin/windows/Rtools/history.html, it isn't possible to access previous versions of Rtools.

Output of sessionInfo().

Rscript -e "library(fansi); library(lightgbm); sessionInfo()"
Loading required package: R6
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lightgbm_3.2.1.99 R6_2.5.0          fansi_0.5.0

loaded via a namespace (and not attached):
[1] compiler_4.1.0    Matrix_1.3-3      grid_4.1.0        data.table_1.14.0
[5] jsonlite_1.7.2    lattice_0.20-44

Thanks to the helpful contributions of @dfalbel and @dfsnow so far, I was able to reduce this to an even smaller reproducible example, cutting out lgb.cv().

Running the script below with Rscript --vanilla test.R produces a segfault when dtrain$construct() is called.

library(fansi)
library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
dtrain$construct()

Next, I'll add a ton of logging to dataset construction to try to narrow down the issue further. I also plan to inspect fansi::.onLoad() and fansi::.onAttach(). Updates to follow!

@jameslamb
Collaborator

Alright I added a bunch of log statements and I think I've narrowed down the place where this segfault is being thrown.

I'm able to reproduce the issue on master using this further-simplified example that uses a standard R matrix instead of loading the agaricus dataset.

library(fansi)
library(lightgbm)

dtrain <- lgb.Dataset(
    data = matrix(rnorm(1000), nrow = 100)
    , label = rnorm(100)
)
dtrain$construct()

The segfault is being thrown from calls to Network::num_machines() in the Dataset loader. Right here:

if (Network::num_machines() > 1) {

I pushed a branch with all the extra logging and with those calls skipped. https://github.com/jameslamb/LightGBM/tree/misc/investigating-dataset-segfault. On that branch, the reproducible example runs successfully and does not produce a segfault.

Next, I'm going to try to figure out how the behavior of this code is changed by loading {fansi} and by the order of package loading.

@jameslamb
Collaborator

I'm convinced that the root of the problem is related to the way that R loads DLLs, and that @dfsnow is right that {lightgbm} and {fansi} are in conflict with each other somehow.

If {dplyr} is loaded before {lightgbm} but then the fansi DLL is unloaded before loading {lightgbm}, the reproducible example does not produce a segfault, and Dataset construction succeeds.

library(dplyr)
dyn.unload(file.path(.libPaths()[1], "fansi", "libs", "x64", "fansi.dll"))
library(lightgbm)
dtrain <- lgb.Dataset(
    data = matrix(rnorm(1000), nrow = 100)
    , label = rnorm(100)
)
dtrain$construct()

If {fansi}'s DLL is unloaded after loading {lightgbm}, that script produces a segfault at dtrain$construct().
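The script above hard-codes the DLL location under .libPaths()[1]. As a hedged alternative, the path can be looked up from the list of DLLs that R reports as loaded. The helper name `loaded_dll_path` is hypothetical. "stats" is used in the example because it is loaded in a default session, while in the experiment above the name would be "fansi" and the result could be passed to dyn.unload().

```r
# Hypothetical helper: return the on-disk path of a loaded package DLL,
# or NA if no DLL with that name is currently loaded.
loaded_dll_path <- function(name) {
  dlls <- getLoadedDLLs()
  if (name %in% names(dlls)) dlls[[name]][["path"]] else NA_character_
}

loaded_dll_path("stats")
```

Note that `[["path"]]` is used rather than `$path`, since `$` on a DLLInfo object is used to resolve native symbols.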

This finding plus the finding from #4464 (comment) that commenting out Network::num_machines() causes Dataset construction to succeed has led me to this working theory:

Something in {fansi}'s DLL conflicts with lightgbm.dll or IPHLPAPI.DLL or WS2_32.dll (two libraries linked in with {lightgbm} to support distributed training).

I'm going to investigate this more closely with dumpbin and listdlls to see if I can identify the conflicts. I'm also going to try changing some details of {fansi} based on the advice in "Writing R Extensions", especially https://cran.r-project.org/doc/manuals/R-exts.html#Controlling-visibility.

Updates to follow!

@jameslamb
Collaborator

Just to rule out another possibility like "loading any other package with compiled code before {lightgbm} is problematic"...I tried loading some other packages with compiled code before {lightgbm} and trying to construct a Dataset. These did not produce a segfault or any other issues.

I attempted {data.table} and {RPostgreSQL}, and checked that those packages' DLLs were loaded by running getLoadedDLLs().
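A small sketch of that kind of check, comparing the loaded DLLs before and after a package namespace is loaded. The helper name `new_dlls_after_loading` is hypothetical. "grid" is used only because it ships with R and contains compiled code; the packages tested above would be substituted in practice.

```r
# Hypothetical helper: report which DLLs appear after loading a package
# namespace, compared with the DLLs that were loaded beforehand.
new_dlls_after_loading <- function(pkg) {
  before <- names(getLoadedDLLs())
  loadNamespace(pkg)
  setdiff(names(getLoadedDLLs()), before)
}

# If the package was already loaded, the first result is simply empty.
new_dlls_after_loading("grid")
"grid" %in% names(getLoadedDLLs())
```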

@jameslamb jameslamb self-assigned this Jul 29, 2021
@jameslamb jameslamb changed the title [R-package] R package crashes on windows when loaded together with parsnip [R-package] R package crashes on windows when loaded together with {fansi} or anything that depends on it Jul 30, 2021
@StrikerRUS StrikerRUS reopened this Jul 30, 2021
@jameslamb
Collaborator

@dfalbel @dfsnow thanks very much for your patience. I think I found the problem and have a fix up. Could you please try installing from my branch and let me know if it seems to resolve the issue?

git clone --recursive https://github.com/microsoft/LightGBM.git --branch fix/network-setup
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

@dfsnow

dfsnow commented Aug 1, 2021

@jameslamb Your branch works for me. Fixes both {parsnip} and {fansi} with the following test script:

library(fansi)
library(lightgbm)
data(agaricus.train, package='lightgbm')
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.cv(
  params = list(
    objective = "regression", 
    metric = "l2"
  ) , 
  data = dtrain
)

Also works with renv. Thanks for the quick turnaround!

@dfalbel
Author

dfalbel commented Aug 1, 2021

Hey @jameslamb ! Thanks very much for the investigation and fix. I can confirm that this works great!

@vidarsumo

@dfalbel @dfsnow thanks very much for your patience. I think I found the problem and have a fix up. Could you please try installing from my branch and let me know if it seems to resolve the issue?

git clone --recursive https://github.com/microsoft/LightGBM.git --branch fix/network-setup
cd LightGBM
sh build-cran-package.sh
R CMD INSTALL lightgbm_3.2.1.99.tar.gz

When I run this

git clone --recursive https://github.com/microsoft/LightGBM.git --branch fix/network-setup
cd LightGBM
sh build-cran-package.sh

I get file not found

Removing files not needed for CRAN
Removing unknown pragmas in headers
File not found - *.h
File not found - *.h.bak

@jameslamb
Collaborator

jameslamb commented Aug 19, 2021

Some versions of the unix tools for Windows might have slightly different behavior. Can you try commenting out the uses of find in build-cran-package.sh?
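One possible explanation, not confirmed in this thread, is that the message "File not found - *.h" comes from the Windows FIND command rather than a Unix-style find. A quick way to see which executables the PATH resolves, sketched with base R only (the result may differ from what sh sees if the shell session sets its own PATH):

```r
# Show which executables the current PATH resolves for tools that the
# build script appears to use. On Windows, a "find" located under
# C:\Windows\System32 would be the Windows FIND command, not a
# Unix-style find from an MSYS2-based toolset.
resolved <- Sys.which(c("find", "sed", "sh"))
print(resolved)
```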

@vidarsumo

I commented this out (if I understood you correctly)
find . -name '*.h.bak' -o -name '*.hpp.bak' -o -name '*.cpp.bak' -exec rm {} \;

Then I ran sh build-cran-package.sh and got

Removing files not needed for CRAN
Removing unknown pragmas in headers
File not found - *.h
Changing lib_lightgbm to lightgbm
Cleaning sed backup files
* checking for file 'lightgbm_r/DESCRIPTION' ... OK
* preparing 'lightgbm':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* looking to see if a 'data/datalist' file should be added
* building 'lightgbm_3.2.1.99.tar.gz'
Warning: file 'lightgbm/cleanup' did not have execute permissions: corrected
Warning: file 'lightgbm/configure' did not have execute permissions: corrected

I tried to run this R CMD INSTALL lightgbm_3.2.1.99.tar.gz but got:

* installing to library 'C:/Users/vidar/Documents/R/win-library/4.0'
* installing *source* package 'lightgbm' ...
** using staged installation
checking whether MM_PREFETCH works...no
checking whether MM_MALLOC works...no
** libs

*** arch - i386
C:/rtools40/usr/mingw_32/bin/g++  -std=gnu++11 -I"C:/PROGRA~1/R/R-4.0.1/include" -DNDEBUG -I./include -DEIGEN_MPL2_ONLY -DUSE_SOCKET -DLGB_R_BUILD      -fopenmp -pthread   -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign -c boosting/boosting.cpp -o boosting/boosting.o
sh: C:/rtools40/usr/mingw_32/bin/g++: No such file or directory
make: *** [C:/PROGRA~1/R/R-4.0.1/etc/i386/Makeconf:229: boosting/boosting.o] Error 127
ERROR: compilation failed for package 'lightgbm'
* removing 'C:/Users/vidar/Documents/R/win-library/4.0/lightgbm'
* restoring previous 'C:/Users/vidar/Documents/R/win-library/4.0/lightgbm'

@jameslamb
Collaborator

C:/rtools40/usr/mingw_32/bin/g++: No such file or directory

That doesn't look specific to {lightgbm}. I expect that if you run install.packages("xgboost", type = "source", repos = "https://cran.r-project.org") (for example), you will hit a similar error.

When using R 4.x on Windows, if you plan to install packages from source it's expected that you have installed Rtools (available from https://cran.r-project.org/bin/windows/Rtools/) at C:/rtools40. You may not have encountered Rtools before if you are not a package developer and have only installed packages from CRAN, since CRAN publishes precompiled packages for Windows.

@jameslamb
Collaborator

Also, I just noticed that the error message is about building the 32-bit version of the library (arch - i386).

If you DO have Rtools installed but during the installation you chose to only install the 64-bit components, then use R CMD INSTALL --no-multiarch to skip building the 32-bit version of {lightgbm}.
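As a rough sketch of that decision, the snippet below prints --no-multiarch only when no 32-bit g++ is found. The function name install_flags is hypothetical, and the directory to inspect is passed in as an argument because the location of the 32-bit compiler depends on how Rtools was installed. It also assumes the R on PATH is the 64-bit build, and that a POSIX-style shell (for example the bash that ships with Rtools) is available.

```shell
# Hypothetical helper, not part of R, Rtools, or LightGBM.
# $1: directory expected to hold the 32-bit compiler (an assumption you supply).
# Prints "--no-multiarch" when no g++ is found there, otherwise prints nothing.
install_flags() {
    bin32="$1"
    if [ ! -e "$bin32/g++" ] && [ ! -e "$bin32/g++.exe" ]; then
        echo "--no-multiarch"
    fi
}
```

It could then be used as R CMD INSTALL $(install_flags "<dir-with-32-bit-g++>") lightgbm_3.2.1.99.tar.gz, where the directory placeholder has to be filled in for your own machine.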

@vidarsumo

I do have Rtools 4.0 installed, but there is no mingw_64 folder under C:/rtools40/usr/.
I tried R CMD INSTALL lightgbm_3.2.1.99.tar.gz --no-multiarch and got this error: C:/rtools40/usr/mingw_64/bin/g++: No such file or directory

/mingw_64/bin/g++ does exist, but not under /usr/. It's located in the root, /rtools40/.

@jameslamb
Collaborator

jameslamb commented Sep 2, 2021

Oh! I see now. I think you might have downloaded Rtools35 and installed it in the directory C:/rtools40, and the build might be failing because you're mixing R 4.x with Rtools35.

Rtools35 has folders (at the root of the Rtools install) named mingw_32/ and mingw_64/, while Rtools40 names them mingw32/ and mingw64/ (no underscore) and keeps its other build utilities under usr/bin/.

I know this for sure because we use those paths in this project's CI:

if ($env:R_MAJOR_VERSION -eq "3") {
    # Rtools 3.x has to be installed at C:\Rtools\
    # * https://stackoverflow.com/a/46619260/3986677
    $RTOOLS_INSTALL_PATH = "C:\Rtools"
    $env:RTOOLS_BIN = "$RTOOLS_INSTALL_PATH\bin"
    $env:RTOOLS_MINGW_BIN = "$RTOOLS_INSTALL_PATH\mingw_64\bin"
} elseif ($env:R_MAJOR_VERSION -eq "4") {
    $RTOOLS_INSTALL_PATH = "C:\rtools40"
    $env:RTOOLS_BIN = "$RTOOLS_INSTALL_PATH\usr\bin"
    $env:RTOOLS_MINGW_BIN = "$RTOOLS_INSTALL_PATH\mingw64\bin"
}

You might need to visit https://cran.r-project.org/bin/windows/Rtools/ and get the newest version of Rtools.
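If it helps, here is a minimal sketch for checking which of the two layouts is actually present under a given directory. It only looks at folder names, following the paths in the CI snippet above (mingw_64 for Rtools 3.x, mingw64 for Rtools40). The function name detect_rtools_layout is made up for illustration, and it assumes a POSIX-style shell such as the bash that ships with Rtools.

```shell
# Hypothetical helper, not part of R, Rtools, or LightGBM.
# Guess the Rtools generation under a directory from its MinGW folder names:
#   Rtools 3.x : <root>/mingw_32, <root>/mingw_64   (underscore)
#   Rtools40   : <root>/mingw32,  <root>/mingw64    (no underscore)
detect_rtools_layout() {
    root="$1"
    if [ -d "$root/mingw_64" ] || [ -d "$root/mingw_32" ]; then
        echo "rtools35-style"
    elif [ -d "$root/mingw64" ] || [ -d "$root/mingw32" ]; then
        echo "rtools40-style"
    else
        echo "unknown"
    fi
}
```

Calling detect_rtools_layout on the install directory (for example /c/rtools40 from the Rtools bash shell) and getting "rtools35-style" would suggest that an older Rtools was installed into the Rtools40 location.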

And you might find some of the discussion about similar issues (a path for Rtools being assumed and hard-coded into some versions of R) at https://stackoverflow.com/questions/39090983/rcpp-rtools-installed-but-error-message-g-not-found.

Can you also please try installing another package requiring compilation from source?

Rscript -e "install.packages('data.table', type = 'source', repos = 'https://cran.r-project.org')"

I expect you'll experience this same problem doing that, and if you do then I think that would confirm that this isn't an issue with {lightgbm} specifically but with your local setup generally.

@vidarsumo

After solving a problem related to rtools everything works now :)

This was installed for R-4.0.x even though I have R-4.1.x installed. Is this not supported for R 4.1.x?

@jameslamb
Collaborator

This was installed for R-4.0.x even though I have R-4.1.x installed. Is this not supported for R 4.1.x?

I'm not sure what you mean by this statement, sorry. If you have multiple versions of R on your system, please examine the PATH environment variable to see which version(s) are on PATH and in which order.

You might also try the following from a command prompt to inspect which version of R is first on your PATH.

# version of R
Rscript --version

# where the R executables are
Rscript -e "print(R.home())"

# where packages will be installed to / loaded from
Rscript -e "print(.libPaths())"
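The commands above only report the first R found on PATH. To see every installation that PATH could resolve, in lookup order, something like the following sketch may help. The function name find_all_on_path is hypothetical, and it assumes a colon-separated PATH as seen from a POSIX-style shell (for example the Rtools bash shell); a plain Windows command prompt uses semicolons and would need a different approach.

```shell
# Hypothetical helper, not part of R or Rtools.
# $1: program name, e.g. Rscript
# $2: optional PATH-style string (defaults to the current $PATH)
# Prints each directory that contains the program, in lookup order,
# so the first line printed is the one the shell would actually use.
# The body runs in a subshell "( ... )" so the IFS change does not leak out.
find_all_on_path() (
    prog="$1"
    path_value="${2:-$PATH}"
    IFS=":"
    set -f   # disable glob expansion while the path string is split
    for dir in $path_value; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir="."
        if [ -f "$dir/$prog" ] || [ -f "$dir/$prog.exe" ]; then
            printf '%s\n' "$dir"
        fi
    done
)
```

Running find_all_on_path Rscript and seeing more than one line would indicate that several R versions are on PATH, with the first one taking precedence.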

@vidarsumo

I have multiple versions. 4.0.1 was first on PATH.
And running Rscript --version gave this:

Rscript --version
R scripting front-end version 4.0.1 (2020-06-06)

Didn't know about it. Thanks for the help.

@jameslamb
Collaborator

Now that #4496 has been merged, I believe this issue has been resolved.

Thanks so much to everyone involved here for your help with reproducible examples and debugging ideas!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023