Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] prevent symbol lookup conflicts (fixes #4045) #4155

Merged
merged 13 commits into from
Apr 30, 2021

Conversation

jameslamb
Copy link
Collaborator

@jameslamb jameslamb commented Apr 2, 2021

This PR fixes a critical issue affecting Windows users of the R package. In #4045, several users reported a problem like "if I load certain R libraries before {lightgbm}, {lightgbm} can cause the R session to crash when constructing a dataset".

After some investigation (detailed in #4045 (comment)), I believe I found the root cause for this. The routine R_registerRoutines() at

R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
isn't actually being called by R!

The key piece of documentation is at https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Converting-a-package-to-use-registration.

Construct the registration table ... if included in a C++ file the ‘R_init’ function would need to be declared extern "C"):

This PR adds that declaration to lightgbm_R.cpp and adds a unit test to prevent this type of mistake from being re-introduced in the future.

Fixes #4045.
Addresses the Windows half of #4034.

I need to test if this would fix #4007. Will update this description once I try those.


Background

R provides an interface for "native routine registration", a mechanism to tell R exactly which symbols it should expect to find in a shared library (.dll / .so). This makes calls to compiled code with .Call() safer because R looks in a specific place.

If that type of registration isn't done, then it can create the possibility for conflicts with other loaded DLLs. From https://cran.r-project.org/doc/manuals/r-release/R-exts.html#dyn_002eload-and-dyn_002eunload:

If a shared object/DLL is loaded more than once the most recent version is used. More generally, if the same symbol name appears in several shared objects, the most recently loaded occurrence is used. The PACKAGE argument and registration (see the next section) provide good ways to avoid any ambiguity in which occurrence is meant.

This note from https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Registering-native-routines is relevant as well.

In calls to .C, .Call, .Fortran and .External, R must locate the specified native routine by looking in the appropriate shared object/DLL. By default, R uses the operating-system-specific dynamic loader to lookup the symbol in all132 loaded DLLs and the R executable or libraries it is linked to. Alternatively, the author of the DLL can explicitly register routines with R and use a single, platform-independent mechanism for finding the routines in the DLL. One can use this registration mechanism to provide additional information about a routine, including the number and type of the arguments, and also make it available to R programmers under a different name.


How to test this

You can run the following code to check if registration was successful.

library(lightgbm)
dll_info <- getLoadedDLLs()[["lightgbm"]]

# check that dynamic lookup has been disabled
!isTRUE(dll_info[["dynamicLookup"]])

# check that all the expected symbols show up
getDLLRegisteredRoutines(dll_info[["path"]])
                               .Call .Call.numParameters
1                LGBM_GetLastError_R                   3
2       LGBM_DatasetCreateFromFile_R                   5
3        LGBM_DatasetCreateFromCSC_R                  10
4        LGBM_DatasetCreateFromMat_R                   7
5            LGBM_DatasetGetSubset_R                   6
6      LGBM_DatasetSetFeatureNames_R                   3
7      LGBM_DatasetGetFeatureNames_R                   5
8           LGBM_DatasetSaveBinary_R                   3
9                 LGBM_DatasetFree_R                   2
10            LGBM_DatasetSetField_R                   5
11        LGBM_DatasetGetFieldSize_R                   4
12            LGBM_DatasetGetField_R                   4
13 LGBM_DatasetUpdateParamChecking_R                   3
14          LGBM_DatasetGetNumData_R                   3
15       LGBM_DatasetGetNumFeature_R                   3
16              LGBM_BoosterCreate_R                   4
17                LGBM_BoosterFree_R                   2
18 LGBM_BoosterCreateFromModelfile_R                   3
19 LGBM_BoosterLoadModelFromString_R                   3
20               LGBM_BoosterMerge_R                   3
21        LGBM_BoosterAddValidData_R                   3
22   LGBM_BoosterResetTrainingData_R                   3
23      LGBM_BoosterResetParameter_R                   3
24       LGBM_BoosterGetNumClasses_R                   3
25       LGBM_BoosterUpdateOneIter_R                   2
26 LGBM_BoosterUpdateOneIterCustom_R                   5
27     LGBM_BoosterRollbackOneIter_R                   2
28 LGBM_BoosterGetCurrentIteration_R                   3
29  LGBM_BoosterGetUpperBoundValue_R                   3
30  LGBM_BoosterGetLowerBoundValue_R                   3
31        LGBM_BoosterGetEvalNames_R                   5
32             LGBM_BoosterGetEval_R                   4
33       LGBM_BoosterGetNumPredict_R                   4
34          LGBM_BoosterGetPredict_R                   4
35      LGBM_BoosterPredictForFile_R                  11
36      LGBM_BoosterCalcNumPredict_R                   9
37       LGBM_BoosterPredictForCSC_R                  15
38       LGBM_BoosterPredictForMat_R                  12
39           LGBM_BoosterSaveModel_R                   5
40   LGBM_BoosterSaveModelToString_R                   7
41           LGBM_BoosterDumpModel_R                   

# like the one documented in
# https://github.com/microsoft/LightGBM/issues/4045#issuecomment-812289182
test_that("lightgbm routine registration worked", {
testthat::skip_if_not(ON_WINDOWS)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this skip_if_not() is to work around this error I saw on Linux jobs in CI

* checking tests ...
  Running ‘testthat.R’
 � � ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  ══ Failed tests ════════════════════════════════════════════════════════════════
  ── Failure (test_registration.R:8:5): lightgbm routine registration worked ─────
  dll_info[["dynamicLookup"]] is not FALSE
  
  `actual` is NULL
  `expected` is a logical vector (FALSE)
  ── Error (test_registration.R:11:5): lightgbm routine registration worked ──────
  Error: no applicable method for 'getDLLRegisteredRoutines' applied to an object of class "NULL"
  Backtrace:
      █
   1. └─base::getDLLRegisteredRoutines(dll_info[["path"]]) test_registration.R:11:4
  
  [ FAIL 2 | WARN 0 | SKIP 3 | PASS 634 ]
  Error: Test failures
  Execution halted

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah! The problem with this test isn't actually about operating system. This test will fail with an error like the one above on LightGBM's CMake-based builds and pass on CRAN builds. That's because, by using install.libs.R to build lib_lightgbm instead of R's built-in machinery, the routine registration stuff doesn't work.

Given that as of this PR the package will now use literal R symbols (e.g. .Call(LGBM_BoosterGetCurrentIteration_R,...), I don't think this test is necessary. Any failures in the registration will result in "object not found errors".

I'm going to remove this test, and will provide more explanation in a separate comment for how to make this work with CMake-based builds.

@jameslamb
Copy link
Collaborator Author

hmmm, seeing a few things in CI that I didn't see locally.

* checking foreign function calls ... NOTE

Registration problems:

  symbol 'fun_name' not in namespace:

   .Call(fun_name, ..., ret, call_state, PACKAGE = "lightgbm")

  symbol 'fun_name' not in namespace:

   .Call(fun_name, ..., call_state, PACKAGE = "lightgbm")

See chapter 'System and foreign language interfaces' in the 'Writing R

Extensions' manual.

I'm going to remove reviewers and move this back to draft for now

@jameslamb jameslamb marked this pull request as draft April 2, 2021 04:49
@StrikerRUS

This comment has been minimized.

@jameslamb jameslamb changed the title [R-package] prevent symbol lookup conflicts (fixes #4045) WIP: [R-package] prevent symbol lookup conflicts (fixes #4045) Apr 2, 2021
@jameslamb

This comment has been minimized.

@StrikerRUS

This comment has been minimized.

@StrikerRUS
Copy link
Collaborator

Can resolving #3016 help with this issue as well?

@jameslamb
Copy link
Collaborator Author

Can resolving #3016 help with this issue as well?

Right now, I don't think so.

I think the problem is that {lightgbm} wraps .Call() calls in a function to handle errors called lgb.call() (

lgb.call <- function(fun_name, ret, ...) {
), but if you are using routine registration then R CMD check --as-cran expects that the first argument to .Call() is one of the symbols you've registered. Like this: https://github.com/Rdatatable/data.table/blob/ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4/R/bmerge.R#L181.

I'm experimenting with this today. I think error handling can be moved into the C++ side and then lgb.call() can be factored out. It's an internal function so there wouldn't be any user-facing impact.

@jameslamb
Copy link
Collaborator Author

jameslamb commented Apr 26, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/783943451

Status: success ✔️.

@jameslamb jameslamb marked this pull request as ready for review April 26, 2021 03:44
@jameslamb jameslamb requested a review from StrikerRUS as a code owner April 26, 2021 03:44
@jameslamb
Copy link
Collaborator Author

Ok, I think that this PR is ready for review! I know the description has a lot of information and this requires some deep knowledge of how R's build system works, so I'll summarize here:

  1. Today, {lightgbm} code that calls C++ functions with .Call() is using an unsafe strategy to search for symbols, and that can lead to serious conflicts with other loaded packages (see [R-package] FATAL Error when run train() #4045)
  2. The preferred, and safer, way to do this is to use "routine registration" (documented in the PR description)
  3. When you use registration, R will create R objects in the package's namespace and expect you to use them. So, for example, for the C++ function LGBM_DatasetCreateFromCSC_R, you have to use .Call(LGBM_DatasetCreateFromCSC_R, ...), not .Call("LGBM_DatasetCreateFromCSC_R", ..., PACKAGE = "lightgbm").
  4. When you use install.libs.R to build a custom .so/.dll (as {lightgbm} does for CMake-based builds), this automatic creation of R objects will not work. To force that to be done, you can use code like useDynLib(lib_lightgbm, LGBM_DatasetCreateFromCSC_R, ...)

As a result of the changes in this PR:

  1. {lightgbm} should be safer to use with other packages, and the critical issue [R-package] FATAL Error when run train() #4045 should be fixed
  2. {lightgbm}'s CRAN package will be structured in a way that is closer to the recommended practices documented in https://cran.r-project.org/doc/manuals/r-release/R-exts.html

@jameslamb jameslamb changed the title WIP: [R-package] prevent symbol lookup conflicts (fixes #4045) [R-package] prevent symbol lookup conflicts (fixes #4045) Apr 26, 2021
@jameslamb jameslamb requested review from Laurae2 and shiyu1994 April 26, 2021 03:45
@@ -59,53 +57,6 @@ lgb.last_error <- function() {

}

lgb.call <- function(fun_name, ret, ...) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lgb.call() functions can be removed because once routine registration is used, referencing routines using strings does not work: #4155 (comment).

@@ -13,7 +13,12 @@ Dataset <- R6::R6Class(
if (!lgb.is.null.handle(x = private$handle)) {

# Freeing up handle
lgb.call(fun_name = "LGBM_DatasetFree_R", ret = NULL, private$handle)
call_state <- 0L
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though error-handling has now been moved into the C++ side (#4163), as of this PR passing in call_state is still needed. The functions in https://github.com/microsoft/LightGBM/blob/b6c71e5e941b885b1a5169d460c430df9d392fa9/R-package/src/lightgbm_R.h has a return type LGBM_SE, which is the LightGBM equivalent of R's built-in SEXP type.

If you change those routines to all be void (since they modify their inputs by reference and don't technically have to refer to anything), errors are thrown about not being able to convert a null to SEXP.

So the call_state thing is a simple way to allocate an LGBM_SE (an R integer, in this case) that can then be returned.

As part of #3016 we'll be able to remove these unnecessary call_state things, but I believe they're harmless for now.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent improvement for the package! I'm approving based on the detailed PR description, linked paragraph of article in comments and green CI checks.
Just two minor comments below.

# CMake-based builds can't currently use R's builtin routine registration,
# so have to update NAMESPACE manually, with a statement like this:
#
# useDynLib(lib_lightgbm, LGBM_GetLastError_R, LGBM_DatasetCreateFromFile_R, ...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it will be easier to have this content by default in the file, and remove it for CRAN build?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm I'm not sure. This looks like more code than having this in the file by default and removing it for CRAN builds, but I thought this R code would be easier to write than the equivalent code with sed in build-cran-package.sh. Just because it involves replacing / writing literal . and literal ) / ( and I was intimidated by figuring out how to do the escaping correctly in sed haha.

Do you have a strong preference for this change? Otherwise, I'd prefer to leave this as-is for now and maybe make that change as some maintenance work in the future. Just because I'd like to keep pushing forward on #3016 and it doesn't seem like there is anything wrong about the current approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, this was just a proposal to simplify the code. I'm totally fine with the working current approach.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok great, thank you very much!

return(
lgb.call.return.str(
fun_name = "LGBM_BoosterSaveModelToString_R"
# Create buffer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe write a new helper function for calls with buffer allocation? I believe it will improve the code readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was exactly what lgb.call.return.str() was. But since you cannot pass a string with a function name to .Call() once registration is used (#4155 (comment)), I couldn't see a good way to keep a shared helper function.

I also think we'll be able to factor this code away completely soon, as part of #3016, by just creating the R character vector on the C++ side and returning it. See #4242 for a first example of that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, I see!

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
2 participants