[R-package] move creation of character vectors in some methods to C++ side #4256

Merged 17 commits on May 9, 2021

jameslamb
Collaborator

@jameslamb jameslamb commented May 5, 2021

Another step towards #3016.

Resolves #4155 (comment).

This PR affects the following functions that return an R character vector:

  • LGBM_DatasetGetFeatureNames_R()
  • LGBM_BoosterGetEvalNames_R()
  • LGBM_BoosterSaveModelToString_R()
  • LGBM_BoosterDumpModel_R()

Currently, a buffer and an R character vector are allocated on the R side, then the corresponding C++ functions are called to write model data to that buffer and, eventually, copy it into that character vector.

This PR proposes simplifying that interaction by just creating that character vector from the C++ side and returning it to R. This has the following benefits:

  • eliminates some unnecessary computation, like joining all eval names into one tab-separated string and then splitting it back apart on the R side
  • removes code from the R package that was involved in managing buffers
  • removes unnecessary code on the C++ side (EncodeChar, R_CHAR_PTR), replacing some LightGBM-custom stuff with standard routines available from R

References

@jameslamb jameslamb changed the title [R-package] move creation of character vectors in some methods to C++ side WIP: [R-package] move creation of character vectors in some methods to C++ side May 5, 2021
@jameslamb jameslamb marked this pull request as ready for review May 5, 2021 14:40
@jameslamb jameslamb requested a review from Laurae2 as a code owner May 5, 2021 14:40
@jameslamb jameslamb changed the title WIP: [R-package] move creation of character vectors in some methods to C++ side [R-package] move creation of character vectors in some methods to C++ side May 5, 2021
@jameslamb jameslamb requested review from StrikerRUS and shiyu1994 May 5, 2021 14:40
@jameslamb
Collaborator Author

jameslamb commented May 5, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/813773634

Status: success ✔️.

@jameslamb
Collaborator Author

jameslamb commented May 5, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/813773961

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-649945bfb882455db3350e91e7437cba
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-0f8bfa82997c4b5c8ef727c48a33a0ce
Reports also have been sent to LightGBM public e-mail: http://www.yopmail.com/lightgbm_rhub_checks
Status: success ✔️.

Collaborator

@StrikerRUS StrikerRUS left a comment

I'm confused by the removal of the buffer-reallocation code...

Resolved review threads (collapsed): R-package/src/lightgbm_R.h (2 threads)
  R_API_BEGIN();
  int64_t out_len = 0;
- int64_t buf_len = static_cast<int64_t>(Rf_asInteger(buffer_len));
+ int64_t buf_len = 1024 * 1024;
Collaborator

What if 1024 * 1024 is not enough to save a big model? With the "try with the default length, then retry with the actual length if that is not enough" logic (if (act_len > buf_len)) removed, this now looks like a regression compared to the current, fully correct implementation.

The C API docs say:

buffer_len – String buffer length, if buffer_len < out_len, you should re-allocate buffer
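
For readers unfamiliar with the two-call pattern those docs describe, here is a minimal Python sketch. It is not LightGBM code: `fake_save_model_to_string` is a hypothetical stand-in for a C API function that writes into a caller-owned buffer and reports the length it needed, and `save_model_to_string` is an illustrative wrapper that retries once with the reported size.

```python
import ctypes


def fake_save_model_to_string(model_text, buffer_len, out_buffer):
    """Hypothetical stand-in for a C API call that fills a caller-owned buffer.

    Returns the number of bytes needed, including the trailing NUL. The text
    is copied only when the buffer is large enough, mirroring the documented
    "if buffer_len < out_len, you should re-allocate buffer" contract.
    """
    encoded = model_text.encode("utf-8")
    out_len = len(encoded) + 1  # +1 for the NUL terminator
    if out_len <= buffer_len:
        out_buffer.value = encoded
    return out_len


def save_model_to_string(model_text, initial_len=1024 * 1024):
    """Try a default-sized buffer first, then retry once with the exact size."""
    buffer = ctypes.create_string_buffer(initial_len)
    out_len = fake_save_model_to_string(model_text, initial_len, buffer)
    if out_len > initial_len:
        # The first attempt was too small: re-allocate and call again.
        buffer = ctypes.create_string_buffer(out_len)
        fake_save_model_to_string(model_text, out_len, buffer)
    return buffer.value.decode("utf-8")
```

Without the `out_len > initial_len` branch, any model larger than the default buffer would not be returned in full, which is the regression described above.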

Collaborator Author

OHHHH I see now, thank you for that explanation. I misunderstood the purpose of the code on the R side that was calling this function twice.

Ok yes you're right, that work needs to be done here. Will update it.

Collaborator Author

@jameslamb jameslamb May 7, 2021

Alright, I've made these changes in recent commits.

I'm really glad you pointed this out, because it also made me realize an opportunity to allow larger feature names! Right now, the code in LGBM_DatasetGetFeatureNames_R on master will not allow any feature names longer than 256 characters.

const size_t reserved_string_size = 256;

CHECK_GE(reserved_string_size, required_string_size);

The code below throws an error on {lightgbm} 3.2.1, but works as of this branch.

library(lightgbm)

feature_names <- names(iris)
long_name <- paste0(rep("a", 1000L), collapse = "")
feature_names[1L] <- long_name
names(iris) <- feature_names
# check that feature name survived the trip from R to C++ and back
dtrain <- lgb.Dataset(
    data = as.matrix(iris[, -5L])
    , label = as.numeric(iris$Species) - 1L
)
dtrain$construct()
col_names <- dtrain$get_colnames()

# Error in lgb.call(fun_name = fun_name, ret = buf, ..., buf_len, act_len) : 
#  [LightGBM] [Fatal] Check failed: (reserved_string_size) >= (required_string_size) at lightgbm_R.cpp, line 177 .

But it should be possible to! That is based on this line from the C API docs:

* \param[out] out_buffer_len String sizes required to do the full string copies

So I've updated the calls to LGBM_DatasetGetFeatureNames_R and LGBM_BoosterGetEvalNames_R to retry with a larger buffer on long names.


I've added tests to this PR to check that this is working as expected.

  • LGBM_DatasetGetFeatureNames_R(): test_dataset.R
  • LGBM_BoosterGetEvalNames_R(): I could not find a way to generate a large string value for this, but I may be misunderstanding how LGBM_BoosterGetEvalNames works. I opened #4264 ("[docs] what should LGBM_BoosterGetEvalNames be used for?") with a question.
  • LGBM_BoosterSaveModelToString_R(): test_lgb.Booster.R
  • LGBM_BoosterDumpModel_R(): test_lgb.Booster.R

Collaborator

> So I've updated the calls to LGBM_DatasetGetFeatureNames_R and LGBM_BoosterGetEvalNames_R to retry with a larger buffer on long names.

Great!
I think we can do the same for the Python wrapper.

if reserved_string_buffer_size < required_string_buffer_size.value:
    raise BufferError(
        "Allocated feature name buffer size ({}) was inferior to the needed size ({})."
        .format(reserved_string_buffer_size, required_string_buffer_size.value)
    )

if reserved_string_buffer_size < required_string_buffer_size.value:
    raise BufferError(
        "Allocated feature name buffer size ({}) was inferior to the needed size ({})."
        .format(reserved_string_buffer_size, required_string_buffer_size.value)
    )

if reserved_string_buffer_size < required_string_buffer_size.value:
    raise BufferError(
        "Allocated eval name buffer size ({}) was inferior to the needed size ({})."
        .format(reserved_string_buffer_size, required_string_buffer_size.value)
    )
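
One way a wrapper could retry instead of raising, sketched under assumptions: `fake_get_feature_names` is a hypothetical stand-in for the C API call (it is not the real signature), and it reports the per-name buffer size it needs so the caller can re-allocate and call again. The number of names is taken from the input list here purely for illustration.

```python
import ctypes


def fake_get_feature_names(names, num_slots, slot_size, slots):
    """Hypothetical stand-in for a C API call that fills an array of string buffers.

    Returns (number of names, slot size required for the longest name,
    including the NUL). Names are truncated to fit their slot, so the caller
    must compare the reserved size against the required size.
    """
    required = max((len(n.encode("utf-8")) + 1 for n in names), default=1)
    for i in range(min(num_slots, len(names))):
        raw = names[i].encode("utf-8")
        slots[i].value = raw[: slot_size - 1]
    return len(names), required


def get_feature_names(names, reserved_size=256):
    """Retry with larger per-name buffers instead of raising BufferError."""
    num_slots = len(names)
    slots = [ctypes.create_string_buffer(reserved_size) for _ in range(num_slots)]
    _, required_size = fake_get_feature_names(names, num_slots, reserved_size, slots)
    if reserved_size < required_size:
        # At least one name was truncated: re-allocate every slot and call again.
        slots = [ctypes.create_string_buffer(required_size) for _ in range(num_slots)]
        fake_get_feature_names(names, num_slots, required_size, slots)
    return [slot.value.decode("utf-8") for slot in slots]
```

With this shape, a 1000-character feature name like the one in the R example above comes back intact after the second call rather than triggering an error.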

Collaborator Author

> I think we can do the same for the Python wrapper.

Sure! But I think it should be a separate PR. I'd really like to focus on finishing #3016 soon.

@jameslamb jameslamb changed the title [R-package] move creation of character vectors in some methods to C++ side WIP: [R-package] move creation of character vectors in some methods to C++ side May 7, 2021
@jameslamb
Collaborator Author

jameslamb commented May 7, 2021

~~I'm still experimenting with this, so marked it WIP to block merging even though other checks passed.~~

ok, this is ready for review

@jameslamb jameslamb changed the title WIP: [R-package] move creation of character vectors in some methods to C++ side [R-package] move creation of character vectors in some methods to C++ side May 7, 2021
@jameslamb
Collaborator Author

jameslamb commented May 7, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/821198971

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-a3cb7cb2e0a24887b228a5f4236e1139
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-0ef72af280ec41f490c3e6fcce6b16d2
Reports also have been sent to LightGBM public e-mail: http://www.yopmail.com/lightgbm_rhub_checks
Status: success ✔️.

@jameslamb
Collaborator Author

jameslamb commented May 7, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/821198127

Status: success ✔️.

Collaborator

@StrikerRUS StrikerRUS left a comment

Thank you very much! Looks great overall!
I left a few minor comments below.

Resolved review threads (collapsed):

  • R-package/src/lightgbm_R.h (3 threads)
  • R-package/tests/testthat/test_lgb.Booster.R (2 threads)
  • R-package/src/lightgbm_R.cpp (2 threads)

@jameslamb
Collaborator Author

jameslamb commented May 9, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/824536878

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-e88d6530114d47cba8c08d99946aa1c3
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-6c065f2869e04f8cadff9285970d28bc
Reports also have been sent to LightGBM public e-mail: http://www.yopmail.com/lightgbm_rhub_checks
Status: success ✔️.

@jameslamb
Collaborator Author

jameslamb commented May 9, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/824537273

Status: success ✔️.

Collaborator

@StrikerRUS StrikerRUS left a comment

Thanks for addressing comments!

@StrikerRUS
Collaborator

Just posted an issue in the Sphinx repository for the failing Static Analysis / check-docs (pull_request) job: sphinx-doc/sphinx#9188.

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023