-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-package] move creation of character vectors in some methods to C++ side #4256
Conversation
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
/gha run r-solaris Workflow Solaris CRAN check has been triggered! 🚀 solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-649945bfb882455db3350e91e7437cba |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused with removed reallocating buffer code...
R_API_BEGIN(); | ||
int64_t out_len = 0; | ||
int64_t buf_len = static_cast<int64_t>(Rf_asInteger(buffer_len)); | ||
int64_t buf_len = 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if 1024 * 1024
is not enough to save some big model? With removed "try with default len, repeat with actual if not enough" (if (act_len > buf_len)
) this now looks like a regression compared to the current fully correct implementation.
C API docs says:
buffer_len – String buffer length, if
buffer_len < out_len
, you should re-allocate buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OHHHH I see now, thank you for that explanation. I misunderstood the purpose of the code on the R side that was calling this function twice.
Ok yes you're right, that work needs to be done here. Will update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I've made these changes in recent commits.
I'm really glad you pointed this out, because it also made me realize an opportunity to allow larger feature names! Right now, the code in LGBM_DatasetGetFeatureNames_R
on master
will not allow any feature names longer than 256 characters.
LightGBM/R-package/src/lightgbm_R.cpp
Line 163 in f831808
const size_t reserved_string_size = 256; |
LightGBM/R-package/src/lightgbm_R.cpp
Line 179 in f831808
CHECK_GE(reserved_string_size, required_string_size); |
The code below throws an error on {lightgbm}
3.2.1, but works as of this branch.
library(lightgbm)
feature_names <- names(iris)
long_name <- paste0(rep("a", 1000L), collapse = "")
feature_names[1L] <- long_name
names(iris) <- feature_names
# check that feature name survived the trip from R to C++ and back
dtrain <- lgb.Dataset(
data = as.matrix(iris[, -5L])
, label = as.numeric(iris$Species) - 1L
)
dtrain$construct()
col_names <- dtrain$get_colnames()
# Error in lgb.call(fun_name = fun_name, ret = buf, ..., buf_len, act_len) :
# [LightGBM] [Fatal] Check failed: (reserved_string_size) >= (required_string_size) at lightgbm_R.cpp, line 177 .
But it should be possible to! Based on
LightGBM/include/LightGBM/c_api.h
Line 295 in f831808
* \param[out] out_buffer_len String sizes required to do the full string copies |
So I've updated the calls to LGBM_DatasetGetFeatureNames_R
and LGBM_BoosterGetEvalNames_R
to retry with a larger buffer on long names.
I've added tests to this PR to check that this is working as expected.
LGBM_DatasetGetFeatureNames_R()
:test_dataset.R
LGBM_BoosterGetEvalNames_R()
: I could not find a way to generate a large string value for this, but I might misunderstand howLGBM_BoosterGetEvalNames
works. Opened [docs] what should LGBM_BoosterGetEvalNames be used for? #4264 with a question.LGBM_BoosterSaveModelToString_R()
:test_lgb.Booster.R
LGBM_BoosterDumpModel_R()
:test_lgb.Booster.R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've updated the calls to
LGBM_DatasetGetFeatureNames_R
andLGBM_BoosterGetEvalNames_R
to retry with a larger buffer on long names.
Great!
I think we can do the same for Python wrapper.
LightGBM/python-package/lightgbm/basic.py
Lines 1896 to 1900 in f831808
if reserved_string_buffer_size < required_string_buffer_size.value: | |
raise BufferError( | |
"Allocated feature name buffer size ({}) was inferior to the needed size ({})." | |
.format(reserved_string_buffer_size, required_string_buffer_size.value) | |
) |
LightGBM/python-package/lightgbm/basic.py
Lines 3273 to 3277 in f831808
if reserved_string_buffer_size < required_string_buffer_size.value: | |
raise BufferError( | |
"Allocated feature name buffer size ({}) was inferior to the needed size ({})." | |
.format(reserved_string_buffer_size, required_string_buffer_size.value) | |
) |
LightGBM/python-package/lightgbm/basic.py
Lines 3472 to 3476 in f831808
if reserved_string_buffer_size < required_string_buffer_size.value: | |
raise BufferError( | |
"Allocated eval name buffer size ({}) was inferior to the needed size ({})." | |
.format(reserved_string_buffer_size, required_string_buffer_size.value) | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do the same for Python wrapper.
Sure! But I think it should be a separate PR. I'd really like to focus on finishing #3016 soon.
~ I'm still experimenting with this, so marked it WIP to block merging even though other checks passed.~ ok, this is ready for review |
/gha run r-solaris Workflow Solaris CRAN check has been triggered! 🚀 solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-a3cb7cb2e0a24887b228a5f4236e1139 |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much! Looks great overall!
I left a few minor comments below.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
/gha run r-solaris Workflow Solaris CRAN check has been triggered! 🚀 solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-e88d6530114d47cba8c08d99946aa1c3 |
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing comments!
Just posted issue in Sphinx repository for failing |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Another step towards #3016.
Resolves #4155 (comment).
This PR affects the following functions that return an R character vector:
LGBM_DatasetGetFeatureNames_R()
LGBM_BoosterGetEvalNames_R()
LGBM_BoosterSaveModelToString_R()
LGBM_BoosterDumpModel_R()
Currently, a buffer and an R character vector are allocated on the R side, then the corresponding C++ functions are called to write model data to that buffer and, eventually, copy it into that character vector.
This PR proposes simplifying that interaction by just creating that character vector from the C++ side and returning it to R. This has the following benefits:
EncodeChar
,R_CHAR_PTR
), replacing some LightGBM-custom stuff with standard routines available from RReferences