[R-package] move creation of character vectors in some methods to C++ side #4256

jameslamb · 2021-05-05T14:17:01Z

Another step towards #3016.

This PR affects the following functions that return an R character vector:

LGBM_DatasetGetFeatureNames_R()
LGBM_BoosterGetEvalNames_R()
LGBM_BoosterSaveModelToString_R()
LGBM_BoosterDumpModel_R()

Currently, a buffer and an R character vector are allocated on the R side, then the corresponding C++ functions are called to write model data to that buffer and, eventually, copy it into that character vector.

This PR proposes simplifying that interaction by just creating that character vector from the C++ side and returning it to R. This has the following benefits:

eliminates some unnecessary computation, like joining all eval names into one tab-separated string and then splitting it back apart on the R side
removes code from the R package that was involved in managing buffers
removes unnecessary code on the C++ side (EncodeChar, R_CHAR_PTR), replacing some LightGBM-custom stuff with standard routines available from R
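
For readers unfamiliar with the round trip mentioned in the first bullet, here is a minimal sketch of the pattern being removed. `JoinWithTabs` and `SplitOnTabs` are hypothetical helpers written for illustration only; they are not functions from LightGBM or from the R package.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: pack all names into one tab-separated string,
// as the old interface did before handing the result back to R.
std::string JoinWithTabs(const std::vector<std::string>& names) {
  std::string joined;
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (i > 0) joined += '\t';
    joined += names[i];
  }
  return joined;
}

// Hypothetical helper: split the packed string back into separate names,
// which is the work the caller had to redo on the other side.
std::vector<std::string> SplitOnTabs(const std::string& joined) {
  std::vector<std::string> names;
  std::istringstream stream(joined);
  std::string piece;
  while (std::getline(stream, piece, '\t')) names.push_back(piece);
  return names;
}
```

Returning the vector of names directly makes both helpers unnecessary, and it also avoids the failure mode where a name that itself contains a tab would be split in two.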

References

https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Handling-character-data

jameslamb · 2021-05-05T14:41:01Z

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/813773634

Status: success ✔️.

jameslamb · 2021-05-05T14:41:09Z

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/813773961

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-649945bfb882455db3350e91e7437cba
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-0f8bfa82997c4b5c8ef727c48a33a0ce
Reports also have been sent to LightGBM public e-mail: http://www.yopmail.com/lightgbm_rhub_checks
Status: success ✔️.

StrikerRUS

I'm confused by the removal of the buffer-reallocation code...

R-package/src/lightgbm_R.h

StrikerRUS · 2021-05-06T18:06:46Z

R-package/src/lightgbm_R.cpp

  R_API_BEGIN();
  int64_t out_len = 0;
-  int64_t buf_len = static_cast<int64_t>(Rf_asInteger(buffer_len));
+  int64_t buf_len = 1024 * 1024;


What if 1024 * 1024 is not enough to save some big model? With the "try with default len, repeat with actual if not enough" logic (if (act_len > buf_len)) removed, this now looks like a regression compared to the current, fully correct implementation.

The C API docs say:

buffer_len – String buffer length, if buffer_len < out_len, you should re-allocate buffer
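
A minimal sketch of the "try with a default length, repeat with the actual length if that was not enough" pattern the docs describe. `FakeSaveModelToString` is a hypothetical stand-in that only mimics the `buffer_len` / `out_len` contract quoted above; it is not the real LightGBM function, and `kFakeModel` and `SaveModelWithRetry` are likewise illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical model text, standing in for a serialized booster.
static const std::string kFakeModel(3000, 'x');

// Hypothetical stand-in for a C API call with the buffer_len / out_len contract:
// it always reports the required size (including the trailing '\0') in out_len
// and only writes to out_str when the caller's buffer is large enough.
int FakeSaveModelToString(int64_t buffer_len, int64_t* out_len, char* out_str) {
  *out_len = static_cast<int64_t>(kFakeModel.size()) + 1;
  if (*out_len <= buffer_len) {
    std::memcpy(out_str, kFakeModel.c_str(), static_cast<std::size_t>(*out_len));
  }
  return 0;
}

// Try a default-sized buffer first; if the reported size is larger,
// re-allocate to exactly that size and call again.
std::string SaveModelWithRetry(int64_t initial_len) {
  assert(initial_len >= 0);
  int64_t out_len = 0;
  std::vector<char> buf(static_cast<std::size_t>(initial_len));
  FakeSaveModelToString(initial_len, &out_len, buf.data());
  if (out_len > initial_len) {
    buf.resize(static_cast<std::size_t>(out_len));
    FakeSaveModelToString(out_len, &out_len, buf.data());
  }
  return std::string(buf.data());
}
```

With a generous default such as 1024 * 1024 the second call is rarely needed, but keeping it is what makes very large models safe.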

OHHHH I see now, thank you for that explanation. I misunderstood the purpose of the code on the R side that was calling this function twice.

Ok yes you're right, that work needs to be done here. Will update it.

Alright, I've made these changes in recent commits.

I'm really glad you pointed this out, because it also made me realize an opportunity to allow larger feature names! Right now, the code in LGBM_DatasetGetFeatureNames_R on master will not allow any feature names longer than 256 characters.

LightGBM/R-package/src/lightgbm_R.cpp

Line 163 in f831808

const size_t reserved_string_size = 256;

LightGBM/R-package/src/lightgbm_R.cpp

Line 179 in f831808

CHECK_GE(reserved_string_size, required_string_size);

The code below throws an error on {lightgbm} 3.2.1, but works as of this branch.

    library(lightgbm)

    feature_names <- names(iris)
    long_name <- paste0(rep("a", 1000L), collapse = "")
    feature_names[1L] <- long_name
    names(iris) <- feature_names

    # check that feature name survived the trip from R to C++ and back
    dtrain <- lgb.Dataset(
        data = as.matrix(iris[, -5L])
        , label = as.numeric(iris$Species) - 1L
    )
    dtrain$construct()
    col_names <- dtrain$get_colnames()

    # Error in lgb.call(fun_name = fun_name, ret = buf, ..., buf_len, act_len) :
    #   [LightGBM] [Fatal] Check failed: (reserved_string_size) >= (required_string_size) at lightgbm_R.cpp, line 177 .

But it should be possible to! Based on this line in the C API header:

LightGBM/include/LightGBM/c_api.h

Line 295 in f831808

* \param[out] out_buffer_len String sizes required to do the full string copies

So I've updated the calls to LGBM_DatasetGetFeatureNames_R and LGBM_BoosterGetEvalNames_R to retry with a larger buffer on long names.
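
A minimal sketch of that retry, assuming a C-style call that fills one fixed-size slot per name and reports the largest size it needed. `FakeGetNames` is a hypothetical stand-in loosely shaped like the feature-name getter; it is not the real LightGBM signature, and `GetNamesWithRetry` is an illustrative name rather than the code in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a C API call that copies each name into a
// caller-owned slot of buffer_len bytes (truncating if needed) and reports
// the largest size required, including the trailing '\0'.
int FakeGetNames(const std::vector<std::string>& names, std::size_t buffer_len,
                 std::size_t* out_buffer_len, char** out_strs) {
  *out_buffer_len = 0;
  for (std::size_t i = 0; i < names.size(); ++i) {
    *out_buffer_len = std::max(*out_buffer_len, names[i].size() + 1);
    if (buffer_len > 0) {
      const std::size_t n = std::min(names[i].size(), buffer_len - 1);
      std::memcpy(out_strs[i], names[i].data(), n);
      out_strs[i][n] = '\0';
    }
  }
  return 0;
}

// Call once with the reserved slot size; if any name needed more room,
// grow every slot to the reported size and call again.
std::vector<std::string> GetNamesWithRetry(const std::vector<std::string>& names,
                                           std::size_t reserved) {
  std::vector<std::vector<char>> slots(names.size(), std::vector<char>(reserved));
  std::vector<char*> ptrs(names.size());
  for (std::size_t i = 0; i < names.size(); ++i) ptrs[i] = slots[i].data();

  std::size_t required = 0;
  FakeGetNames(names, reserved, &required, ptrs.data());
  if (required > reserved) {
    for (std::size_t i = 0; i < names.size(); ++i) {
      slots[i].resize(required);
      ptrs[i] = slots[i].data();  // resize may move the storage
    }
    FakeGetNames(names, required, &required, ptrs.data());
    assert(required <= slots.front().size());
  }

  std::vector<std::string> out;
  for (char* p : ptrs) out.emplace_back(p);
  return out;
}
```

Under these assumptions, a 1000-character name passed with a reserved size of 256 comes back intact after the second call instead of failing a size check.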

I've added tests to this PR to check that this is working as expected.

LGBM_DatasetGetFeatureNames_R(): test_dataset.R

LGBM_BoosterGetEvalNames_R(): I could not find a way to generate a large string value for this, but I might misunderstand how LGBM_BoosterGetEvalNames works. I opened "[docs] what should LGBM_BoosterGetEvalNames be used for?" (#4264) with a question.

LGBM_BoosterSaveModelToString_R(): test_lgb.Booster.R

LGBM_BoosterDumpModel_R(): test_lgb.Booster.R

> So I've updated the calls to LGBM_DatasetGetFeatureNames_R and LGBM_BoosterGetEvalNames_R to retry with a larger buffer on long names.

Great!
I think we can do the same for the Python wrapper.

LightGBM/python-package/lightgbm/basic.py

Lines 1896 to 1900 in f831808

    if reserved_string_buffer_size < required_string_buffer_size.value:
        raise BufferError(
            "Allocated feature name buffer size ({}) was inferior to the needed size ({})."
            .format(reserved_string_buffer_size, required_string_buffer_size.value)
        )

LightGBM/python-package/lightgbm/basic.py

Lines 3273 to 3277 in f831808

    if reserved_string_buffer_size < required_string_buffer_size.value:
        raise BufferError(
            "Allocated feature name buffer size ({}) was inferior to the needed size ({})."
            .format(reserved_string_buffer_size, required_string_buffer_size.value)
        )

LightGBM/python-package/lightgbm/basic.py

Lines 3472 to 3476 in f831808

    if reserved_string_buffer_size < required_string_buffer_size.value:
        raise BufferError(
            "Allocated eval name buffer size ({}) was inferior to the needed size ({})."
            .format(reserved_string_buffer_size, required_string_buffer_size.value)
        )

> I think we can do the same for the Python wrapper.

Sure! But I think it should be a separate PR. I'd really like to focus on finishing #3016 soon.

jameslamb · 2021-05-07T03:27:41Z

~~I'm still experimenting with this, so marked it WIP to block merging even though other checks passed.~~

ok, this is ready for review

jameslamb · 2021-05-07T18:20:37Z

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/821198971

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-a3cb7cb2e0a24887b228a5f4236e1139
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-0ef72af280ec41f490c3e6fcce6b16d2
Reports also have been sent to LightGBM public e-mail: http://www.yopmail.com/lightgbm_rhub_checks
Status: success ✔️.

jameslamb · 2021-05-07T18:20:51Z

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/821198127

Status: success ✔️.

StrikerRUS

Thank you very much! Looks great overall!
I left a few minor comments below.

R-package/src/lightgbm_R.h

R-package/tests/testthat/test_lgb.Booster.R

R-package/src/lightgbm_R.cpp

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

jameslamb · 2021-05-09T04:43:34Z

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/824536878

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-e88d6530114d47cba8c08d99946aa1c3
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-6c065f2869e04f8cadff9285970d28bc
Reports also have been sent to LightGBM public e-mail: http://www.yopmail.com/lightgbm_rhub_checks
Status: success ✔️.

jameslamb · 2021-05-09T04:43:42Z

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/824537273

Status: success ✔️.

StrikerRUS

Thanks for addressing comments!

StrikerRUS · 2021-05-09T11:20:12Z

Just posted an issue in the Sphinx repository for the failing Static Analysis / check-docs (pull_request) job: sphinx-doc/sphinx#9188.

github-actions · 2023-08-23T22:38:51Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added 6 commits May 4, 2021 23:09

[R-package] move creation of character vectors in some methods to C++ side

2aab103

convert LGBM_BoosterGetEvalNames_R

3d6e989

convert LGBM_BoosterDumpModel_R and LGBM_BoosterSaveModelToString_R

63856c2

remove debugging code

826ae22

update docs

1f07b05

remove comment

f84b84d

jameslamb changed the title ~~[R-package] move creation of character vectors in some methods to C++ side~~ WIP: [R-package] move creation of character vectors in some methods to C++ side May 5, 2021

jameslamb added the maintenance label May 5, 2021

jameslamb marked this pull request as ready for review May 5, 2021 14:40

jameslamb requested a review from Laurae2 as a code owner May 5, 2021 14:40

jameslamb changed the title ~~WIP: [R-package] move creation of character vectors in some methods to C++ side~~ [R-package] move creation of character vectors in some methods to C++ side May 5, 2021

jameslamb requested review from StrikerRUS and shiyu1994 May 5, 2021 14:40

jameslamb mentioned this pull request May 5, 2021

[docs][R-package] update docs on C++ interface #4257

Merged

StrikerRUS reviewed May 6, 2021

jameslamb added 3 commits May 6, 2021 20:18

Merge branch 'master' into r/encode-char

6e80210

add handling for larger model strings

147e1c2

handle large strings in feature and eval names

c272765

jameslamb changed the title ~~[R-package] move creation of character vectors in some methods to C++ side~~ WIP: [R-package] move creation of character vectors in some methods to C++ side May 7, 2021

got long feature names working

31ce4bd

jameslamb added the in progress label May 7, 2021

more fixes

1ca0f5e

jameslamb mentioned this pull request May 7, 2021

[docs] what should LGBM_BoosterGetEvalNames be used for? #4264

Closed

jameslamb added 2 commits May 7, 2021 12:15

linting

d5a47aa

resize

179bcec

jameslamb changed the title ~~WIP: [R-package] move creation of character vectors in some methods to C++ side~~ [R-package] move creation of character vectors in some methods to C++ side May 7, 2021

jameslamb removed the in progress label May 7, 2021

jameslamb added the awaiting review label May 7, 2021

jameslamb mentioned this pull request May 8, 2021

[R-package] manage Dataset and Booster handles as R external pointers (fixes #3016) #4265

Merged

StrikerRUS removed the awaiting review label May 8, 2021

StrikerRUS requested changes May 8, 2021

jameslamb and others added 3 commits May 8, 2021 23:28

Apply suggestions from code review

d5f5178

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Merge branch 'master' into r/encode-char

26599ea

stricter test

6430f62

StrikerRUS approved these changes May 9, 2021

StrikerRUS mentioned this pull request May 9, 2021

[ci][docs] Restrict Sphinx version #4267

Merged

Merge branch 'master' into r/encode-char

63d7acd

jameslamb merged commit c1d2dbe into microsoft:master May 9, 2021

jameslamb deleted the r/encode-char branch May 9, 2021 22:29

StrikerRUS mentioned this pull request May 15, 2021

[python] handle arbitrary length feature names in Python-package #4293

Merged

jameslamb mentioned this pull request Aug 25, 2021

[R-Package] Extremely long column names cause error "[LightGBM] [Fatal] Check failed: (reserved_string_size) >= (required_string_size) at lightgbm_R.cpp, line 177" #4556

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
