Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Fix R memory leaks (fixes #4282, fixes #3462) #4597

Merged
merged 8 commits into from
Sep 17, 2021

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Sep 6, 2021

Fixes #4282

This PR addresses a couple sources of potential memory leaks in the R interface:

  • Changes the logger to print R error messages instead of throwing errors directly, thus avoiding long jumps in the middle of C++ functions (which would be triggered by R errors).
  • Adds unwind protection around R allocations that happen after C++ allocations that need to call a destructor.
  • Expands the try-catch blocks to catch also C++ exceptions, which if thrown from an R-called C function can produce leaks in the best case and segfaults in the worst case.
    * Moves the order of header includes so as to have the R ones at the end as suggested in the R extensions manual, in order to avoid any possible misinitialization of non-R code.

@david-cortes
Copy link
Contributor Author

david-cortes commented Sep 6, 2021

@jameslamb I see the linter complains about system headers being included before R headers, but the R extensions manual recommends doing it like that (e.g. in section 6) and in the past it has happened that some compiler updates cause strange breakage in packages when R headers are included before system ones like the linter demands (and it will BTW have an issue with clang13 if any of those LGBM headers include omp.h).

Was this for some code aesthetic reason, or is there some issue with the order of header includes that the linter tries to prevent?

@david-cortes
Copy link
Contributor Author

Including <R.h> in utils/log.h causes compilation errors in windows due to conflicting function or macro names between R headers and other headers (Realloc and Free). This is most likely due to the header inclusion order.

I think it is enough with just declaring the function prototypes from <R.h> that will be used instead of including the full header (at least compiles and loads on my setup) but I'm not 100% sure about it.

@jameslamb
Copy link
Collaborator

🎉 thanks for this awesome contribution, really exciting!

I see the linter complains about system headers being included before R headers... is there some issue with the order of header includes that the linter tries to prevent?

After a little investigation, I see that you're referring to these specific errors from build https://github.com/microsoft/LightGBM/runs/3528466185?check_suite_focus=true.

R-package/src/lightgbm_R.cpp:15:  Found C system header after other header. Should be: lightgbm_R.h, c system, c++ system, other.  [build/include_order] [4]
R-package/src/lightgbm_R.cpp:16:  Found C system header after other header. Should be: lightgbm_R.h, c system, c++ system, other.  [build/include_order] [4]
R-package/src/lightgbm_R.cpp:17:  Found C system header after other header. Should be: lightgbm_R.h, c system, c++ system, other.  [build/include_order] [4]
R-package/src/lightgbm_R.cpp:18:  Found C system header after other header. Should be: lightgbm_R.h, c system, c++ system, other.  [build/include_order] [4]
R-package/src/lightgbm_R.cpp:20:  Found C system header after other header. Should be: lightgbm_R.h, c system, c++ system, other.  [build/include_order] [4]
R-package/src/lightgbm_R.cpp:24:  Found C system header after other header. Should be: lightgbm_R.h, c system, c++ system, other.  [build/include_order] [4]

Yes, cpplint tries to enforce the Google style guide for C++ code. See #3132 (comment)

Since it seems you've removed the header-reordering stuff from this PR, it seems those changes are not strictly related to this one, and could be proposed in a separate PR. If you think cpplint's decision here could lead to issues, we'd welcome a separate PR with the header-reordering for the R package, a link to the specific advice you're referring to from "Writing R Extensions", and a link to any other evidence you can provide that a different ordering is preferable. It's possible to add line-level comments to to suppress false positives from cpplint.

I think it is enough with just declaring the function prototypes from

I'll have to investigate a bit to see if just including the template for R_FlushConsole() in log.h works as expected, thanks for calling it out as worthy of extra attention.

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 7, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1207702250

Status: failure ❌.

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 7, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1207702678

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-6b3be38f3c0c4923ae65d045fd766ae6
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-32e2561ab48d47689fa9262731f9901d
Reports also have been sent to LightGBM public e-mail: http://www.yopmail.com/lightgbm_rhub_checks
Status: success ✔️.

@david-cortes
Copy link
Contributor Author

/gha run r-valgrind

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 7, 2021

@david-cortes thanks for looking into the valgrind issues. Right now only maintainers can trigger new builds of that check (and other comment-triggered actions in this repo), so I'll do that for you.

Alternatively, you can try running it manually using Docker. You can see the relevant setup at https://github.com/microsoft/LightGBM/blob/64f15005040aafa528bbec7f873364d94b7e36a4/.github/workflows/r_valgrind.yml

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 7, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1207890737

Status: failure ❌.

@StrikerRUS StrikerRUS added the fix label Sep 7, 2021
@jameslamb jameslamb changed the title [R-package] Fix R memory leaks [R-package] Fix R memory leaks (fixes #4282) Sep 7, 2021
@david-cortes
Copy link
Contributor Author

Well, valgrind this time does not report any "definitely lost" or any "indirectly lost" memory. The only thing that I see is a "possibly lost" which is the memory allocated by the C++ exception objects like std::runtime_error when they are thrown, and I'm not 100% sure that those are actual leaks since libstdc++ is also supposed to be managing exceptions (not sure where exactly and at which moment are the exception objects supposed to be destructed though).

There's also other "possibly lost" coming from GNU's openmp, and I'd guess those are false alarms since valgrind and libgomp don't play along. The RDvalgrind build is BTW not supposed to use openmp in packages.

It additionally has some additional warnings coming from outside the package, such as from R's grep, which may be false alarams, or may be actual cases of some R string being created from uninitialized memory:

==3480== Conditional jump or move depends on uninitialised value(s)
==3480==    at 0x49D0CB1: gregexpr_Regexc (grep.c:2429)
==3480==    by 0x49D39AD: do_regexpr (grep.c:3066)
==3480==    by 0x49A0D06: bcEval (eval.c:7128)
==3480==    by 0x498C390: Rf_eval (eval.c:740)
==3480==    by 0x498F0FF: R_execClosure (eval.c:1910)
==3480==    by 0x498EDB2: Rf_applyClosure (eval.c:1836)
==3480==    by 0x49A093A: bcEval (eval.c:7096)
==3480==    by 0x498C390: Rf_eval (eval.c:740)
==3480==    by 0x498BEDC: forcePromise (eval.c:568)
==3480==    by 0x4997059: FORCE_PROMISE (eval.c:5149)
==3480==    by 0x4997214: getvar (eval.c:5190)
==3480==    by 0x499DE53: bcEval (eval.c:6880)
==3480==  Uninitialised value was created by a stack allocation
==3480==    at 0x49D07B0: gregexpr_Regexc (grep.c:2359)
==3480== 

@david-cortes
Copy link
Contributor Author

After some more investigation, it seems in theory a standards-compliant compiler should not produce any memory leak from std::runtime_error + longjmp, but in practice some compilers (tried GCC and CLANG) may leak it if the jump is not within the same function call.

I've pushed a workaround which should fix it by using a global variable with a string buffer, which I guess should solve the "issue" but it's perhaps not as ideal (now it copies messages from a fixed buffer, to an exception, to another fixed buffer).

@david-cortes
Copy link
Contributor Author

david-cortes commented Sep 7, 2021

@jameslamb Another unrelated thing: the R examples and tests are not supposed to use more than 1 or 2 threads. One of the tests currently calls function LGBM_DatasetCreateFromCSC from the C api, whose function definition contains this line:

#pragma omp parallel for schedule(static)

... which would make it run with the maximum available number of threads.

I guess the proper solution would be to modify that C function to take the number of threads as a parameter, but that'd be bigger then this PR as it's also used by the other interfaces.

@jameslamb
Copy link
Collaborator

... maximum available number of threads

thanks for opening #4598 to document this outside of this PR!

After some more investigation... I've pushed a workaround

wow awesome, thanks! I'll trigger another valgrind build to test that.

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 7, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1210889597

Status: success ✔️.

@david-cortes
Copy link
Contributor Author

Looks like the memory leaks are fixed now. What remains from the valgrind logs (outside the scope of this PR) is to avoid multithreading in the examples (#4598) and to investigate the R string that's being used uninitialized (no idea about it and don't know which example is triggering it).

The other check that is failing is unrelated to this PR since it's about the Python version.

@jameslamb
Copy link
Collaborator

What remains from the valgrind logs...outside the scope of this PR

Yep, excellent! We've already gotten CRAN's agreement in the past that remaining "possibly lost" warning is allowable based on their policies, so it's intentionally allowed in our tests here.

==3478== 336 bytes in 1 blocks are possibly lost in loss record 147 of 2,542
==3478==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3478==    by 0x40149CA: allocate_dtv (dl-tls.c:286)
==3478==    by 0x40149CA: _dl_allocate_tls (dl-tls.c:532)
==3478==    by 0x5724322: allocate_stack (allocatestack.c:622)
==3478==    by 0x5724322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==3478==    by 0x56F2DEA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==3478==    by 0x56EA8E0: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==3478==    by 0x151B6C5B: LGBM_DatasetCreateFromCSC (c_api.cpp:1343)
==3478==    by 0x151E6284: LGBM_DatasetCreateFromCSC_R (lightgbm_R.cpp:135)
==3478==    by 0x49423D7: R_doDotCall (dotcode.c:627)
==3478==    by 0x494D38B: do_dotcall (dotcode.c:1284)
==3478==    by 0x498CA40: Rf_eval (eval.c:843)
==3478==    by 0x4992DAA: do_set (eval.c:2982)
==3478==    by 0x498C7E7: Rf_eval (eval.c:815)

# one error caused by a false positive between valgrind and openmp is allowed
# ==2063== 336 bytes in 1 blocks are possibly lost in loss record 153 of 2,709
# ==2063== at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
# ==2063== by 0x40149CA: allocate_dtv (dl-tls.c:286)
# ==2063== by 0x40149CA: _dl_allocate_tls (dl-tls.c:532)
# ==2063== by 0x5702322: allocate_stack (allocatestack.c:622)
# ==2063== by 0x5702322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
# ==2063== by 0x56D0DDA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
# ==2063== by 0x56C88E0: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
# ==2063== by 0x1544D29C: LGBM_DatasetCreateFromCSC (c_api.cpp:1286)
# ==2063== by 0x1546F980: LGBM_DatasetCreateFromCSC_R (lightgbm_R.cpp:91)
# ==2063== by 0x4941E2F: R_doDotCall (dotcode.c:634)
# ==2063== by 0x494CCC6: do_dotcall (dotcode.c:1281)
# ==2063== by 0x499FB01: bcEval (eval.c:7078)
# ==2063== by 0x498B67F: Rf_eval (eval.c:727)
# ==2063== by 0x498E414: R_execClosure (eval.c:1895)
bytes_possibly_lost=$(
cat ${VALGRIND_LOGS_FILE} \
| grep -E "possibly lost\: .*" \
| sed 's/^.*possibly lost\: \(.*\) bytes.*$/\1/' \
| tr -d ","
)
echo "valgrind found ${bytes_possibly_lost} bytes possibly lost"
if [[ ${bytes_possibly_lost} -gt 336 ]]; then
exit -1
fi

The other check that is failing is unrelated to this PR since it's about the Python version.

yep, you are right! We have some ongoing issues in CI with a build that tries to emulate ARM64 architectures using QEMU. #4594 tracks it, if you're curious.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me, thanks so much for this awesome contribution!!!!!

I'm extra happy that it looks like we'll be able to get this into the next release (#4310) 😍

How I tested this

  1. Checked CI logs for new warnings or other...didn't see any.

  2. Tried running one of the tests that triggers an error from the C++ side (the tests whose testthat::skip() were removed in this PR). Confirmed that the expected output is printed to the R console (both in RStudio and running a script from a terminal with Rscript), and that an R exception is raised.

image

  1. triggered Solaris and valgrind CI checks. Both passed.

Given all that, I'm very confident in this fix and really happy to accept it! But I'd like one more review (@StrikerRUS, whenever you have time) before we merged since this change touches the core of the R package.

Extra links to help in reviewing:

@jameslamb
Copy link
Collaborator

@david-cortes sorry for the inconvenience, but could you please merge the latest master into this branch? That should fix the unrelated CI failures, now that we've merged #4600.

@jameslamb jameslamb mentioned this pull request Sep 10, 2021
21 tasks
@jameslamb
Copy link
Collaborator

jameslamb commented Sep 10, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1221525972

Status: success ✔️.

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 10, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1221527013

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-a657d172ea9f425f875bc5fdfecf0eb7
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-4c1a730c2902433cb334da64ed888e0d
Reports also have been sent to LightGBM public e-mail: https://yopmail.com?lightgbm_rhub_checks
Status: success ✔️.

@david-cortes
Copy link
Contributor Author

On second thought, these changes would still leave open a very small chance of leaking memory if a continuation token fails to allocate, which should be very improbable but is in theory possible. I’ve modified it now to avoid potentially leaking if that happens, would be good to run valgrind again just in case.

@david-cortes
Copy link
Contributor Author

david-cortes commented Sep 12, 2021

@jameslamb I’m fairly certain that this r-rchk test is throwing a false alarm. It doesn’t seem to work correctly when R dynamic objects are managed through C++ constructors and destructors.

Nvermind, found a way around it by avoiding C++-type code.

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Sep 12, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1227299319

Status: success ✔️.

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Sep 12, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1227300060

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-fc601db5613e465cadbf604135889147
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-ad71d7c79f4d4b398b43d850dc9f80f7
Reports also have been sent to LightGBM public e-mail: https://yopmail.com?lightgbm_rhub_checks
Status: success ✔️.

Comment on lines 38 to 39
catch(std::exception& ex) { LGBM_R_save_exception_msg(ex); goto throw_R_errormsg; } \
catch(std::string& ex) { LGBM_R_save_exception_msg(ex); goto throw_R_errormsg; } \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid using goto?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. Was added just for clarity.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my own knowledge @StrikerRUS , what is problematic about goto?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb I'm one of those who are against goto due to it leads to spaghetti code in general.
https://stackoverflow.com/a/3517765
https://en.wikipedia.org/wiki/Goto#Criticism

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it, thanks!

@StrikerRUS
Copy link
Collaborator

@Laurae2 Are you able to provide a review for this PR?

@StrikerRUS StrikerRUS changed the title [R-package] Fix R memory leaks (fixes #4282) [R-package] Fix R memory leaks (fixes #4282, fixes #3462) Sep 12, 2021
@jameslamb
Copy link
Collaborator

jameslamb commented Sep 12, 2021

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1227502757

Status: success ✔️.

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 12, 2021

/gha run r-solaris

Workflow Solaris CRAN check has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/1227504430

solaris-x86-patched: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-e240edd218b3453a883d13cf510fdda7
solaris-x86-patched-ods: https://builder.r-hub.io/status/lightgbm_3.2.1.99.tar.gz-fd9108e77e55446abc77862be514083c
Reports also have been sent to LightGBM public e-mail: https://yopmail.com?lightgbm_rhub_checks
Status: success ✔️.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️ LGTM, thanks a lot for the great help! I don't have any comments or suggestions for this PR, but I think that formal approval should be done by R-package maintainer.

@jameslamb
Copy link
Collaborator

I think that formal approval should be done by R-package maintainer.

thanks! I'll review the new changes tonight

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just re-reviewed the changes and the all look good to me!

@david-cortes thank you SO MUCH for picking up this work and for working with us to keep it compliant with CRAN's requirements around Solaris and valgrind. I'm very excited that we'll be able to get this one into the upcoming 3.3.0 release, and I definitely couldn't have figured this out by myself.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R-package] R memory leaks
3 participants