Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyln-testing: Fix BitcoinD FD leak #7669

Conversation

s373nZ
Copy link
Contributor

@s373nZ s373nZ commented Sep 14, 2024

Fix BitcoinD FD leak in pyln-testing

Description

This PR adds a post-yield resolver sanity-check to the bitcoind test fixture to be sure file descriptors are not leaking.

Related Issues

Changes Made

  • Feature: Brief description of the new feature or functionality added.
  • Bug Fix: Brief description of the bug fixed and how it was resolved.
  • Refactor: Any code improvements or refactoring done without changing the functionality.

Checklist

Ensure the following tasks are completed before submitting the PR:

  • Changelog has been added in relevant commit/s.
  • Tests have been added or updated to cover the changes.
  • Documentation has been updated as needed.
  • Any relevant comments or TODOs have been addressed or removed.

Additional Notes

Please let me know if there is a more elegant way to do this, or more appropriate place to put it.

@s373nZ s373nZ force-pushed the 7130-fix-pyln-testing-bitcoind-fd-leak branch from 79ed2f7 to 9b37592 Compare September 14, 2024 19:20
@s373nZ
Copy link
Contributor Author

s373nZ commented Sep 14, 2024

Refactored the file closing operations into a function on TailableProc and call it from the bitcoind fixture. Plus, another successful CI run would be nice...

@s373nZ s373nZ force-pushed the 7130-fix-pyln-testing-bitcoind-fd-leak branch 2 times, most recently from f680adf to 8ada6a6 Compare September 16, 2024 10:59
@s373nZ s373nZ marked this pull request as ready for review September 16, 2024 11:53
CHANGELOG.md Outdated Show resolved Hide resolved
@s373nZ s373nZ force-pushed the 7130-fix-pyln-testing-bitcoind-fd-leak branch from 8ada6a6 to 187bd82 Compare September 17, 2024 10:30
…sProject#7130])

Changelog-Fixed: pyln-testing: Fix file descriptor leak in bitcoind fixture. ([ElementsProject#7130])
@s373nZ s373nZ force-pushed the 7130-fix-pyln-testing-bitcoind-fd-leak branch from 187bd82 to e35e3c5 Compare September 17, 2024 10:35
@s373nZ
Copy link
Contributor Author

s373nZ commented Sep 17, 2024

Rebased and removed modifications to CHANGELOG for Changelog-Fixed: hint in commit message.

@ShahanaFarooqui
Copy link
Collaborator

ShahanaFarooqui commented Sep 17, 2024

@s373nZ Sorry if I overlooked it while jumping between 7669, 7108, 7130, and 7310 :).
But could you please remind why the cleanup process is not added to the finalizer or the stop function?

I am testing it with:

  • 1st terminal: make pytest
  • 2nd terminal: Get the process id with ps aux | grep pytest
  • 3rd terminal: lsof -p ${PID} | wc -l

And it seems to be working (tests are still running at 32% :-|).

@s373nZ
Copy link
Contributor Author

s373nZ commented Sep 18, 2024

@ShahanaFarooqui That's a lot of jumping (and a lot of 7s :). I should have done a better job at consolidating this information into the PR description.

Context

For some context, I ran into this file descriptor issue a few months ago when working on a PR and attempting to run the make pytest locally. The leak produced seemingly random test failures that weren't alleviated until I lifted the ulimit. New to CLN development, I was troubleshooting the issue then under the assumption that the LightningD subprocess was responsible, and tried a wide variety of fixes. Returning to the issue recently in hopes that a fix might stabilize the CI tests somewhat, fresh eyes revealed it is actually the BitcoinD process.

To address your questions directly:

could you please remind why the cleanup process is not added to the #7130 (comment)

My reading of the Pytest documentation suggested by @cdecker in that comment, specifically the preceding section titled Teardown/Cleanup (AKA Fixture finalization), suggests that yield fixtures are recommended and Any teardown code for that fixture is placed after the yield. The bitcoind fixture is already implemented as a yield fixture, so closing the files as a cleanup operation performed after the yield statement in the current implementation aligns with best practice recommendations as I understand it.

The following section in the docs, Adding finalizers directly reads to apply typically to non-yield fixtures. Seeing as this is actually an issue with every test case which leverages the bitcoind fixture (each leaves four fds open in a (deleted) state throughout the entire pytest suite run), I understand the file closing operations to be necessarily bundled with the yield fixture. I think at the time, @cdecker and I thought that the leak might have affected a subset of test cases and were exploring addfinalizer() as a targeted solution.

In short, the implementation in this PR is indeed using finalizers, but in accordance with yield fixtures rather than addfinalizer.

...or the stop function?

This is a good question, and one I tried to implement initially. TailableProc is inherited by both BitcoinD and LightningD. There are a number of existing tests which stop() and start() the Lightning nodes, and continue to make use of the files throughout the tests to validate log messages. If we close these files when the daemon is stopped, these tests break. Additionally, there are a number of global "teardown checks" in the node_factory fixture itself which depend on the log files to remain open. After trying a wide variety of implementations using stop() and even moving the file open() calls out of the initializer into start(), it became apparent that this approach was too invasive with the current test suite.

Now that it's discovered BitcoinD is the leak culprit, there is still a small mystery as to why LightningD isn't leaking file descriptors in the same manner. My best understanding at the moment is that its daemon is started in a subprocess using Popen() here, and according to the docs:

If close_fds is true, all file descriptors except 0, 1 and 2 will be closed before the child process is executed. Otherwise when close_fds is false, file descriptors obey their inheritable flag as described in Inheritance of File Descriptors.

Changed in version 3.2: The default for close_fds was changed from False to what is described above.

So, I believe the subprocess is taking care of closing LightningD's file descriptors by default, but BitcoinD doesn't get the same treatment.

Alternative implementations

  • While closing the files in stop() doesn't seem to be the right solution, there may be other options. Another idea was to implement it in the TailableProc destructor __del__(), but I'm pretty sure I tried that last March and didn't get it working.

  • Also, I considered the name cleanup() for the callback function in TailableProc, which is a more general name. However, that function would be implicitly overridden by LightningD here when called by LightningNode here. This could probably work, too, but it was preferable to avoid unexpected behavior by injecting assumptions into the existing inheritance hierarchy.

Hope this helps! I haven't mastered Pytest, Pyln, or the CLN test suite by any means, so please let me know if you have recommendations for a better fit solution.

@s373nZ
Copy link
Contributor Author

s373nZ commented Sep 18, 2024

Also, I have one tangential question, if you would. When referencing issues in the git commit messages, I've been using the Github issue number that describes the problem both in the first line of the commit message and in the Changelog-*: value. Should the number reference the PR instead? If so, let me know and I'll amend my commit messages.

@ShahanaFarooqui
Copy link
Collaborator

Also, I have one tangential question, if you would. When referencing issues in the git commit messages, I've been using the Github issue number that describes the problem both in the first line of the commit message and in the Changelog-*: value. Should the number reference the PR instead? If so, let me know and I'll amend my commit messages.

Typically, including the issue number in commit messages is useful for tracking purposes. It is recommended to reference issues using related, fixes, or closes in the PR description as well.

CHANGELOG list update:

Before each release, the changelog.py script generates entries based on commits merged after the last tag.

For instance, the script will produce below entry for your commit message, "Changelog-Fixed: pyln-testing: Fix file descriptor leak in bitcoind fixture.":

...
### Fixed
 - pyln-testing: Fix file descriptor leak in bitcoind fixture. ([#7669])

... 
[#7669]: https://github.com/ElementsProject/lightning/pull/7669

This way, your changes are automatically included in the appropriate section of the changelog.

@ShahanaFarooqui
Copy link
Collaborator

ACK e35e3c5.

@ShahanaFarooqui
Copy link
Collaborator

I will merge the PR after final review from @cdecker.

@ShahanaFarooqui ShahanaFarooqui added this to the v24.11 milestone Oct 4, 2024
@ShahanaFarooqui ShahanaFarooqui merged commit cbe94bc into ElementsProject:master Oct 4, 2024
38 checks passed
@s373nZ s373nZ deleted the 7130-fix-pyln-testing-bitcoind-fd-leak branch October 14, 2024 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pyln-testing: BitcoinD TailableProc leaking file descriptors
2 participants