Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test suite: eliminate bug class "stray processes after test exits" #6485

Open
1 of 6 tasks
problame opened this issue Jan 26, 2024 · 4 comments
Open
1 of 6 tasks

test suite: eliminate bug class "stray processes after test exits" #6485

problame opened this issue Jan 26, 2024 · 4 comments
Assignees
Labels
a/test Area: related to testing c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic

Comments

@problame
Copy link
Contributor

problame commented Jan 26, 2024

Problem

Some tests leave stay processes behind after they exit.

This is the potential root cause for failed coverage-report generation, as well as other flakiness.

DoD

The Python test suite ensures that after each test function exit, there are no stray subprocesses left.

If there are any, the processes' argv are listed at WARNING level and the test fails, preventing the PR from being merged.

Related Issues

Work

@problame problame added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic a/test Area: related to testing labels Jan 26, 2024
problame added a commit that referenced this issue Jan 26, 2024
Epic: #6485

Before this PR, some tests would leak child processes.
We found them using the approach in
#6470.

This PR fixes the findings because PR#6470 is being delayed due to
security concerns.
@jcsp
Copy link
Collaborator

jcsp commented Jan 29, 2024

Update:

  • The cgroup approach bumped up against securiy concerns.
  • Putting on pause.

@problame
Copy link
Contributor Author

problame commented Feb 6, 2024

bayandin added a commit that referenced this issue Feb 19, 2024
## Problem

The merging coverage data step recently started to be too flaky.
This failure blocks staging deployment and along with the flakiness of
regression tests might require 4-5-6 manual restarts of a CI job.

Refs:
- #4540
- #6485
- https://neondb.slack.com/archives/C059ZC138NR/p1704131143740669

## Summary of changes
- Disable code coverage report for functional tests
@koivunej
Copy link
Member

koivunej commented May 6, 2024

An alternative I've been thinking about, much more hacky way:

  • make the pytest runner process a subprocess reaper
  • after each test report and kill any children and fail

This would have a lot more insecurity than the original cgroup idea, and I am unsure about the license of the existing ctypes based prctl bindings. We could create this utility in rust quite easily.

Upside is that this requires no priviledges as far as I know.

koivunej added a commit that referenced this issue Aug 13, 2024
After #8655 we've had a few issues (mostly tracked on #8708) with the
graceful shutdown. In order to shutdown more of the processes and catch
more errors, for example, from all pageservers, do an immediate shutdown
for those nodes which fail the initial (possibly graceful) shutdown.

Cc: #6485
@koivunej
Copy link
Member

With #8742 it would appear that for a single run, we would no longer leak processes, possibly affected by #8714.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/test Area: related to testing c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic
Projects
None yet
Development

No branches or pull requests

4 participants