test suite: eliminate bug class "stray processes after test exits" #6485

problame · 2024-01-26T10:02:00Z

Problem

Some tests leave stray processes behind after they exit.

This is a potential root cause of failed coverage-report generation, as well as other flakiness.

DoD

The Python test suite ensures that after each test function exits, there are no stray subprocesses left.

If there are any, the processes' argv are listed at WARNING level and the test fails, preventing the PR from being merged.
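The check described in the DoD could be sketched roughly as follows. This is a minimal, Linux-only illustration that walks `/proc`; the helper names (`find_stray_descendants`, `assert_no_stray_processes`) are hypothetical and not taken from the test suite. It only sees processes that are still descendants of the test runner, so a process that daemonizes and is reparented elsewhere would be missed.

```python
import logging
import os
from pathlib import Path
from typing import Dict, List, Optional

log = logging.getLogger(__name__)


def _read_ppid(pid: int) -> Optional[int]:
    """Return the parent PID of `pid` from /proc, or None if it vanished."""
    try:
        stat = Path(f"/proc/{pid}/stat").read_text(errors="replace")
    except OSError:
        return None
    # The comm field may contain spaces or ')', so split after the last ')'.
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[1])  # fields[0] is the state, fields[1] is the ppid


def _read_argv(pid: int) -> List[str]:
    """Return the argv of `pid`, or [] if it vanished or is a zombie."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        return []
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def find_stray_descendants(root_pid: int) -> Dict[int, List[str]]:
    """Map PID -> argv for every process descended from `root_pid`."""
    children: Dict[int, List[int]] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        ppid = _read_ppid(int(entry))
        if ppid is not None:
            children.setdefault(ppid, []).append(int(entry))

    strays: Dict[int, List[str]] = {}
    stack = list(children.get(root_pid, []))
    while stack:
        pid = stack.pop()
        strays[pid] = _read_argv(pid)
        stack.extend(children.get(pid, []))
    return strays


def assert_no_stray_processes(root_pid: Optional[int] = None) -> None:
    """Log each stray's argv at WARNING level, then fail if any exist."""
    strays = find_stray_descendants(os.getpid() if root_pid is None else root_pid)
    for pid, argv in strays.items():
        log.warning("stray process after test: pid=%d argv=%r", pid, argv)
    if strays:
        raise AssertionError(f"{len(strays)} stray process(es) left behind")
```

In a pytest suite, `assert_no_stray_processes()` could be called after the `yield` of an autouse fixture so that it runs once each test function has finished.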

Related Issues

Work

Implementation


Epic: #6485. Before this PR, some tests would leak child processes. We found them using the approach in #6470. This PR fixes the findings on their own because PR #6470 is being delayed due to security concerns.

jcsp · 2024-01-29T11:19:37Z

Update:

The cgroup approach bumped up against security concerns.
Putting this on pause.

problame · 2024-02-06T14:57:47Z

Paused until https://neondb.slack.com/archives/C059ZC138NR/p1707229781663129?thread_ts=1706265241.134019&cid=C059ZC138NR is resolved

## Problem

The step that merges coverage data recently became too flaky. This failure blocks staging deployment and, along with the flakiness of regression tests, might require 4 to 6 manual restarts of a CI job.

Refs:
- #4540
- #6485
- https://neondb.slack.com/archives/C059ZC138NR/p1704131143740669

## Summary of changes

- Disable code coverage report for functional tests

koivunej · 2024-05-06T13:20:19Z

An alternative I've been thinking about, in a much more hacky way:

make the pytest runner process a subprocess reaper
after each test, report and kill any remaining children, and fail the test

This would have a lot more insecurity than the original cgroup idea, and I am unsure about the license of the existing ctypes-based prctl bindings. We could create this utility in Rust quite easily.

The upside is that this requires no privileges, as far as I know.
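To make the subreaper idea concrete, here is a rough, Linux-only sketch that calls `prctl` directly through `ctypes` rather than through any existing bindings. The constants come from `<linux/prctl.h>`; the function names and the `spare` parameter are hypothetical, and scanning `/proc` for children is just one assumed way to enumerate them.

```python
import ctypes
import os
import signal
from typing import Iterable, List, Optional

PR_SET_CHILD_SUBREAPER = 36  # values from <linux/prctl.h>
PR_GET_CHILD_SUBREAPER = 37

_libc = ctypes.CDLL(None, use_errno=True)
_ZERO = ctypes.c_ulong(0)


def become_subreaper() -> None:
    """Make orphaned descendants reparent to this process instead of init."""
    rc = _libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), _ZERO, _ZERO, _ZERO)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def is_subreaper() -> bool:
    """Report whether the subreaper attribute is set on this process."""
    flag = ctypes.c_int(0)
    rc = _libc.prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag), _ZERO, _ZERO, _ZERO)
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return flag.value == 1


def _parent_of(pid: int) -> Optional[int]:
    try:
        with open(f"/proc/{pid}/stat", errors="replace") as f:
            stat = f.read()
    except OSError:
        return None
    # Split after the last ')' because the comm field may contain spaces.
    return int(stat.rsplit(")", 1)[1].split()[1])


def direct_children(parent_pid: int) -> List[int]:
    """List PIDs whose parent is `parent_pid`, including reparented orphans."""
    return [
        int(entry)
        for entry in os.listdir("/proc")
        if entry.isdigit() and _parent_of(int(entry)) == parent_pid
    ]


def kill_and_reap_children(spare: Iterable[int] = ()) -> List[int]:
    """SIGKILL and reap every direct child not listed in `spare`."""
    spared = set(spare)
    killed = []
    for pid in direct_children(os.getpid()):
        if pid in spared:
            continue
        try:
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
        except (ProcessLookupError, ChildProcessError):
            continue  # already gone or reaped by someone else
        killed.append(pid)
    return killed
```

A runner using this would call `become_subreaper()` once at startup, then after each test call `kill_and_reap_children()`, passing in `spare` the PIDs it still legitimately owns, and fail the test if the returned list is non-empty.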

After #8655 we've had a few issues (mostly tracked on #8708) with the graceful shutdown. To shut down more of the processes and catch more errors (for example, from all pageservers), do an immediate shutdown for those nodes which fail the initial (possibly graceful) shutdown. Cc: #6485
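The fallback described above might look roughly like this. It is only a sketch: the `Stoppable` interface, its `stop(immediate=...)` method, and `stop_all` are hypothetical names for illustration, and the real test-suite classes may differ.

```python
from typing import Iterable, List, Protocol, Tuple


class Stoppable(Protocol):
    """Hypothetical node interface; not the test suite's actual API."""

    name: str

    def stop(self, immediate: bool) -> None: ...


def stop_all(nodes: Iterable[Stoppable]) -> List[Tuple[str, Exception]]:
    """Try a graceful stop on every node; fall back to an immediate stop.

    Keeps going after a failure so that every node is attempted, and
    returns all collected errors instead of raising on the first one.
    """
    errors: List[Tuple[str, Exception]] = []
    for node in nodes:
        try:
            node.stop(immediate=False)
        except Exception as graceful_error:
            errors.append((node.name, graceful_error))
            try:
                node.stop(immediate=True)
            except Exception as immediate_error:
                errors.append((node.name, immediate_error))
    return errors
```

Collecting the errors rather than raising early means a failure in one node does not leave the remaining nodes running, and the caller can still fail the test if the returned list is non-empty.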

koivunej · 2024-08-16T06:32:19Z

With #8742, it would appear that, for a single run, we no longer leak processes, possibly as a result of #8714.

problame added the c/storage/pageserver (Component: storage: pageserver), t/Epic (Issue type: Epic), and a/test (Area: related to testing) labels Jan 26, 2024

problame assigned bayandin, problame and koivunej Jan 26, 2024

problame mentioned this issue Jan 26, 2024

feat(test suite): use cgroups to detect if a test leaks processes #6470

Open

problame mentioned this issue Jan 26, 2024

fix(test suite): some tests leak child processes #6497

Merged

bayandin mentioned this issue Feb 18, 2024

CI: temporary disable coverage report for regression tests #6798

Merged

5 tasks

koivunej mentioned this issue Aug 13, 2024

test: do better job of shutting everything down #8714

Merged

koivunej mentioned this issue Aug 16, 2024

CI(build-and-test): collect code-coverage for regression tests #8742

Draft

5 tasks
