test_sharded_ingest timeouts during pageserver shutdown #9740
Comments
It works just fine manually at least:
I'll dig a bit further.
It's trying to stop it twice -- first with SIGTERM, then with SIGQUIT. This won't work, because there's a common signal handler for SIGTERM and SIGQUIT, and the SIGTERM got there first so it's busy with that. The test runner may expect it to be responsive to both. I'll check.
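The two-step stop described above can be sketched as follows. This is a minimal illustration, not the code in `neon_fixtures.py`: `stop_process`, `spawn_sleeper`, and the `timeout` parameter are hypothetical names, and the mapping of `immediate=True` to `SIGKILL` is taken from the issue description rather than from the fixture source. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def spawn_sleeper() -> subprocess.Popen:
    """Start a stand-in child process that only sleeps (placeholder for a real server)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def stop_process(proc: subprocess.Popen, immediate: bool = False, timeout: float = 10.0) -> int:
    """Stop `proc` and return its exit status (negative signal number if killed by a signal)."""
    if immediate:
        # Assumption from the issue text: immediate mode should not wait for a graceful exit.
        proc.send_signal(signal.SIGKILL)
        return proc.wait()

    # Graceful path: ask politely first.
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate. If the process ignores this second signal, the wait below never
        # returns, which is the failure mode discussed in this thread.
        proc.send_signal(signal.SIGQUIT)
        return proc.wait()
```

With a child that uses the default signal dispositions, `stop_process(spawn_sleeper(), immediate=True)` returns `-signal.SIGKILL`, and the graceful path returns `-signal.SIGTERM`. The escalation branch only helps if the target still reacts to signals after the first one has arrived.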
Ok, so neon/test_runner/fixtures/neon_fixtures.py Lines 1238 to 1246 in 5330122
## Problem

`test_sharded_ingest` ingests a lot of data, which can cause shutdown to be slow, e.g. due to local "S3 uploads" or compactions. This can cause test flakes during teardown.

Resolves #9740.

## Summary of changes

Perform an immediate shutdown of the cluster.
## Problem

The Pageserver signal handler would only respond to a single signal and initiate shutdown. Subsequent signals were ignored. This meant that a `SIGQUIT` sent after a `SIGTERM` had no effect (e.g. in the case of a slow or stalled shutdown). The `test_runner` uses this to force shutdown if graceful shutdown is slow.

Touches #9740.

## Summary of changes

Keep responding to signals after the initial shutdown signal has been received.

Arguably, the `test_runner` should also use `SIGKILL` rather than `SIGQUIT` in this case, but it seems reasonable to respond to `SIGQUIT` regardless.
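The difference between the old and new handler behavior can be modeled with a small state machine. This is a toy sketch, not the Pageserver's Rust implementation: `ShutdownSignals`, `Action`, and `replay` are hypothetical names, and the mapping of `SIGTERM` to graceful shutdown and `SIGQUIT` to immediate exit is an assumption based on the description above.

```python
import signal
from enum import Enum


class Action(Enum):
    GRACEFUL_SHUTDOWN = "graceful_shutdown"
    IMMEDIATE_EXIT = "immediate_exit"
    IGNORED = "ignored"


class ShutdownSignals:
    """Toy model of a handler shared by SIGTERM and SIGQUIT (hypothetical)."""

    def __init__(self, keep_listening: bool) -> None:
        # keep_listening=False models the old behavior: only the first signal counts.
        self.keep_listening = keep_listening
        self.shutdown_started = False

    def on_signal(self, signum: int) -> Action:
        if self.shutdown_started and not self.keep_listening:
            return Action.IGNORED  # old behavior: later signals are dropped
        if signum == signal.SIGQUIT:
            self.shutdown_started = True
            return Action.IMMEDIATE_EXIT  # assumed meaning of SIGQUIT
        if signum == signal.SIGTERM:
            if self.shutdown_started:
                return Action.IGNORED  # graceful shutdown is already in progress
            self.shutdown_started = True
            return Action.GRACEFUL_SHUTDOWN
        return Action.IGNORED


def replay(keep_listening: bool, signums: list) -> list:
    """Feed a sequence of signals to a fresh handler and collect its decisions."""
    handler = ShutdownSignals(keep_listening)
    return [handler.on_signal(s) for s in signums]
```

Replaying `[signal.SIGTERM, signal.SIGQUIT]` with `keep_listening=False` yields a graceful shutdown followed by an ignored signal, which matches the stalled teardown in this issue. With `keep_listening=True` the second signal results in an immediate exit.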
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9678/11802652055/index.html#suites/c4b5b5cc329a950d1fae768ab6cbdaf5/6f2d1e56efd29eba
The backtrace in the test code indicates that it's setting `immediate=True`, which should be a SIGKILL, but the pageserver logs look like a graceful shutdown and it reports getting a SIGTERM.

This test might be exposing a bug in how we do shutdowns in test teardown?