test_sharded_ingest timeouts during pageserver shutdown #9740

Closed
jcsp opened this issue Nov 13, 2024 · 3 comments · Fixed by #9984

@jcsp
Collaborator

jcsp commented Nov 13, 2024

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9678/11802652055/index.html#suites/c4b5b5cc329a950d1fae768ab6cbdaf5/6f2d1e56efd29eba

The backtrace in the test code indicates that it sets immediate=True, which should result in a SIGKILL, but the pageserver logs look like a graceful shutdown and report receiving a SIGTERM:

2024-11-12T18:53:55.701009Z  INFO Got signal SIGTERM. Terminating gracefully in fast shutdown mode

This test might be exposing a bug in how we do shutdowns in test teardown?

@jcsp jcsp added the t/bug Issue Type: Bug label Nov 13, 2024
@erikgrinaker
Contributor

It works just fine manually at least:

$ kill -s QUIT 45673
$ tail .neon/pageserver_1/pageserver.log
2024-12-02T15:48:40.018876Z  INFO Got signal SIGQUIT. Terminating in immediate shutdown mode

$ cargo neon pageserver stop --id 1 -m immediate
Stopping pageserver with pid 46608 immediately...
pageserver stopped
$ tail .neon/pageserver_1/pageserver.log
2024-12-02T15:50:05.741718Z  INFO Got signal SIGQUIT. Terminating in immediate shutdown mode

I'll dig a bit further.

@erikgrinaker
Contributor

It's trying to stop it twice -- first with SIGTERM, then with SIGQUIT. This won't work, because there's a common signal handler for SIGTERM and SIGQUIT, and the SIGTERM got there first so it's busy with that.

The test runner may expect it to be responsive to both. I'll check.

2024-11-12 18:54:05.712 INFO [neon_cli.py:137] Run ['/tmp/neon/bin/neon_local', 'pageserver', 'stop', '--id=1'] failed:
  stdout:
    Stopping pageserver with pid 1441 gracefully.......
    pageserver has not stopped yet, continuing to wait.....
  stderr:
    pageserver stop failed: pageserver with pid 1441 did not stop in 10s seconds

2024-11-12 18:54:05.713 INFO [neon_cli.py:72] Running command "/tmp/neon/bin/neon_local storage_broker stop"
2024-11-12 18:54:05.820 INFO [neon_cli.py:72] Running command "/tmp/neon/bin/neon_local pageserver stop --id=1 -m immediate"
2024-11-12 18:54:15.841 INFO [neon_cli.py:137] Run ['/tmp/neon/bin/neon_local', 'pageserver', 'stop', '--id=1', '-m', 'immediate'] failed:
  stdout:
    Stopping pageserver with pid 1441 immediately.......
    pageserver has not stopped yet, continuing to wait.....
  stderr:
    pageserver stop failed: pageserver with pid 1441 did not stop in 10s seconds

@erikgrinaker
Contributor

Ok, so NeonEnv.stop() first does a graceful SIGTERM stop of pageservers, then an immediate SIGQUIT stop if the first one fails. I'll fix the signal handler to still be responsive to SIGQUIT after receiving a SIGTERM.

try:
    pageserver.stop(immediate=immediate)
except RuntimeError:
    stop_later.append(pageserver)
self.broker.stop()
# TODO: for nice logging we need python 3.11 ExceptionGroup
for ps in stop_later:
    ps.stop(immediate=True)
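
The graceful-then-immediate escalation described above can be reproduced outside the test suite. The following is a minimal, POSIX-only sketch, not the pageserver or `neon_local` code: the child script, the `stop_with_escalation` and `run_demo` helpers, the 0.5s grace period, and the exit status 3 are all hypothetical stand-ins. The child stays responsive to SIGQUIT while its simulated graceful shutdown is still in progress, which is the behaviour the test runner relies on.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child process: a stand-in for a server whose graceful
# shutdown is slow. It is NOT the pageserver.
CHILD = textwrap.dedent("""
    import os, signal, time

    terminating = False

    def on_signal(signum, frame):
        global terminating
        if signum == signal.SIGQUIT:
            # Immediate mode: exit right away, skipping any cleanup.
            os._exit(3)
        # Graceful mode: only record the request; the main flow does the work.
        terminating = True

    signal.signal(signal.SIGTERM, on_signal)
    signal.signal(signal.SIGQUIT, on_signal)
    print("ready", flush=True)
    while not terminating:
        time.sleep(0.05)
    time.sleep(60)  # simulated slow shutdown; SIGQUIT still interrupts it
""")


def stop_with_escalation(proc, grace_seconds):
    """Send SIGTERM, then escalate to SIGQUIT if the process is still alive."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGQUIT)
        return proc.wait(timeout=5)


def run_demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    # Wait until the child has installed its handlers before signalling it.
    assert proc.stdout.readline().strip() == "ready"
    code = stop_with_escalation(proc, grace_seconds=0.5)
    proc.stdout.close()
    return code


if __name__ == "__main__":
    print("child exit status:", run_demo())
```

If the child instead stopped listening after the first signal, the second `wait` would time out, which matches the "did not stop in 10s seconds" failures in the log above.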

github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
## Problem

`test_sharded_ingest` ingests a lot of data, which can cause shutdown to
be slow e.g. due to local "S3 uploads" or compactions. This can cause
test flakes during teardown.

Resolves #9740.

## Summary of changes

Perform an immediate shutdown of the cluster.
github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
## Problem

The Pageserver signal handler would only respond to a single signal and
initiate shutdown. Subsequent signals were ignored. This meant that a
`SIGQUIT` sent after a `SIGTERM` had no effect (e.g. in the case of a
slow or stalled shutdown). The `test_runner` uses this to force shutdown
if graceful shutdown is slow.

Touches #9740.

## Summary of changes

Keep responding to signals after the initial shutdown signal has been
received.

Arguably, the `test_runner` should also use `SIGKILL` rather than
`SIGQUIT` in this case, but it seems reasonable to respond to `SIGQUIT`
regardless.
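
To illustrate the difference between the two behaviours, here is a small model in Python. It is a sketch only: the real handler lives in the Pageserver (Rust), and the function names `action_for`, `handle_first_only`, and `handle_all` are hypothetical. The mode names mirror the log lines quoted in the issue ("fast" for `SIGTERM`, "immediate" for `SIGQUIT`).

```python
import signal


def action_for(sig):
    """Map a shutdown signal to the shutdown mode named in the logs."""
    if sig == signal.SIGQUIT:
        return "immediate"
    if sig == signal.SIGTERM:
        return "fast"
    raise ValueError(f"unexpected signal: {sig!r}")


def handle_first_only(signals):
    """Model of the old behaviour: act on one signal, ignore the rest."""
    for sig in signals:
        return [action_for(sig)]
    return []


def handle_all(signals):
    """Model of the new behaviour: keep responding after the first signal."""
    actions = []
    for sig in signals:
        mode = action_for(sig)
        actions.append(mode)
        if mode == "immediate":
            break  # an immediate shutdown ends the process at this point
    return actions


if __name__ == "__main__":
    sequence = [signal.SIGTERM, signal.SIGQUIT]
    print(handle_first_only(sequence))
    print(handle_all(sequence))
```

In this model a `SIGQUIT` that follows a `SIGTERM` is dropped by `handle_first_only` but escalates the shutdown in `handle_all`.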