pageserver: suppress error logs in shutdown/detach #4876

Merged: 2 commits, Aug 2, 2023

Collaborator

@jcsp jcsp commented Aug 2, 2023

Problem

Error messages like these come up during normal operations:

        Compaction failed, retrying in 2s: timeline is Stopping

       Compaction failed, retrying in 2s: Cannot run compaction iteration on inactive tenant

Summary of changes

Add explicit handling for the shutdown case in these locations, to suppress error logs.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

jcsp added 2 commits August 2, 2023 14:19
This suppresses logs like this:
```
    Compaction failed, retrying in 2s: timeline is Stopping
```

...by introducing an explicit ShuttingDown error and using it instead
of a generic error.
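
A minimal sketch of the idea in this commit, not the actual pageserver code: a dedicated shutdown variant lets the retry loop tell an expected shutdown apart from a real failure. The `CompactionError` enum shape, the `Other` variant and the `log_line_for` helper are assumptions made for illustration; only the idea of an explicit `ShuttingDown` error comes from the commit message.

```rust
use std::fmt;

/// Hypothetical error type: only the dedicated shutdown variant is taken
/// from the commit message, the rest is assumed for illustration.
#[derive(Debug)]
enum CompactionError {
    /// The timeline or tenant is shutting down: expected, not worth an error log.
    ShuttingDown,
    /// Any other failure, which should still be logged and retried.
    Other(String),
}

impl fmt::Display for CompactionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CompactionError::ShuttingDown => write!(f, "shutting down"),
            CompactionError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

/// Returns the line the retry loop would log, or `None` when the result
/// should stay silent (success or an expected shutdown).
fn log_line_for(result: &Result<(), CompactionError>) -> Option<String> {
    match result {
        Ok(()) => None,
        Err(CompactionError::ShuttingDown) => None,
        Err(e) => Some(format!("Compaction failed, retrying in 2s: {e}")),
    }
}

fn main() {
    let results = [
        Ok(()),
        Err(CompactionError::ShuttingDown),
        Err(CompactionError::Other("disk full".to_string())),
    ];
    for r in &results {
        // Only the genuine failure produces a log line.
        if let Some(line) = log_line_for(r) {
            eprintln!("{line}");
        }
    }
}
```

Matching on the variant, rather than on the error text, keeps the suppression independent of how the message is worded.
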
This fixes spurious log lines like:
```
   Compaction failed, retrying in 2s: Cannot run compaction iteration on inactive tenant
```

compaction_loop was doing its first-iteration random_init_delay, but
not checking the CancellationToken between that and calling into
compaction_iteration, which expects to only be called when the
timeline is in an active state.  We should re-check cancellation
token after sleeping in case we were asked to shut down in the
interim.
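
The re-check described in this commit can be sketched roughly as follows. This is a synchronous, std-only illustration under stated assumptions: `CancelFlag` is a hypothetical stand-in for the `CancellationToken` named above, the delay is fixed rather than random, and `compaction_loop_sketch` is not the real `compaction_loop`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a cancellation token (assumption: the real
/// loop is async and uses a proper token type).
#[derive(Clone, Default)]
struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Sleeps for the initial delay, then runs up to `max_iterations`
/// iterations. Returns how many iterations actually ran.
fn compaction_loop_sketch(
    cancel: &CancelFlag,
    init_delay: Duration,
    max_iterations: usize,
    mut iteration: impl FnMut(),
) -> usize {
    // First-iteration delay (random in the original, fixed here).
    thread::sleep(init_delay);

    let mut ran = 0;
    for _ in 0..max_iterations {
        // Re-check after sleeping: a shutdown request may have arrived in
        // the interim, and the iteration expects an active timeline.
        if cancel.is_cancelled() {
            break;
        }
        iteration();
        ran += 1;
    }
    ran
}

fn main() {
    let cancel = CancelFlag::default();
    let canceller = cancel.clone();
    // Simulate a shutdown request arriving during the initial delay.
    let handle = thread::spawn(move || canceller.cancel());
    let ran = compaction_loop_sketch(&cancel, Duration::from_millis(50), 3, || {
        println!("compaction iteration");
    });
    handle.join().unwrap();
    println!("iterations run: {ran}");
}
```
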
@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Aug 2, 2023

github-actions bot commented Aug 2, 2023

1264 tests run: 1213 passed, 0 failed, 51 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_crafted_wal_end[last_wal_record_crossing_segment]: debug

@jcsp jcsp marked this pull request as ready for review August 2, 2023 15:16
@jcsp jcsp requested review from a team as code owners August 2, 2023 15:16
@jcsp jcsp requested review from lubennikovaav, skyzh, problame and koivunej and removed request for a team, lubennikovaav and skyzh August 2, 2023 15:16
Member

@skyzh skyzh left a comment

create_image_layers / compact_level0 might also take a long time to complete and cause some failures. In the future, it would be good to pass the cancellation to them and return CompactionError::Shutdown correspondingly.
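
One possible shape for that follow-up, as a sketch only: the long-running step receives the cancellation signal and checks it between units of work. `process_layers` is a hypothetical stand-in for functions like create_image_layers, the `AtomicBool` stands in for the cancellation token, and the enum below is assumed, not the real error type.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Assumed error type, using the variant name from the comment above.
#[derive(Debug, PartialEq)]
enum CompactionError {
    Shutdown,
}

/// Hypothetical long-running step: checks for cancellation before each
/// unit of work and reports how many layers it processed.
fn process_layers(layers: &[u32], cancel: &AtomicBool) -> Result<u32, CompactionError> {
    let mut processed = 0;
    for _layer in layers {
        if cancel.load(Ordering::SeqCst) {
            // Stop early with a dedicated error instead of a generic one.
            return Err(CompactionError::Shutdown);
        }
        // Real work on the layer would happen here.
        processed += 1;
    }
    Ok(processed)
}

fn main() {
    let cancel = AtomicBool::new(false);
    println!("{:?}", process_layers(&[1, 2, 3], &cancel));

    cancel.store(true, Ordering::SeqCst);
    println!("{:?}", process_layers(&[1, 2, 3], &cancel));
}
```
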

@jcsp jcsp merged commit df49a9b into main Aug 2, 2023
@jcsp jcsp deleted the jcsp/compaction-log-hygiene branch August 2, 2023 18:31
@@ -103,6 +103,11 @@ async fn compaction_loop(tenant: Arc<Tenant>, cancel: CancellationToken) {
}
}

if cancel.is_cancelled() {
Member

This is now a bit racy. Adding the error is good, and we should filter against such values instead when logging near lines L114R119.

Member

Never mind, I missed that we turn this into an Ok at the outermost fn. This check is then just extra lines, correct?

Also, there are still too many info! calls; zero are needed when shutting down, but this PR at least gets rid of one extra stacktrace.

Collaborator Author

I tend to agree about the info! being a bit too verbose; I put it in here for consistency with the other cancel checks in the function.

The functional impact of this check is just to avoid calling into compaction_iteration (which emits an error log when state isn't active), rather than calling in there, logging error, returning and then dropping out of this loop further down when we next enter a timeout.

@jcsp jcsp changed the title pagekeeper: suppress error logs in shutdown/detach pageserver: suppress error logs in shutdown/detach Aug 2, 2023