Epic: Improve sk<->ps connection observability #7002

Open
5 tasks
Tracked by #9329
petuhovskiy opened this issue Mar 4, 2024 · 0 comments
Labels
c/storage/pageserver Component: storage: pageserver c/storage/safekeeper Component: storage: safekeeper t/Epic Issue type: Epic


Motivation

During one of the deploys, we saw some projects stuck for several minutes. There were errors like this in the logs:

query handler for 'basebackup X Y 0/1EF9A10 --gzip' failed: Timed out while waiting for WAL record at LSN 0/1EF9A10 to arrive, last_record_lsn 0/14EEA60 disk consistent LSN=0/14EEA60, WalReceiver status: Not active

I tried to find something relevant in the logs and metrics, but they were mostly empty and gave no hints.

DoD

I think we should add more context to the logs, add more metrics, and print the broker status in the logs.

Implementation ideas

  • Initialize manager_status right after WalReceiver creation
  • Don't write WalReceiver status: Not active; instead, write the timestamp of the last message received from the broker
  • Add global metrics about the broker to the pageserver (whether the pageserver is connected to the broker, and when the last message was received from the broker)
  • When the pageserver disconnects from the safekeeper, the safekeeper should log the first and last LSN that was sent/received
  • The safekeeper should log the hostname/NodeId of the connected pageserver
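
A rough sketch of the broker-status idea, to make the second and third bullets more concrete. Everything here is an assumption for illustration: `BrokerTracker` and its methods are hypothetical names, not existing pageserver code, and the real implementation would likely export these values through the existing metrics machinery rather than plain getters.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for the pageserver's broker subscription.
/// Not an existing pageserver type; names are illustrative only.
#[derive(Debug, Default)]
pub struct BrokerTracker {
    connected: bool,
    last_message_at: Option<Instant>,
}

impl BrokerTracker {
    /// Called when the broker subscription is established.
    pub fn on_connect(&mut self) {
        self.connected = true;
    }

    /// Called when the broker subscription is lost.
    pub fn on_disconnect(&mut self) {
        self.connected = false;
    }

    /// Called for every message received from the broker.
    pub fn on_message(&mut self, now: Instant) {
        self.connected = true;
        self.last_message_at = Some(now);
    }

    /// Gauge-style value: 1 when connected, 0 otherwise.
    pub fn connected_gauge(&self) -> u64 {
        self.connected as u64
    }

    /// Time since the last broker message, or None if none was ever received.
    pub fn since_last_message(&self, now: Instant) -> Option<Duration> {
        self.last_message_at
            .map(|t| now.saturating_duration_since(t))
    }

    /// Text to put into the timeout error instead of a bare "Not active".
    pub fn describe(&self, now: Instant) -> String {
        let conn = if self.connected { "connected" } else { "disconnected" };
        match self.since_last_message(now) {
            Some(age) => format!("broker {conn}, last message {}s ago", age.as_secs()),
            None => format!("broker {conn}, no messages received yet"),
        }
    }
}
```

With something like this, the timeout error could include the output of `describe(Instant::now())`, which would distinguish "the broker never sent us anything" from "the broker went quiet some time ago". The same two values (`connected_gauge` and the age of the last message) could back the global metrics.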

