Fix error message if PS connection is lost while receiving prefetch (#9923)

If the pageserver connection is lost while receiving the response to a
prefetch request, the prefetch queue is cleared. The error message prints
the values from the prefetch slot, but because the slot has already been
cleared by then, they are all zeros:

LOG: [NEON_SMGR] [shard 0] No response from reading prefetch entry 0:
0/0/0.0 block 0. This can be caused by a concurrent disconnect

To fix, make local copies of the values.
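
As an illustration of the fix described above, here is a minimal, self-contained sketch of the "snapshot before the call" pattern. The `Slot` struct, `receive_with_disconnect()` and `read_slot()` are hypothetical, simplified stand-ins and not the real Neon types or functions; the stub simply assumes the worst case, where the slot is zeroed during the receive call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real prefetch structures. */
typedef enum { SLOT_UNUSED = 0, SLOT_REQUESTED } SlotStatus;

typedef struct
{
    SlotStatus  status;
    uint64_t    ring_index;
    uint32_t    block_no;
    int         shard_no;
} Slot;

/*
 * Assumed failure mode: the connection drops during receive, the queue is
 * flushed (the slot is zeroed), and no response is returned.
 */
static const char *
receive_with_disconnect(Slot *slot)
{
    memset(slot, 0, sizeof(*slot));
    return NULL;
}

/* Returns false and writes a diagnostic into buf if no response arrived. */
static bool
read_slot(Slot *slot, char *buf, size_t buflen)
{
    assert(slot->status == SLOT_REQUESTED);

    /* Snapshot the fields needed for the message BEFORE the risky call. */
    uint64_t    ring_index = slot->ring_index;
    uint32_t    block_no = slot->block_no;
    int         shard_no = slot->shard_no;

    const char *response = receive_with_disconnect(slot);

    if (response == NULL)
    {
        /* Use only the local copies here; *slot may have been cleared. */
        snprintf(buf, buflen,
                 "[shard %d] No response from reading prefetch entry %llu: block %u",
                 shard_no, (unsigned long long) ring_index,
                 (unsigned int) block_no);
        return false;
    }
    return true;
}
```

Had the message been formatted from `slot->...` after the call, every field would read as zero, which is the symptom shown in the log line above.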

In passing, also add a sanity check that the prefetch slot is still
intact if the receive() call succeeds.
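
The sanity check can be pictured as a small predicate over the slot state. The sketch below is only an illustration under assumed names: `PSlot`, `PSlotStatus` and `slot_intact_after_receive()` are hypothetical, and the real code reports the failure through `neon_shard_log(..., ERROR, ...)` rather than returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified mirror of the fields the check looks at. */
typedef enum { PSLOT_UNUSED = 0, PSLOT_REQUESTED, PSLOT_RECEIVED } PSlotStatus;

typedef struct
{
    PSlotStatus status;
    const void *response;
    uint64_t    my_ring_index;
} PSlot;

/*
 * After a successful receive, the slot should still be waiting for its
 * response and should be the one at the receive position of the ring.
 */
static bool
slot_intact_after_receive(const PSlot *slot, uint64_t ring_receive)
{
    assert(slot != NULL);

    return slot->status == PSLOT_REQUESTED &&
           slot->response == NULL &&
           slot->my_ring_index == ring_receive;
}
```

A caller would evaluate this right after the receive call returns a response and before it updates any counters, so that a slot flushed or reused in the meantime is detected rather than silently overwritten.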
hlinnaka authored Dec 6, 2024
1 parent c42c28b commit b6eea65
Showing 1 changed file with 30 additions and 5 deletions.
35 changes: 30 additions & 5 deletions pgxn/neon/pagestore_smgr.c
@@ -610,6 +610,9 @@ prefetch_read(PrefetchRequest *slot)
 {
     NeonResponse *response;
     MemoryContext old;
+    BufferTag buftag;
+    shardno_t shard_no;
+    uint64 my_ring_index;

     Assert(slot->status == PRFS_REQUESTED);
     Assert(slot->response == NULL);
@@ -623,11 +626,29 @@ prefetch_read(PrefetchRequest *slot)
              slot->status, slot->response,
              (long)slot->my_ring_index, (long)MyPState->ring_receive);

+    /*
+     * Copy the request info so that if an error happens and the prefetch
+     * queue is flushed during the receive call, we can print the original
+     * values in the error message
+     */
+    buftag = slot->buftag;
+    shard_no = slot->shard_no;
+    my_ring_index = slot->my_ring_index;
+
     old = MemoryContextSwitchTo(MyPState->errctx);
-    response = (NeonResponse *) page_server->receive(slot->shard_no);
+    response = (NeonResponse *) page_server->receive(shard_no);
     MemoryContextSwitchTo(old);
     if (response)
     {
+        /* The slot should still be valid */
+        if (slot->status != PRFS_REQUESTED ||
+            slot->response != NULL ||
+            slot->my_ring_index != MyPState->ring_receive)
+            neon_shard_log(shard_no, ERROR,
+                           "Incorrect prefetch slot state after receive: status=%d response=%p my=%lu receive=%lu",
+                           slot->status, slot->response,
+                           (long) slot->my_ring_index, (long) MyPState->ring_receive);
+
         /* update prefetch state */
         MyPState->n_responses_buffered += 1;
         MyPState->n_requests_inflight -= 1;
@@ -642,11 +663,15 @@ prefetch_read(PrefetchRequest *slot)
     }
     else
     {
-        neon_shard_log(slot->shard_no, LOG,
+        /*
+         * Note: The slot might no longer be valid, if the connection was lost
+         * and the prefetch queue was flushed during the receive call
+         */
+        neon_shard_log(shard_no, LOG,
                        "No response from reading prefetch entry %lu: %u/%u/%u.%u block %u. This can be caused by a concurrent disconnect",
-                       (long)slot->my_ring_index,
-                       RelFileInfoFmt(BufTagGetNRelFileInfo(slot->buftag)),
-                       slot->buftag.forkNum, slot->buftag.blockNum);
+                       (long) my_ring_index,
+                       RelFileInfoFmt(BufTagGetNRelFileInfo(buftag)),
+                       buftag.forkNum, buftag.blockNum);
         return false;
     }
 }

1 comment on commit b6eea65

@github-actions

7040 tests run: 6726 passed, 5 failed, 309 skipped (full report)


Failures on Postgres 17

Failures on Postgres 15

Failures on Postgres 14

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_pageserver_gc_compaction_smoke[release-pg14] or test_pageserver_gc_compaction_smoke[release-pg15] or test_pageserver_gc_compaction_smoke[release-pg15] or test_pageserver_gc_compaction_smoke[release-pg17] or test_pageserver_gc_compaction_smoke[release-pg17]"
Flaky tests (9)

Postgres 17

Postgres 16

Postgres 15

Postgres 14

Test coverage report is not available

The comment gets automatically updated with the latest test results
b6eea65 at 2024-12-06T21:45:09.170Z :recycle:
