Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v24.1.x] cst: manual backport of chunk download changes PR 18278 #18854

Merged
merged 8 commits into from
Jun 10, 2024

Conversation

abhijat
Copy link
Contributor

@abhijat abhijat commented Jun 7, 2024

This is a manual backport of PR #18278. The conflicts were isolated to remote_segment_test.cc which was refactored in both the original PR and had a randomized test bucket name added.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

abhijat added 8 commits June 7, 2024 10:36
Chunk waiters are stored in a list which is swapped before processing.
Because chunk downloads take time, it is possible that new waiters may
be registered while some downloads are being processed. The remote
segment may receive a stop request during this interval. If this
happens, the hydration loop will shut down, not processing the
remaining waiters, and the waiters will prevent gate closure of remote
segment, causing a hang.

The workaround cancels pending downloads while exiting the background
loop.

(cherry picked from commit b06281e)
The chunk API is now structured around a background loop similar to the
remote segment. When a download request for a chunk arrives, it is
added to the wait list for the chunk. Then a condvar is triggered,
which wakes up the bg loop and request is issued to remote segment to
download the chunk.

Once the chunk is downloaded, all waiters are notified, which are also
notified when there is an error during download.

(cherry picked from commit e0d4fd7)
The prefetch logic is changed. Instead of downloading a single byte
range for chunk + prefetch and splitting them while writing to disk,
we now schedule separate API calls per chunk. This enables the
consumer to get unblocked faster as soon as the first chunk is
downloaded.

(cherry picked from commit 7bfb6b5)
A new test is added where some chunk downloads fail and the assertion
checks if the results are correctly reflected. The supporting tooling
to fail requests is also added to s3 imposter.

(cherry picked from commit 121e4e4)
Three new tests are added which test the chunk hydration process while
under disruptive actions. Because the bg loop structure does not yield
itself to unit testing easily, we use disruptive actions to test code
paths within the download loop. The following scenarios are added:

* Abruptly stop remote segment during pending downloads to ensure there
  are no hangs and exceptions are reported correctly.
* Delete random chunk files while a large number of random chunk
  hydrations are in progress. This emulates cache eviction and forces
  re-hydration.
* Randomly fail HTTP requests either with retryable or non retryable
  errors.

(cherry picked from commit 1f8efb8)
(cherry picked from commit 0b8cf80)
Even if the test code fails, the background thread must be stopped to
avoid a hung test.

(cherry picked from commit dc87016)
@abhijat abhijat merged commit 5cfe6b8 into redpanda-data:v24.1.x Jun 10, 2024
19 checks passed
@BenPope BenPope added this to the v24.1.8 milestone Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants