Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: productionize gc-compaction #9114

Open
14 of 28 tasks
skyzh opened this issue Sep 23, 2024 · 4 comments
Open
14 of 28 tasks

Epic: productionize gc-compaction #9114

skyzh opened this issue Sep 23, 2024 · 4 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic

Comments

@skyzh
Copy link
Member

skyzh commented Sep 23, 2024

In #8002, we already finished the functionalities of gc-compaction, and run it successfully over some small tenants in staging. The next step is to run it on larger tenants and production tenants. This involves work to improve the gc-compaction process and add new features.

Functionality

This ensures that the compaction process can clean-up data (that couldn't be cleaned up in legacy compaction) and yield to more important compaction job if it will take a long time.

Misc

Testing

  • Run on large staging tenants
  • Run on inactive production tenants
  • Run on active production tenants
  • Enable auto trigger
@skyzh skyzh added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic labels Sep 23, 2024
@skyzh skyzh self-assigned this Sep 23, 2024
skyzh added a commit that referenced this issue Oct 17, 2024
part of #9114

## Summary of changes

gc-compaction may take a lot of disk space, and if it does, the caller
should do a partial gc-compaction. This patch adds space check for the
compaction job.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh
Copy link
Member Author

skyzh commented Oct 21, 2024

This week: hopefully get #9048 + #9134 merged

skyzh added a commit that referenced this issue Oct 24, 2024
…#9493)

part of #9114,
#8836,
#8362

The split layer writer code can be used in a more general way: the
caller puts unfinished writers into the batch layer writer and let batch
layer writer to ensure the atomicity of the layer produces.

## Summary of changes

* Add batch layer writer, which atomically finishes the layers.
`BatchLayerWriter::finish` is simply a copy-paste from previous split
layer writers.
* Refactor split writers to use the batch layer writer.
* The current split writer tests cover all code path of batch layer
writer.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
skyzh added a commit that referenced this issue Oct 29, 2024
…#9134)

part of #8921,
#9114

## Summary of changes

We start the partial compaction implementation with the image layer
partial generation. The partial compaction API now takes a key range. We
will only generate images for that key range for now, and remove layers
fully included in the key range after compaction.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
skyzh added a commit that referenced this issue Nov 11, 2024
The final patch for partial compaction, part of
#9114, close
#8921 (note that we didn't
implement parallel compaction or compaction scheduler for partial
compaction -- currently this needs to be scheduled by using a Python
script to split the keyspace, and in the future, automatically split
based on the key partitioning when the pageserver wants to trigger a
gc-compaction)

## Summary of changes

* Update the layer selection algorithm to use the same selection as full
compaction (everything intersect/below gc horizon)
* Update the layer selection algorithm to also generate a list of delta
layers that need to be rewritten
* Add the logic to rewrite delta layers and add them back to the layer
map
* Update test case to do partial compaction on deltas

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
skyzh added a commit that referenced this issue Nov 12, 2024
I had an impression that gc-compaction didn't test the case where the
first record of the key history is will_init because of there are some
code path that will panic in this case. Luckily it got fixed in
#9026 so we can now implement
such tests.

Part of #9114

## Summary of changes

* Randomly changed some images into will_init neon wal record
* Split `test_simple_bottom_most_compaction_deltas` into two test cases,
one of them has the bottom layer as delta layer with will_init flags,
while the other is the original one with image layers.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
skyzh added a commit that referenced this issue Nov 18, 2024
close #9552, close
#8920, part of
#9114

## Summary of changes

* Drop keys not belonging to this shard during gc-compaction to avoid
constructing history that might have been truncated during shard
compaction.
* Run gc-compaction at the end of shard compaction test.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
skyzh added a commit that referenced this issue Nov 19, 2024
)

part of #9114, we want to be
able to run partial gc-compaction in tests. In the future, we can also
expand this functionality to legacy compaction, so that we can trigger
compaction for a specific key range.

## Summary of changes

* Support passing compaction key range through pageserver routes.
* Refactor input parameters of compact related function to take the new
`CompactOptions`.
* Add tests for partial compaction. Note that the test may or may not
trigger compaction based on GC horizon. We need to improve the test case
to ensure things always get below the gc_horizon and the gc-compaction
can be triggered.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
github-merge-queue bot pushed a commit that referenced this issue Dec 5, 2024
## Problem

part of #9114

gc-compaction can take a long time. This patch adds support for
scheduling a gc-compaction job. The compaction loop will first handle
L0->L1 compaction, and then gc compaction. The scheduled jobs are stored
in a non-persistent queue within the tenant structure.

This will be the building block for the partial compaction trigger -- if
the system determines that we need to do a gc compaction, it will
partition the keyspace and schedule several jobs. Each of these jobs
will run for a short amount of time (i.e, 1 min). L0 compaction will be
prioritized over gc compaction.

## Summary of changes
 
* Add compaction scheduler in tenant.
* Run scheduled compaction in integration tests.
* Change the manual compaction API to allow schedule a compaction
instead of immediately doing it.
* Add LSN upper bound as gc-compaction parameter. If we schedule partial
compactions, gc_cutoff might move across different runs. Therefore, we
need to pass a pre-determined gc_cutoff beforehand. (TODO: support LSN
lower bound so that we can compact arbitrary "rectangle" in the layer
map)
* Refactor the gc_compaction internal interface.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
github-merge-queue bot pushed a commit that referenced this issue Dec 6, 2024
## Problem

part of #9114, stacked PR
over #9809

The compaction scheduler now schedules partial compaction jobs.

## Summary of changes

* Add the compaction job splitter based on size.
* Schedule subcompactions using the compaction scheduler.
* Test subcompaction scheduler in the smoke regress test.
* Temporarily disable layer map checks

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
@problame
Copy link
Contributor

problame commented Dec 9, 2024

This week:

  • testing on staging
  • figure out design for automatic trigger

github-merge-queue bot pushed a commit that referenced this issue Dec 9, 2024
## Problem

close #10049, close
#10030, close
#8861

part of #9114

The legacy gc process calls `get_latest_gc_cutoff`, which uses a Rcu
different than the gc_info struct. In the gc_compaction_smoke test case,
the "latest" cutoff could be lower than the gc_info struct, causing
gc-compaction to collect data that could be accessed by
`latest_gc_cutoff`. Technically speaking, there's nothing wrong with
gc-compaction using gc_info without considering latest_gc_cutoff,
because gc_info is the source of truth. But anyways, let's fix it.

## Summary of changes

* gc-compaction uses `latest_gc_cutoff` instead of gc_info to determine
the gc horizon.
* if a gc-compaction is scheduled via tenant compaction iteration, it
will take the gc_block lock to avoid racing with functionalities like
detach ancestor (if it's triggered via manual compaction API without
scheduling, then it won't take the lock)

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Dec 12, 2024
## Problem

part of #9114, stacked PR
over #9897, partially
refactored to help with
#10031

## Summary of changes

* gc-compaction takes `above_lsn` parameter. We only compact the layers
above this LSN, and all data below the LSN are treated as if they are on
the ancestor branch.
* refactored gc-compaction to take `GcCompactJob` that describes the
rectangular range to be compacted.
* Added unit test for this case.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
@skyzh
Copy link
Member Author

skyzh commented Dec 16, 2024

This week:

  • full run in staging after resolving all bugs
  • automatic trigger implementation

github-merge-queue bot pushed a commit that referenced this issue Dec 18, 2024
…tests (#10164)

## Problem

part of #9114

In #10127 we fixed the race,
but we didn't add the errors to the allowlist.

## Summary of changes

* Allow repartition errors in the gc-compaction smoke test.

I think it might be worth to refactor the code to allow multiple threads
getting a copy of repartition status (i.e., using Rcu) in the future.

Signed-off-by: Alex Chi Z <chi@neon.tech>
github-merge-queue bot pushed a commit that referenced this issue Dec 18, 2024
## Problem

We cannot get the size of the compaction queue and access the info.

Part of #9114 

## Summary of changes

* Add an API endpoint to get the compaction queue.
* gc_compaction test case now waits until the compaction finishes.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
github-merge-queue bot pushed a commit that referenced this issue Dec 18, 2024
## Problem

In #8103 we changed the test
case to have more test coverage of gc_compaction. Now that we have
`test_gc_compaction_smoke`, we can revert this test case to serve its
original purpose and revert the parameter changes.

part of #9114

## Summary of changes

* Revert pitr_interval from 60s to 10s.
* Assert the physical/logical size ratio in the benchmark.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Dec 19, 2024
…#10044)

## Problem

In #9897 we temporarily
disabled the layer valid check because the current one only considers
the end result of all compaction algorithms, but partial gc-compaction
would temporarily produce an "invalid" layer map.

part of #9114

## Summary of changes

Allow LSN splits to overlap in the slow path check. Currently, the valid
check is only used in storage scrubber (background job) and during
gc-compaction (without taking layer lock). Therefore, it's fine for such
checks to be a little bit inefficient but more accurate.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Dec 19, 2024
…10209)

## Problem

close #10208
part of #9114 

## Summary of changes

* Ensure remote `latest_gc_cutoff` is up-to-date before removing any
files for gc-compaction.

Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh
Copy link
Member Author

skyzh commented Dec 20, 2024

This week: did a staging run among all root timelines >= 1GB. I spent some time fixing bugs so I didn't make it for all pageservers, and only did it for pageserver-27.

After the holiday: finish up the automatic trigger work. Find a way to run experiments in the background without keeping an HTTP connection open. (May need to add new APIs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic
Projects
None yet
Development

No branches or pull requests

2 participants