Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: pageserver_ondisk_layers metric #3775

Closed
wants to merge 10 commits into from
Closed

fix: pageserver_ondisk_layers metric #3775

wants to merge 10 commits into from

Conversation

koivunej
Copy link
Member

@koivunej koivunej commented Mar 9, 2023

Current pageserver_ondisk_layers is wrong, see issue. To fix that, we need trait PersistentLayer in LayerMap. Solve this need by introducing a wrapper MeasuredLayerMap where the metrics are updated. Additionally an existing test which does evictions is modified to check this fixed metric (couldn't find an existing test case and nothing broke).

Additionally has three unrelated fixes.

Cc: #3705 (separate from adding a metric for remote layers)

allows us to keep the layermap clean of these concerns.
@koivunej
Copy link
Member Author

koivunej commented Mar 9, 2023

Need to check if we have any test for this metric, if not, fix that and then last split the metric up.

@koivunej
Copy link
Member Author

koivunej commented Mar 9, 2023

@SomeoneToIgnore do you feel this approach of deref-abuse to delegate is acceptable with LayerMap which doesn't implement PartialEq and friends?

@koivunej
Copy link
Member Author

koivunej commented Mar 9, 2023

deref-abuse

Did a poll internally, no one is horrified.

@koivunej koivunej force-pushed the fix_ondisk_metric branch from 856b187 to e1b9005 Compare March 9, 2023 12:51
@koivunej koivunej marked this pull request as ready for review March 9, 2023 12:54
@koivunej koivunej requested review from a team as code owners March 9, 2023 12:54
@koivunej koivunej requested review from knizhnik and hlinnaka and removed request for a team March 9, 2023 12:54
@koivunej
Copy link
Member Author

koivunej commented Mar 9, 2023

merge-allure-reports failed debug?

+ aws s3 cp --only-show-errors s3://neon-github-public-dev/reports/pr-3775/lock.txt ./lock.txt
fatal error: An error occurred (404) when calling the HeadObject operation: Key "reports/pr-3775/lock.txt" does not exist

pageserver_http.get_metrics().query_one("pageserver_ondisk_layers").value
)

# assumption: floats here are small enough to compare with integers safely
Copy link
Member Author

@koivunej koivunej Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the value < 2**53 and value.is_integer() dance on the other PR, maybe should move that to Sample which is returned by this query_one, something like ... def value_as_int(self) on a follow-up.

Comment on lines +657 to +660
assert total_populated_layers == post_eviction_total_layers + 4
# corrected with remotes_after because only 3 out of 4 seem to be usually
# required for layer creation
assert post_compaction_total_layers == total_populated_layers + 1 - remotes_after
Copy link
Member Author

@koivunej koivunej Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These asserts are quite awful after all.

total_populated_layers == 4 (deltas before creating image)
post_eviction_total_layers == 0 (evicted all layers)
post_compaction_total_layers == 4 + 1 - N (N = remotes not needed for imaging)

should these be put in as constants and let's see how long they last? :) Feels like that's adding work to konstantin's current PRs but at the same time I am unsure if these will change.

Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the resident_physical_size gauge into the MeasuredLayerMap as well?

(If you still want to attempt a Collector-based approach, that should be a separate PR)

pageserver/src/tenant/layer_map.rs Show resolved Hide resolved
@koivunej
Copy link
Member Author

koivunej commented Mar 16, 2023

It's non-trivial to move all affected metrics in (I wasn't expecting there to be many other) so drafting while waiting for time to look at this.

@koivunej
Copy link
Member Author

koivunej commented Apr 14, 2023

Mentioned for generation/version number which would be handy in number of cases: https://github.com/neondatabase/neon/pull/4005/files#r1165309034

@koivunej
Copy link
Member Author

Sad to leave this out because of scope creep.

@koivunej koivunej closed this Apr 25, 2023
@koivunej koivunej deleted the fix_ondisk_metric branch April 25, 2023 15:00
koivunej added a commit that referenced this pull request Apr 26, 2023
I didn't get through #3775 fast enough so we wanted to remove this
metric.

Fixes #3705.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants