log backup: use global checkpoint ts as source of truth #58135

Merged
merged 7 commits into from
Dec 13, 2024

Conversation

3pointer
Contributor

@3pointer 3pointer commented Dec 10, 2024

What problem does this PR solve?

Issue Number: close #58031

Problem Summary:

The previous lag calculation used c.lastCheckpoint.TS. This is unreliable, especially when ownership changes, because c.lastCheckpoint.TS is not guaranteed to increase monotonically. This PR addresses the issue by introducing a global checkpoint timestamp that is monotonically non-decreasing.

What changed and how does it work?

The lag calculation now uses a global checkpoint timestamp instead of c.lastCheckpoint.TS. The global timestamp either increases or stays the same, even during ownership transitions, so the lag computed from it does not jump backwards when the owner changes.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix the issue that a log backup task is unexpectedly paused when a new advancer owner starts, because the last checkpoint ts is equal to the start ts

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 10, 2024

tiprow bot commented Dec 10, 2024

Hi @3pointer. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.

Project coverage is 74.9132%. Comparing base (68ac9ec) to head (99c99bb).
Report is 71 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #58135        +/-   ##
================================================
+ Coverage   73.1841%   74.9132%   +1.7291%     
================================================
  Files          1675       1693        +18     
  Lines        461917     466183      +4266     
================================================
+ Hits         338050     349233     +11183     
+ Misses       103127      95400      -7727     
- Partials      20740      21550       +810     
Flag          Coverage Δ
integration   46.5841% <0.0000%> (?)
unit          72.5386% <71.4285%> (+0.2238%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components    Coverage Δ
dumpling      52.6910% <ø> (ø)
parser        ∅ <ø> (∅)
br            61.7720% <71.4285%> (+15.8112%) ⬆️

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 10, 2024
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Dec 11, 2024
@@ -548,8 +626,10 @@ func TestCheckPointLagged(t *testing.T) {
})
adv.StartTaskListener(ctx)
c.advanceClusterTimeBy(2 * time.Minute)
// if global ts is not advanced, the checkpoint will not be lagged
Contributor

@RidRisR RidRisR Dec 11, 2024

Why? If there is a new task and the global checkpoint is never advanced, the task will never be paused, even if its lag exceeds the limit.

This implies that we would never pause a task that has never advanced.

Contributor Author

Actually, the comment should say "if global ts is less than task.start-ts", which implies there could be some corner cases when starting a new task.

Contributor

	if globalTs <= c.task.StartTs {
		// task is not started yet
		return false, nil
	}

Then maybe this should be < instead of <=?

Contributor Author

globalTs == c.task.StartTs will only happen when a new task has just been created.

Since this is not the common case once a task has been running for some time, I think it's better not to pause the task by default in this case.

Contributor

Sometimes the task may be stuck right from creation, say, when the advancer doesn't work or one of the TiKV nodes didn't notice the task.

Contributor Author

After talking with @YuJuncen, we agreed to make the check include the case where globalTs == task.StartTs. I changed it.

Additionally, I found improper error-return logic when adding a task. I also fixed it in this PR, along with the related test cases.

@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 12, 2024
@3pointer
Contributor Author

/retest


tiprow bot commented Dec 12, 2024

@3pointer: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


ti-chi-bot bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BornChanger, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 13, 2024

ti-chi-bot bot commented Dec 13, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-12-12 05:24:24.307792113 +0000 UTC m=+502454.396594649: ☑️ agreed by YuJuncen.
  • 2024-12-13 12:29:15.088349356 +0000 UTC m=+614345.177151898: ☑️ agreed by BornChanger.


tiprow bot commented Dec 13, 2024

@3pointer: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
fast_test_tiprow 99c99bb link true /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@BornChanger BornChanger added the needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. label Dec 13, 2024
@ti-chi-bot ti-chi-bot bot merged commit e3248e7 into pingcap:master Dec 13, 2024
30 of 33 checks passed
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Dec 13, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #58259.

@BornChanger BornChanger added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Dec 14, 2024
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #58265.

Labels
approved lgtm needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

log backup advancer stop the log backup task when it failed to get global checkpoint ts at the first tick
5 participants