
br: restore checksum shouldn't rely on backup checksum #56712

Merged
merged 13 commits into pingcap:master
Nov 18, 2024

Conversation


@Tristan1900 Tristan1900 commented Oct 18, 2024

What problem does this PR solve?

Issue Number: close #56373

Problem Summary:

What changed and how does it work?

  1. Disable checksum by default only for snapshot backup process.
  2. Keep checksum enabled for the rest
  3. During snapshot restore, verify the admin checksum by XOR-ing all per-file checksums from the backup meta, so restore no longer relies on backup having generated the CRC for the database.
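The aggregation in step 3 can be sketched in Go. This is an illustrative sketch, not BR's actual code: the struct and function names are hypothetical, but the combining rule (XOR the CRC64 values, sum the KV and byte counts) matches the behavior described above.

```go
package main

import "fmt"

// FileChecksum mirrors the per-file stats recorded in the backup meta
// (hypothetical struct; field names are illustrative).
type FileChecksum struct {
	Crc64Xor   uint64
	TotalKvs   uint64
	TotalBytes uint64
}

// aggregate combines per-file checksums into the expected table-level
// checksum: CRC64 values are XOR-ed, KV and byte counts are summed.
func aggregate(files []FileChecksum) FileChecksum {
	var out FileChecksum
	for _, f := range files {
		out.Crc64Xor ^= f.Crc64Xor
		out.TotalKvs += f.TotalKvs
		out.TotalBytes += f.TotalBytes
	}
	return out
}

func main() {
	files := []FileChecksum{
		{Crc64Xor: 0xDEAD, TotalKvs: 10, TotalBytes: 400},
		{Crc64Xor: 0xBEEF, TotalKvs: 5, TotalBytes: 200},
	}
	got := aggregate(files)
	fmt.Printf("crc=%#x kvs=%d bytes=%d\n", got.Crc64Xor, got.TotalKvs, got.TotalBytes)
	// prints: crc=0x6042 kvs=15 bytes=600
}
```

XOR is order-independent and self-inverse, which is why a table-level CRC can be reconstructed from per-file CRCs without the backup phase computing it.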

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

The default behavior of running a checksum during full backup has changed from enabled to disabled.
Note that the checksum flag is somewhat ambiguous here: disabling it does not skip the per-file checksums computed during backup. It only skips the coprocessor-side checksum that is compared against the one BR computes, which improves performance.

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>

ti-chi-bot bot commented Oct 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 18, 2024

tiprow bot commented Oct 18, 2024

Hi @Tristan1900. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/needs-tests-checked size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 21, 2024
@Tristan1900 Tristan1900 marked this pull request as ready for review October 21, 2024 22:11
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2024

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 22 lines in your changes missing coverage. Please review.

Project coverage is 59.8768%. Comparing base (255ea42) to head (95059bc).
Report is 401 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #56712         +/-   ##
=================================================
- Coverage   73.3472%   59.8768%   -13.4705%     
=================================================
  Files          1635       1801        +166     
  Lines        452734     674211     +221477     
=================================================
+ Hits         332068     403696      +71628     
- Misses       100306     246164     +145858     
- Partials      20360      24351       +3991     
Flag Coverage Δ
integration 40.6202% <75.0000%> (?)
unit 75.0713% <32.9545%> (+2.5971%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 55.2602% <ø> (+2.3124%) ⬆️
parser ∅ <ø> (∅)
br 64.1702% <72.5000%> (+18.1375%) ⬆️

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
mv "$filename" "$filename_bak"
done

# need to drop db otherwise restore will fail because of cluster not fresh but not the expected issue
Contributor Author

@Tristan1900 Tristan1900 Oct 23, 2024


The previous br_file_corruption test was not testing what it should: restore failed because the cluster was not fresh, not because of the corruption.

echo "corruption" > $filename_temp
cat $filename >> $filename_temp
# Replace the single file manipulation with a loop over all .sst files
for filename in $(find $TEST_DIR/$DB -name "*.sst"); do
Contributor Author

We need to corrupt every SST file because during restore not all SSTs are used: the backup includes a bunch of tables, but some of them are filtered out during restore.

truncate --size=-11 $filename
for filename in $(find $TEST_DIR/$DB -name "*.sst_temp"); do
mv "$filename" "${filename%_temp}"
truncate -s 11 "${filename%_temp}"
Contributor Author

--size=-11 doesn't work on macOS

Contributor

Suggested change
truncate -s 11 "${filename%_temp}"
truncate -s -11 "${filename%_temp}"


if tbl.OldTable.NoChecksum() {
logger.Warn("table has no checksum, skipping checksum")
expectedChecksumStats := metautil.CalculateChecksumStatsOnFiles(tbl.OldTable.Files)
if !expectedChecksumStats.ChecksumExists() {
Contributor Author

@Tristan1900 Tristan1900 Oct 23, 2024


Not sure if we need this check here. If an empty table is backed up, should we check after restore that it is still empty?

Contributor

Yes, the error log is misleading.

Contributor Author

Should I remove the check here altogether?

@Tristan1900
Contributor Author

/assign @Leavrth @YuJuncen

Contributor

@Leavrth Leavrth left a comment


rest LGTM

@@ -800,6 +800,13 @@ func DefaultBackupConfig() BackupConfig {
if err != nil {
log.Panic("infallible operation failed.", zap.Error(err))
}

// Check if the checksum flag was set by the user
if !fs.Changed("checksum") {
Contributor

use flagChecksum instead

Contributor Author

oh right... my mistake

Contributor

It seems that in this context fs.Changed(anything) is always false, since the flag set is newly created here.

Contributor

BTW, can we override the default value of --checksum in DefineBackupFlags...?

Contributor Author

I feel like overriding the default in DefineBackupFlags is a bit odd, since it is technically not a definition; it might be confusing.

// pipeline checksum
if cfg.Checksum {
// pipeline checksum only when enabled and not in incremental snapshot mode, because an incremental backup doesn't have
// enough information in its backup meta to validate the checksum
Contributor

we don't use the collection of file-level checksums to match the table-level checksum in the incremental backup. However, we use the table-level checksum in the incremental backup to match the table-level checksum in the incremental restore.

Contributor Author

Could you point me to the code? I thought the table-level checksum for incremental backup is not calculated during backup:

skipChecksum := !cfg.Checksum || isIncrementalBackup

Contributor

Got it.

Contributor

@YuJuncen YuJuncen left a comment


rest lgtm

@@ -145,7 +145,7 @@ func (ss *Schemas) BackupSchemas(
zap.Uint64("Crc64Xor", schema.crc64xor),
zap.Uint64("TotalKvs", schema.totalKvs),
zap.Uint64("TotalBytes", schema.totalBytes),
zap.Duration("calculate-take", calculateCost))
zap.Duration("Time taken", calculateCost))
Contributor

Keep consistency with other fields?

Suggested change
zap.Duration("Time taken", calculateCost))
zap.Duration("TimeTaken", calculateCost))

Comment on lines 247 to 250
func (stats *ChecksumStats) ChecksumExists() bool {
if stats == nil {
return false
}
Contributor

It seems the pointer receiver isn't needed here, as I cannot find anywhere that a *ChecksumStats rather than a ChecksumStats value is needed.

Suggested change
func (stats *ChecksumStats) ChecksumExists() bool {
if stats == nil {
return false
}
func (stats ChecksumStats) ChecksumExists() bool {
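The suggested value receiver also removes the need for the nil guard, since the method then always operates on a (possibly zero-valued) copy. A self-contained sketch; the body of ChecksumExists is not shown in the diff, so the predicate below is a hypothetical stand-in:

```go
package main

import "fmt"

// ChecksumStats mirrors the struct under discussion; the fields and the
// predicate body are hypothetical stand-ins for illustration.
type ChecksumStats struct {
	Crc64Xor   uint64
	TotalKvs   uint64
	TotalBytes uint64
}

// With a value receiver there is no pointer that could be nil, so the
// nil guard from the pointer-receiver version becomes unnecessary.
func (s ChecksumStats) ChecksumExists() bool {
	return s.Crc64Xor != 0 || s.TotalKvs != 0 || s.TotalBytes != 0
}

func main() {
	var zero ChecksumStats
	fmt.Println(zero.ChecksumExists()) // prints: false

	set := ChecksumStats{Crc64Xor: 42}
	fmt.Println(set.ChecksumExists()) // prints: true
}
```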

logger.Warn("table has no checksum, skipping checksum")
expectedChecksumStats := metautil.CalculateChecksumStatsOnFiles(tbl.OldTable.Files)
if !expectedChecksumStats.ChecksumExists() {
logger.Error("table has no checksum, skipping checksum")
Contributor

Would you also print the table and database name here? Also I think this can be a warning instead of an error.

Contributor Author

Right, I think the line above adds the db and table as fields for all subsequent logging:

	logger := log.L().With(
		zap.String("db", tbl.OldTable.DB.Name.O),
		zap.String("table", tbl.OldTable.Info.Name.O),
	)

Contributor Author

I'm a bit skeptical about this logic; should we just remove it? If an empty table is restored (could that case happen?), we should still check the checksum to make sure it is empty. What do you think?

zap.Uint64("calculated total bytes", item.TotalBytes),
)
return errors.Annotate(berrors.ErrRestoreChecksumMismatch, "failed to validate checksum")
}
logger.Info("success in validate checksum")
logger.Info("success in validating checksum")
Contributor

Would you also add the table / database name to this log?

Contributor Author

same as above


// Test with checksum flag set
os.Args = []string{"cmd", "--checksum=true"}
cfg = DefaultBackupConfig()
require.True(t, cfg.Checksum)
Contributor

I don't think DefaultBackupConfig parses the command line from os.Args... To be honest, I'm not even sure how this case passes...

Contributor Author

Yeah, you are absolutely right; I should have revisited this before marking the PR ready for review. It did pass locally and on CI, which is weird.

run_br --pd $PD_ADDR restore full -s "local://$TEST_DIR/$DB" --checksum=true || restore_fail=1
export GO_FAILPOINTS=""
if [ $restore_fail -ne 1 ]; then
echo 'expect restore to fail on checksum mismatch but succeed'
Contributor

nit: I guess with a corrupted SST file, the restore fails even with --checksum=false... Perhaps a better (though harder) way is to delete a file entry in the backupmeta.

Contributor Author

Oh, this test isn't testing a corrupted SST file; the valid files are moved back after the previous corruption test. This test just verifies that the checksum process runs during restore even when the backup disabled checksums, by injecting a failpoint into the checksum routine. Right after this test there is a sanity test that restores successfully, verifying all files are valid.

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
@Tristan1900
Contributor Author

/retest


tiprow bot commented Nov 18, 2024

@Tristan1900: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
tidb_parser_test 95059bc link true /test tidb_parser_test
fast_test_tiprow 95059bc link true /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot merged commit 4f047be into pingcap:master Nov 18, 2024
31 of 33 checks passed
@Tristan1900 Tristan1900 deleted the checksum branch November 18, 2024 16:49
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Nov 19, 2024
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #57503.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 20, 2024
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Dec 2, 2024
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Dec 2, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #57887.

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. label Dec 4, 2024
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Dec 4, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #57984.

@ti-chi-bot ti-chi-bot bot removed the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Dec 12, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 12, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 12, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. label Dec 17, 2024
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #58335.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Dec 17, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Labels
approved lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

BR-restore should still perform checksum even when the schema checksum is missing
4 participants