63309: cliccl/debug_backup.go: refactor backup inspection tool r=pbardea a=Elliebababa

Previously, the CLI backup inspection tool lived in
cliccl/load.go, and both the command name and its syntax
were confusing for backup inspection.

This patch refactors the naming and syntax of the tool
and hangs it under `cockroach debug`, so that it reads
more naturally when used to debug backups.

Release note (cli change): Backup inspection used to be done via
`cockroach load show ..`, which confused users with ambiguous
verbs in the command chain. The tool's syntax has been reworked
to be clearer and more indicative for debugging backups.
The changes are:
`load show summary <backup_url>` ->
`debug backup show <backup_url>`
```
$ ./cockroach debug backup show $AWS_BACKUP_PATH
{
	"StartTime": "1970-01-01T00:00:00Z",
	"EndTime": "2021-04-05T22:02:25Z",
	"DataSize": "110 KiB",
	"Rows": 224,
	"IndexEntries": 324,
	"FormatVersion": 1,
	"ClusterID": "7c1b9ceb-d05a-4879-b4a4-886ace456953",
	"NodeID": 0,
	"BuildInfo": "CockroachDB CCL v21.1.0-alpha.3-1790-g25dc628337-dirty (x86_64-apple-darwin19.6.0, built 2021/03/29 18:50:47, go1.15.7)",
	"Files": [
		{
			"Path": "647462079350243329.sst",
			"Span": "/Table/4/{1-2}",
			"DataSize": "99 B",
			"IndexEntries": 0,
			"Rows": 2
		},
	],
	"Spans": "[/Table/4/{1-2}]",
	"DatabaseDescriptors": {
		"1": "system"
	},
	"TableDescriptors": {
		"53": "demodb.public.demo",
		"54": "demodb.public.foo"
	},
	"TypeDescriptors": {},
	"SchemaDescriptors": {
		"29": "public"
	}
}
```

`load show incremental <backup_url>` ->
`debug backup list-incremental <backup_url>`
```
$ ./cockroach debug backup list-incremental $AWS_BACKUP_PATH
              path             |      start time      |       end time
-------------------------------+----------------------+-----------------------
  /mybackup                    | -                    | 2021-04-09T20:22:59Z
  /mybackup/20210409/202303.20 | 2021-04-09T20:22:59Z | 2021-04-09T20:23:03Z
  /mybackup/20210409/202305.97 | 2021-04-09T20:23:03Z | 2021-04-09T20:23:05Z
(3 rows)
```

`load show backups <collection_url>` ->
`debug backup list-backups <collection_url>`
```
$ ./cockroach debug backup list-backups $AWS_COLLECTION_PATH
           path
--------------------------
  ./2021/04/09-202455.98
  ./2021/04/09-202522.59
  ./2021/04/09-202526.93
(3 rows)
```

`load show data <table_name> <backup_url>` ->
`debug backup export <backup_url> --table=<table_name> [--destination=</path/to/export>]`
```
$ ./cockroach debug backup export $AWS_BACKUP_PATH --table=demodb.public.demo
1,null,true
2,'ellie',null
```

64028: kv: don't clear raftRequestQueue of right-hand side of Range split r=nvanbenschoten a=nvanbenschoten

This commit fixes a flake in `TestLeaderAfterSplit` that I observed in CI and
that we've seen at least once in #43564 (comment).
I bisected the flake back to a591707, but that wasn't the real source of the
flakiness - the move from `multiTestContext` to `TestCluster` merely changed the
transport mechanism between replicas and revealed an existing bug.

The real issue was that, upon applying a split, any previously established
`raftRequestQueue` for the RHS replica was discarded. As a result, we could
see the following series of events:
```
1. r1 is created from a split
2. r1 campaigns to establish a leader for the new range
3. r1 sends MsgPreVote msgs to r2 and r3
4. s2 and s3 both receive the messages for the uninitialized r2 and r3, respectively.
5. raftRequestQueues are established for r2 and r3, and the MsgPreVotes are added
6. the split triggers to create r2 and r3 finally fire
7. the raftRequestQueues for r2 and r3 are discarded
8. the election stalls indefinitely, because the test sets RaftElectionTimeoutTicks=1000000
```

Of course, in real deployments, `RaftElectionTimeoutTicks` will never be set so
high, so a new election will be called again after about 3 seconds. Still, this
could cause unavailability immediately after a split for about 3s even in real
deployments, so it seems worthwhile to fix.

This commit fixes the issue by removing the logic to discard an uninitialized
replica's `raftRequestQueue` upon applying a split that initializes the replica.
That logic looks quite intentional, but if we look back at when it was added, we
see that it wasn't entirely deliberate. The code was added in d3b0e73, which
extracted everything except the call to `s.mu.replicas.Delete(int64(rangeID))`
from `unlinkReplicaByRangeIDLocked`. So the change wasn't intentionally
discarding the queue, it was just trying not to change the existing behavior.

This change is safe and does not risk leaking the `raftRequestQueue` because
we are removing from `s.mu.uninitReplicas` but will immediately call into
`addReplicaInternalLocked` to add an initialized replica.
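
To make the queue-preservation behavior concrete, here is a minimal, self-contained Go sketch of the idea. The `store`, `handleRaftMessage`, and `applySplit` names and shapes are simplified stand-ins for illustration only, not CockroachDB's actual types or signatures.
```go
package main

import "fmt"

// raftMsg stands in for a raft message (e.g. a MsgPreVote) addressed to a range.
type raftMsg struct {
	from int
	kind string
}

// store is a toy model of the relevant Store state: which replicas are
// initialized, and a per-range queue of raft messages that arrived before
// the replica was initialized.
type store struct {
	replicas          map[int]bool      // rangeID -> initialized?
	raftRequestQueues map[int][]raftMsg // rangeID -> queued messages
}

// handleRaftMessage queues messages for ranges whose replica is not yet
// initialized (step 5 in the sequence above).
func (s *store) handleRaftMessage(rangeID int, m raftMsg) {
	if !s.replicas[rangeID] {
		s.raftRequestQueues[rangeID] = append(s.raftRequestQueues[rangeID], m)
	}
}

// applySplit initializes the RHS replica. The old behavior discarded the
// queue at this point (the commented-out line), losing the queued MsgPreVote;
// the fix keeps the queue so the messages can still be delivered.
func (s *store) applySplit(rhsRangeID int) []raftMsg {
	s.replicas[rhsRangeID] = true
	// delete(s.raftRequestQueues, rhsRangeID) // old behavior: queued votes lost
	queued := s.raftRequestQueues[rhsRangeID]
	delete(s.raftRequestQueues, rhsRangeID) // hand the messages off, don't drop them
	return queued
}

func main() {
	s := &store{
		replicas:          map[int]bool{},
		raftRequestQueues: map[int][]raftMsg{},
	}
	// A MsgPreVote arrives for the still-uninitialized RHS replica (rangeID 2).
	s.handleRaftMessage(2, raftMsg{from: 1, kind: "MsgPreVote"})
	// With the fix, the queued message survives the split and the election proceeds.
	fmt.Println("delivered after split:", s.applySplit(2))
}
```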

Release note (bug fix): Fix a rare race that could lead to a 3 second stall
before a Raft leader was elected on a Range immediately after it was split off
from its left-hand neighbor.

64032: kv: don't allow node liveness to regress in Gossip network r=nvanbenschoten a=nvanbenschoten

In #64028, we fixed a long-standing flake in `TestLeaderAfterSplit`. However,
the test had actually gotten more flaky recently, which I bisected back to
df826cd. The problem we occasionally see with the test is that all three
replicas of a post-split Range call an election, resulting in a hung vote. Since
the test is configured with RaftElectionTimeoutTicks=1000000, a follow-up
election is never called, so the test times out.

After some debugging, I found that the range would occasionally split while the
non-leaseholder nodes (n2 and n3) thought that the leaseholder node (n1) was not
live. This meant that their call to `shouldCampaignOnWake` in the split trigger
considered the RHS's epoch-based lease to be invalid (state = ERROR). So all
three replicas would call an election and the test would get stuck.
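
As a rough illustration of that decision, here is a hedged Go sketch; `shouldCampaignOnWake` exists in CockroachDB, but the types and logic below are a simplified model for this explanation, not the real implementation.
```go
package main

import "fmt"

// leaseState is a simplified stand-in for a lease status; only the two states
// relevant to the explanation above are modeled.
type leaseState int

const (
	leaseValid leaseState = iota
	leaseError // epoch-based lease can't be verified, e.g. the holder isn't considered live
)

// shouldCampaignOnWake is a hypothetical, simplified model of the decision
// described above: with no known raft leader and a lease that can't be
// trusted, the replica campaigns instead of deferring to the leaseholder.
func shouldCampaignOnWake(raftLeaderKnown bool, lease leaseState) bool {
	if raftLeaderKnown {
		return false
	}
	return lease != leaseValid
}

func main() {
	// n2 and n3 think n1 (the leaseholder) is not live, so the RHS's epoch-based
	// lease evaluates to ERROR and every replica campaigns, producing a hung vote.
	fmt.Println(shouldCampaignOnWake(false, leaseError)) // true
}
```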

The offending commit introduced this new flake because of this change:
df826cd#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8R454

With the call to `MaybeGossipNodeLiveness` now asynchronous on the
node-liveness range, it was possible for two calls to `MaybeGossipNodeLiveness`
to race, one asynchronously triggered by `leasePostApplyLocked` and one
synchronously triggered by `handleReadWriteLocalEvalResult` due to a node
liveness update. This allowed for the following ordering of events:
```
- async call reads liveness(nid:1 epo:0 exp:0,0)
- sync call writes and then reads liveness(nid:1 epo:1 exp:1619645671.921265300,0)
- sync call adds liveness(nid:1 epo:1 exp:1619645671.921265300,0) to gossip
- async call adds liveness(nid:1 epo:0 exp:0,0) to gossip
```

Once this had occurred, n2 and n3 never again considered n1 live. Gossip never
recovered from this state because the liveness record was never heartbeated
again, due to the test's configuration of `RaftElectionTimeoutTicks=1000000`.

This commit fixes the bug by ensuring that all calls to `MaybeGossipNodeLiveness`
and `MaybeGossipSystemConfig` hold the raft mutex. This provides the necessary
serialization to avoid the data race, a requirement that was in fact already
documented on `MaybeGossipSystemConfig`.
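
The following is a minimal Go sketch of why holding a single mutex across the read-and-gossip step rules out the interleaving above; the types and function shapes are hypothetical simplifications, not the actual CockroachDB code.
```go
package main

import (
	"fmt"
	"sync"
)

// liveness is a minimal stand-in for a node liveness record.
type liveness struct {
	nodeID, epoch int
}

// gossiper models the gossip layer as last-write-wins per key, which is why
// publishing a stale read can regress the record that peers see.
type gossiper struct {
	mu     sync.Mutex
	latest liveness
}

func (g *gossiper) add(l liveness) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.latest = l // no epoch check: whatever is added last wins
}

// maybeGossipNodeLiveness reads the current record and publishes it. Holding
// raftMu across both the read and the publish means a concurrent caller cannot
// slip a stale read's publish in after a newer read has been published.
func maybeGossipNodeLiveness(raftMu *sync.Mutex, read func() liveness, g *gossiper) {
	raftMu.Lock()
	defer raftMu.Unlock()
	g.add(read())
}

func main() {
	var raftMu, recordMu sync.Mutex
	record := liveness{nodeID: 1, epoch: 0}
	read := func() liveness { recordMu.Lock(); defer recordMu.Unlock(); return record }
	g := &gossiper{}

	var wg sync.WaitGroup
	wg.Add(2)
	// Async trigger (leasePostApplyLocked in the description above).
	go func() { defer wg.Done(); maybeGossipNodeLiveness(&raftMu, read, g) }()
	// Sync trigger: a liveness update bumps the epoch, then gossips it.
	go func() {
		defer wg.Done()
		recordMu.Lock()
		record.epoch = 1
		recordMu.Unlock()
		maybeGossipNodeLiveness(&raftMu, read, g)
	}()
	wg.Wait()
	// Always epoch 1: without holding raftMu across read+publish, the stale
	// epoch-0 read could be gossiped last and the record would regress.
	fmt.Println("gossiped record:", g.latest)
}
```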

64070: build: retry register.cockroachdb.com requests r=rail a=stevendanna

Retry transient failures when fetching a developer license. We may want to
avoid retries here as a forcing function to make that service more reliable,
but given that we might also see timeouts from local networking problems
unrelated to the service, I'd prefer not to fail the test run because of a
single curl failure.

Release note: None

Co-authored-by: elliebababa <ellie24.huang@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
4 people committed Apr 22, 2021
5 parents 98ba7f3 + 3caed9e + 4e5389a + 3e4adfc + d974193 commit 5f40d69
Showing 18 changed files with 207 additions and 176 deletions.
2 changes: 1 addition & 1 deletion build/teamcity-local-roachtest.sh
@@ -6,7 +6,7 @@ source "$(dirname "${0}")/teamcity-support.sh"

tc_start_block "Prepare environment"
# Grab a testing license good for one hour.
COCKROACH_DEV_LICENSE=$(curl -f "https://register.cockroachdb.com/api/prodtest")
COCKROACH_DEV_LICENSE=$(curl --retry 3 --retry-connrefused -f "https://register.cockroachdb.com/api/prodtest")
run mkdir -p artifacts
maybe_ccache
tc_end_block "Prepare environment"
9 changes: 6 additions & 3 deletions pkg/ccl/backupccl/targets.go
@@ -474,9 +474,12 @@ func MakeBackupTableEntry(
for i, b := range backupManifests {
if b.StartTime.Less(endTime) && endTime.LessEq(b.EndTime) {
if endTime != b.EndTime && b.MVCCFilter != MVCCFilter_All {
return BackupTableEntry{}, errors.Newf(
"reading data for requested time requires that BACKUP was created with %q"+
" or should specify the time to be an exact backup time, nearest backup time is %s", backupOptRevisionHistory, timeutil.Unix(0, b.EndTime.WallTime).UTC())
errorHints := "reading data for requested time requires that BACKUP was created with %q" +
" or should specify the time to be an exact backup time, nearest backup time is %s"
return BackupTableEntry{}, errors.WithHintf(
errors.Newf("unknown read time: %s", timeutil.Unix(0, endTime.WallTime).UTC()),
errorHints, backupOptRevisionHistory, timeutil.Unix(0, b.EndTime.WallTime).UTC(),
)
}
ind = i
break
4 changes: 2 additions & 2 deletions pkg/ccl/cliccl/BUILD.bazel
@@ -5,8 +5,8 @@ go_library(
srcs = [
"cliccl.go",
"debug.go",
"debug_backup.go",
"demo.go",
"load.go",
"mtproxy.go",
"start.go",
],
@@ -65,7 +65,7 @@ go_test(
name = "cliccl_test",
size = "small",
srcs = [
"load_test.go",
"debug_backup_test.go",
"main_test.go",
],
embed = [":cliccl"],
132 changes: 65 additions & 67 deletions pkg/ccl/cliccl/load.go → pkg/ccl/cliccl/debug_backup.go
@@ -18,7 +18,6 @@ import (
"os"
"path/filepath"
"strings"
"text/tabwriter"
"time"

"github.com/cockroachdb/apd/v2"
@@ -54,113 +53,111 @@ import (
)

var externalIODir string
var exportTableName string
var readTime string
var destination string
var format string
var nullas string

func init() {

loadShowSummaryCmd := &cobra.Command{
Use: "summary <backup_path>",
Short: "show backups summary",
showCmd := &cobra.Command{
Use: "show <backup_path>",
Short: "show backup summary",
Long: "Shows summary of meta information about a SQL backup.",
Args: cobra.ExactArgs(1),
RunE: cli.MaybeDecorateGRPCError(runLoadShowSummary),
RunE: cli.MaybeDecorateGRPCError(runShowCmd),
}

loadShowBackupsCmd := &cobra.Command{
Use: "backups <backup_path>",
Short: "show backups in collections",
Long: "Shows full backups in a backup collections.",
listBackupsCmd := &cobra.Command{
Use: "list-backups <collection_path>",
Short: "show backups in collection",
Long: "Shows full backup paths in a backup collection.",
Args: cobra.ExactArgs(1),
RunE: cli.MaybeDecorateGRPCError(runLoadShowBackups),
RunE: cli.MaybeDecorateGRPCError(runListBackupsCmd),
}

loadShowIncrementalCmd := &cobra.Command{
Use: "incremental <backup_path>",
listIncrementalCmd := &cobra.Command{
Use: "list-incremental <backup_path>",
Short: "show incremental backups",
Long: "Shows incremental chain of a SQL backup.",
Args: cobra.ExactArgs(1),
RunE: cli.MaybeDecorateGRPCError(runLoadShowIncremental),
RunE: cli.MaybeDecorateGRPCError(runListIncrementalCmd),
}

loadShowDataCmd := &cobra.Command{
Use: "data <table> <backup_path>",
Short: "show data",
Long: "Shows data of a SQL backup.",
Args: cobra.MinimumNArgs(2),
RunE: cli.MaybeDecorateGRPCError(runLoadShowData),
exportDataCmd := &cobra.Command{
Use: "export <backup_path>",
Short: "export table data from a backup",
Long: "export table data from a backup, requires specifying --table to export data from",
Args: cobra.MinimumNArgs(1),
RunE: cli.MaybeDecorateGRPCError(runExportDataCmd),
}

loadShowCmds := &cobra.Command{
Use: "show [command]",
Short: "show backups",
backupCmds := &cobra.Command{
Use: "backup [command]",
Short: "debug backups",
Long: "Shows information about a SQL backup.",
Args: cobra.MinimumNArgs(1),
RunE: func(cmd *cobra.Command, args []string) error {
return cmd.Usage()
},
}

loadCmds := &cobra.Command{
Use: "load [command]",
Short: "load backup commands",
Long: `Commands for bulk loading external files.`,
RunE: func(cmd *cobra.Command, args []string) error {
return cmd.Usage()
},
}

loadFlags := loadCmds.Flags()
loadFlags.StringVarP(
backupFlags := backupCmds.Flags()
backupFlags.StringVarP(
&externalIODir,
cliflags.ExternalIODir.Name,
cliflags.ExternalIODir.Shorthand,
"", /*value*/
cliflags.ExternalIODir.Usage())

loadShowDataCmd.Flags().StringVarP(
exportDataCmd.Flags().StringVarP(
&exportTableName,
cliflags.ExportTableTarget.Name,
cliflags.ExportTableTarget.Shorthand,
"", /*value*/
cliflags.ExportTableTarget.Usage())

exportDataCmd.Flags().StringVarP(
&readTime,
cliflags.ReadTime.Name,
cliflags.ReadTime.Shorthand,
"", /*value*/
cliflags.ReadTime.Usage())

loadShowDataCmd.Flags().StringVarP(
exportDataCmd.Flags().StringVarP(
&destination,
cliflags.ExportDestination.Name,
cliflags.ExportDestination.Shorthand,
"", /*value*/
cliflags.ExportDestination.Usage())

loadShowDataCmd.Flags().StringVarP(
exportDataCmd.Flags().StringVarP(
&format,
cliflags.ExportTableFormat.Name,
cliflags.ExportTableFormat.Shorthand,
"csv", /*value*/
cliflags.ExportTableFormat.Usage())

loadShowDataCmd.Flags().StringVarP(
exportDataCmd.Flags().StringVarP(
&nullas,
cliflags.ExportCSVNullas.Name,
cliflags.ExportCSVNullas.Shorthand,
"null", /*value*/
cliflags.ExportCSVNullas.Usage())

cli.AddCmd(loadCmds)
loadCmds.AddCommand(loadShowCmds)
cli.DebugCmd.AddCommand(backupCmds)

loadShowSubCmds := []*cobra.Command{
loadShowSummaryCmd,
loadShowBackupsCmd,
loadShowIncrementalCmd,
loadShowDataCmd,
backupSubCmds := []*cobra.Command{
showCmd,
listBackupsCmd,
listIncrementalCmd,
exportDataCmd,
}

for _, cmd := range loadShowSubCmds {
loadShowCmds.AddCommand(cmd)
cmd.Flags().AddFlagSet(loadFlags)
for _, cmd := range backupSubCmds {
backupCmds.AddCommand(cmd)
cmd.Flags().AddFlagSet(backupFlags)
}
}

@@ -198,7 +195,7 @@ func getManifestFromURI(ctx context.Context, path string) (backupccl.BackupManif
return backupManifest, nil
}

func runLoadShowSummary(cmd *cobra.Command, args []string) error {
func runShowCmd(cmd *cobra.Command, args []string) error {

path := args[0]
ctx := context.Background()
@@ -217,7 +214,7 @@ func runLoadShowSummary(cmd *cobra.Command, args []string) error {
return nil
}

func runLoadShowBackups(cmd *cobra.Command, args []string) error {
func runListBackupsCmd(cmd *cobra.Command, args []string) error {

path := args[0]
if !strings.Contains(path, "://") {
@@ -235,18 +232,17 @@ func runLoadShowBackups(cmd *cobra.Command, args []string) error {
return errors.Wrapf(err, "list full backups in collection")
}

if len(backupPaths) == 0 {
fmt.Println("no backups found.")
}

cols := []string{"path"}
rows := make([][]string, 0)
for _, backupPath := range backupPaths {
fmt.Println("./" + backupPath)
newRow := []string{"./" + backupPath}
rows = append(rows, newRow)
}

return nil
rowSliceIter := cli.NewRowSliceIter(rows, "l" /*align*/)
return cli.PrintQueryOutput(os.Stdout, cols, rowSliceIter)
}

func runLoadShowIncremental(cmd *cobra.Command, args []string) error {
func runListIncrementalCmd(cmd *cobra.Command, args []string) error {

path := args[0]
if !strings.Contains(path, "://") {
@@ -270,12 +266,12 @@ func runLoadShowIncremental(cmd *cobra.Command, args []string) error {
return err
}

w := tabwriter.NewWriter(os.Stdout, 28 /*minwidth*/, 1 /*tabwidth*/, 2 /*padding*/, ' ' /*padchar*/, 0 /*flags*/)
basepath := uri.Path
manifestPaths := append([]string{""}, incPaths...)
stores := make([]cloud.ExternalStorage, len(manifestPaths))
stores[0] = store

rows := make([][]string, 0)
for i := range manifestPaths {

if i > 0 {
@@ -296,19 +292,21 @@ func runLoadShowIncremental(cmd *cobra.Command, args []string) error {
if i == 0 {
startTime = "-"
}
fmt.Fprintf(w, "%s %s %s\n", uri.Path, startTime, endTime)
}

if err := w.Flush(); err != nil {
return err
newRow := []string{uri.Path, startTime, endTime}
rows = append(rows, newRow)
}
return nil
cols := []string{"path", "start time", "end time"}
rowSliceIter := cli.NewRowSliceIter(rows, "lll" /*align*/)
return cli.PrintQueryOutput(os.Stdout, cols, rowSliceIter)
}

func runLoadShowData(cmd *cobra.Command, args []string) error {
func runExportDataCmd(cmd *cobra.Command, args []string) error {

fullyQualifiedTableName := strings.ToLower(args[0])
manifestPaths := args[1:]
if exportTableName == "" {
return errors.New("export data requires table name specified by --table flag")
}
fullyQualifiedTableName := strings.ToLower(exportTableName)
manifestPaths := args

ctx := context.Background()
manifests := make([]backupccl.BackupManifest, 0, len(manifestPaths))