roachtest: tpcc/mixed-headroom/n5cpu16 failed #39460

Closed
cockroach-teamcity opened this issue Aug 8, 2019 · 22 comments · Fixed by #40415
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/1051f376924f18c5782887ed824ab5ede5027493

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1429376&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190808-1429376/tpcc/mixed-headroom/n5cpu16/run_1
	cluster.go:1735,tpcc.go:161,tpcc.go:163,tpcc.go:174,tpcc.go:287,test_runner.go:691: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1565244744-51-n5cpu16:5 -- ./workload fixtures load tpcc --warehouses=1000  {pgurl:1} returned:
		stderr:
		
		stdout:
		2.3 MiB)
		I190808 15:38:24.294129 48 ccl/workloadccl/fixture.go:544  loaded 887 KiB table district in 49.365343776s (10000 rows, 0 index entries, 18 KiB)
		I190808 15:46:13.672006 147 ccl/workloadccl/fixture.go:544  loaded 899 MiB table order in 8m38.743319862s (30000000 rows, 30000000 index entries, 1.7 MiB)
		I190808 15:54:04.953429 146 ccl/workloadccl/fixture.go:544  loaded 3.0 GiB table history in 16m30.02486995s (30000000 rows, 60000000 index entries, 3.1 MiB)
		I190808 16:05:15.465386 49 ccl/workloadccl/fixture.go:544  loaded 18 GiB table customer in 27m40.536546241s (30000000 rows, 30000000 index entries, 11 MiB)
		I190808 16:14:28.827232 150 ccl/workloadccl/fixture.go:544  loaded 32 GiB table stock in 36m53.898478143s (100000000 rows, 100000000 index entries, 15 MiB)
		I190808 16:26:22.588272 151 ccl/workloadccl/fixture.go:544  loaded 17 GiB table order_line in 48m47.659755157s (300011520 rows, 300011520 index entries, 5.8 MiB)
		Error: restoring fixture: pq: syntax error at or near "inject"
		Error:  exit status 1
		: exit status 1

cockroach-teamcity added this to the 19.2 milestone Aug 8, 2019
cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Aug 8, 2019
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1460982&tab=buildLog

The test failed on branch=provisional_201908291837_v19.2.0-beta.20190903, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190829-1460982/tpcc/mixed-headroom/n5cpu16/run_1
	test_runner.go:688: test timed out (10h0m0s)

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/66bd279c9aa682c2b7adcec87ec0c639b8039a33

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1461635&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190830-1461635/tpcc/mixed-headroom/n5cpu16/run_1
	test_runner.go:688: test timed out (10h0m0s)

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/e8faca611a902766154ed82581d6d3a7483ad231

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1462518&tab=buildLog

The test failed on branch=provisional_201908291837_v19.2.0-beta.20190903, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190830-1462518/tpcc/mixed-headroom/n5cpu16/run_1
	cluster.go:1735,tpcc.go:162,tpcc.go:164,tpcc.go:175,tpcc.go:288,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567188851-78-n5cpu16:5 -- ./workload fixtures load tpcc --warehouses=1000  {pgurl:1} returned:
		stderr:
		
		stdout:
		932 78 ccl/workloadccl/fixture.go:544  loaded 7.8 MiB table item in 11m4.230218445s (100000 rows, 0 index entries, 12 KiB)
		I190831 00:16:10.134418 74 ccl/workloadccl/fixture.go:544  loaded 17 GiB table customer in 11m32.319972731s (30000000 rows, 30000000 index entries, 26 MiB)
		I190831 00:16:17.073731 76 ccl/workloadccl/fixture.go:544  loaded 1.3 GiB table order in 11m39.259084823s (30000000 rows, 30000000 index entries, 1.9 MiB)
		I190831 00:16:19.920875 75 ccl/workloadccl/fixture.go:544  loaded 4.3 GiB table history in 11m42.106337497s (30000000 rows, 60000000 index entries, 6.2 MiB)
		Error: restoring fixture: pq: importing 1101 ranges: split at key /Table/58/1/134/82131 failed: unexpected value: raw_bytes:"\301\214\342\205\003\010\222\003\022\010\302\211\366\205\370\0010\035\032\007\302\211\366\234\367\271\256\"\006\010\001\020\001\030\001\"\006\010\003\020\003\030\002\"\006\010\004\020\004\030\003\"\010\010\002\020\002\030\004 \000(\0050\020" timestamp:<wall_time:1567209887452784586 > 
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190902-1463583/tpcc/mixed-headroom/n5cpu16/run_1
	cluster.go:1735,tpcc.go:162,tpcc.go:164,tpcc.go:175,tpcc.go:288,test_runner.go:673: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1567405952-78-n5cpu16:5 -- ./workload fixtures load tpcc --warehouses=1000  {pgurl:1} returned:
		stderr:
		
		stdout:
		oadccl/fixture.go:544  loaded 4.3 GiB table history in 12m59.463696846s (30000000 rows, 60000000 index entries, 5.6 MiB)
		I190902 11:32:18.839982 119 ccl/workloadccl/fixture.go:544  loaded 1.3 GiB table order in 13m8.002791342s (30000000 rows, 30000000 index entries, 1.7 MiB)
		I190902 11:32:31.082507 120 ccl/workloadccl/fixture.go:544  loaded 126 MiB table new_order in 13m20.245122703s (9000000 rows, 0 index entries, 161 KiB)
		I190902 11:43:05.602125 123 ccl/workloadccl/fixture.go:544  loaded 23 GiB table order_line in 23m54.764995491s (300011520 rows, 300011520 index entries, 16 MiB)
		Error: restoring fixture: pq: importing 610 ranges: split at key /Table/58/1/781/4/687 failed: unexpected value: raw_bytes:"\021\005*\243\003\010\302\013\022\t\302\211\367\003\013\214\367\010C\032\t\302\211\367\003\035\214\367\n\330\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\002\"\006\010\003\020\003\030\003\"\010\010\002\020\002\030\004 \000(\0050\030" timestamp:<wall_time:1567423691929620802 > 
		Error:  exit status 1
		: exit status 1


tbg commented Sep 2, 2019

I'm seeing two major failure modes here:

  1. failed splits during the import with a message like
Error: restoring fixture: pq: importing 1528 ranges: split at key /T
raw_bytes:"\344x\317\000\003\010\351\003\022\t\301\211\322\220\206\370\364\214\210
\001\020\001\030\001"\006\010\003\020\003\030\002"\006\010\004\020\004\030\003"\0
tamp:<wall_time:1567431351149034101 >

This is unexpected: a raw *ConditionFailedError should never be returned from adminSplitWithDescriptor, in either 19.2 or 19.1 (the code is the same):

19.2:

	}); err != nil {
		// The ConditionFailedError can occur because the descriptors acting
		// as expected values in the CPuts used to update the left or right
		// range descriptors are picked outside the transaction. Return
		// ConditionFailedError in the error detail so that the command can be
		// retried.
		if msg, ok := maybeDescriptorChangedError(desc, err); ok {
			// NB: we have to wrap the existing error here as consumers of this
			// code look at the root cause to sniff out the changed descriptor.
			err = &benignError{errors.Wrap(err, msg)}
		}
		return reply, errors.Wrapf(err, "split at key %s failed", splitKey)

19.1:

	}); err != nil {
		// The ConditionFailedError can occur because the descriptors acting
		// as expected values in the CPuts used to update the left or right
		// range descriptors are picked outside the transaction. Return
		// ConditionFailedError in the error detail so that the command can be
		// retried.
		if msg, ok := maybeDescriptorChangedError(desc, err); ok {
			// NB: we have to wrap the existing error here as consumers of this
			// code look at the root cause to sniff out the changed descriptor.
			err = &benignError{errors.Wrap(err, msg)}
		}
		return reply, errors.Wrapf(err, "split at key %s failed", splitKey)
	}
	return reply, nil
}

The second failure is that this assertion gets hit when a preemptive snapshot gets sent by a 19.1 node to a 19.2 node:

tobias-1567423429-01-n5cpu16-0004> F190902 11:43:11.784656 125964 storage/replica.go:982  [n4,s4,r2109/?:/Table/56/1/398/6/949{-/0}] on-disk and in-memory state diverged: [Lease.Replica.Type: &roachpb.ReplicaType(0) != nil]
goroutine 125964 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000466101, 0xc000466120, 0x0, 0x5a)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x7bab9e0, 0xc000000004, 0x7378f32, 0x12, 0x3d6, 0xc00095c120, 0x85)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x93e
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4e29b00, 0xc0173dc5d0, 0xc000000004, 0x2, 0x0, 0x0, 0xc0294c5f40, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4e29b00, 0xc0173dc5d0, 0x1, 0xc000000004, 0x0, 0x0, 0xc0294c5f40, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(...)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:189
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).assertStateLocked(0xc0034db000, 0x4e29b00, 0xc0173dc5d0, 0x4e77b20, 0xc0004d7500)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:982 +0x74c
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applySnapshot(0xc0034db000, 0x4e29b00, 0xc0173dc5d0, 0x9548f418505fb0be, 0x379fe45034a031b9, 0xc0035fa640, 0x0, 0x0, 0x0, 0xc0007a9180, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raftstorage.go:960 +0xf3a
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processPreemptiveSnapshotRequest.func1(0x4e29b00, 0xc0173dc5d0, 0xc0034db000, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store_snapshot_preemptive.go:343 +0x6bb
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc0000fd800, 0x4e29b00, 0xc0173dc5d0, 0xc0007a8f48, 0xc0069b1418, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3352 +0x150
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processPreemptiveSnapshotRequest(0xc0000fd800, 0x4e29b00, 0xc0173dc270, 0xc0007a8f00, 0x9548f418505fb0be, 0x379fe45034a031b9, 0xc0035fa640, 0x0, 0x0, 0x0, ...)

That seems relatively straightforward to fix, if there isn't some unexpected snag.


tbg commented Sep 2, 2019

I have a theory for the Lease crash. I think that when a preemptive snapshot is sent from a 19.2 node to a 19.1 node, the snapshot's disk data contains a Lease whose ReplicaDescriptor likely has the replica type VOTER_FULL (instead of the equivalent but deprecated nil representation).

Now if that old node sends a snapshot to a new node, it will send it with a descriptor that has the lease replica's type unspecified, i.e. nil. But the corresponding disk state does have a value that's non-nil, boom.

This is the sort of thing the XXX_unrecognized fields would prevent, but we turned them off four years ago (daf39dd).

Still have to verify that this is the case. I'm more mystified by the split error, which I can't explain yet.
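
To see concretely how such a round trip loses the new field, here is a minimal, hypothetical sketch using plain structs and JSON in place of the actual protobuf-generated types (every name below is made up for illustration):

package main

import (
	"encoding/json"
	"fmt"
)

// v192Descriptor stands in for the 19.2 message, which carries the new Type field.
type v192Descriptor struct {
	ReplicaID int  `json:"replica_id"`
	Type      *int `json:"type,omitempty"` // added in "19.2"
}

// v191Descriptor stands in for the 19.1 message, which predates Type and, like
// our protos since daf39dd, keeps no unknown-field buffer.
type v191Descriptor struct {
	ReplicaID int `json:"replica_id"`
}

func main() {
	voterFull := 0
	orig, _ := json.Marshal(v192Descriptor{ReplicaID: 2, Type: &voterFull})

	// The 19.1 node decodes the message and later re-encodes it when sending
	// its own snapshot onward; the Type field is silently dropped.
	var old v191Descriptor
	_ = json.Unmarshal(orig, &old)
	reencoded, _ := json.Marshal(old)

	fmt.Printf("as written by 19.2:    %s\n", orig)      // {"replica_id":2,"type":0}
	fmt.Printf("after 19.1 round trip: %s\n", reencoded) // {"replica_id":2}

	// A 19.2 recipient of the re-encoded copy holds a nil Type in memory while
	// the on-disk bytes still carry the field: that is the "on-disk and
	// in-memory state diverged" assertion above.
}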


tbg commented Sep 2, 2019

I think this fixes the lease crash:

diff --git a/pkg/storage/replica_raftstorage.go b/pkg/storage/replica_raftstorage.go
index bf3a7bf687..f9bd4ee50d 100644
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -740,6 +740,7 @@ func (r *Replica) applySnapshot(
 	if s.Desc.RangeID != r.RangeID {
 		log.Fatalf(ctx, "unexpected range ID %d", s.Desc.RangeID)
 	}
+	inSnap.State = nil // prevent accidental reuse below
 
 	// Strip the DeprecatedTxnSpanGCThreshold. We don't care about it.
 	// TODO(nvanbenschoten): Remove in 20.1.
@@ -913,6 +914,25 @@ func (r *Replica) applySnapshot(
 	// has not yet been updated. Any errors past this point must therefore be
 	// treated as fatal.
 
+	// Re-load the ReplicaState from disk. The state so far was handed to us by
+	// some remote node that may not have been able to correctly represent the
+	// on-disk state. For an example of this, consider the ReplicaState.Type
+	// field which was added in 19.2. A 19.1 node receiving a 19.2 snapshot
+	// could have an on-disk ReplicaDescriptor with a non-nil type, but in
+	// memory this would not be represented, and thus encoded on the wire as
+	// the nil value. If that 19.1 node later sent a snapshot to a 19.2 node,
+	// it would send the node a nil type in inSnap.State, but the on-disk data
+	// would represent the corresponding value, which would trip assertions.
+	// See https://github.com/cockroachdb/cockroach/issues/39460 for details.
+	{
+		sl := stateloader.Make(s.Desc.RangeID)
+		var err error
+		s, err = sl.Load(ctx, r.store.Engine(), s.Desc)
+		if err != nil {
+			return errors.Wrap(err, "loading state")
+		}
+	}
+
 	if err := r.clearSubsumedReplicaInMemoryData(ctx, subsumedRepls, subsumedNextReplicaID); err != nil {
 		log.Fatalf(ctx, "failed to clear in-memory data of subsumed replicas while applying snapshot: %+v", err)
 	}


tbg commented Sep 2, 2019

Oh, regarding the failed splits -- this is unfortunately very likely just the whole CPut mess that wasn't fixed on the 19.x branch: #38302
And even if we did backport that, we don't require users to upgrade into the latest patch release.

The problem here is that once a descriptor contains a 19.2-specific field, 19.1 nodes will always fail their CPut: they decode the descriptor, silently drop the field they don't know about when re-encoding it, and then use those re-encoded bytes as the CPut's expected value, which no longer matches what is actually stored (a toy sketch of this follows the 19.2 dump below). Decoding the same on-disk bytes on each version shows the difference:

// Presumably run from pkg/roachpb's data_test.go; needs the testing,
// kr/pretty, and testify/require imports.
func TestFoo(t *testing.T) {
	s := []byte("\035\244\226 \003\010\205\023\022\t\300\211\367\001I\367\2205\210\032\010\300\211\367\001`\367\\\250\"\006\010\003\020\003\030\001\"\010\010\001\020\001\030\002 \000\"\006\010\004\020\004\030\003(\0040\025")
	v := Value{RawBytes: s}
	var desc RangeDescriptor
	require.NoError(t, v.GetProto(&desc))
	t.Error(pretty.Sprint(&desc))
}

the 19.1 node sees:

--- FAIL: TestFoo (0.00s)
    data_test.go:1522: &roachpb.RangeDescriptor{
            RangeID:  2437,
            StartKey: {0xc0, 0x89, 0xf7, 0x1, 0x49, 0xf7, 0x90, 0x35, 0x88},
            EndKey:   {0xc0, 0x89, 0xf7, 0x1, 0x60, 0xf7, 0x5c, 0xa8},
            Replicas: {
                {
                    NodeID:               3,
                    StoreID:              3,
                    ReplicaID:            1,
                    XXX_NoUnkeyedLiteral: struct {}{},
                    XXX_sizecache:        0,
                },
                {
                    NodeID:               1,
                    StoreID:              1,
                    ReplicaID:            2,
                    XXX_NoUnkeyedLiteral: struct {}{},
                    XXX_sizecache:        0,
                },
                {
                    NodeID:               4,
                    StoreID:              4,
                    ReplicaID:            3,
                    XXX_NoUnkeyedLiteral: struct {}{},
                    XXX_sizecache:        0,
                },
            },
            NextReplicaID:        4,
            Generation:           &int64(21),
            XXX_NoUnkeyedLiteral: struct {}{},
            XXX_sizecache:        0,
        }
FAIL

Process finished with exit code 1

but 19.2 reveals that the proto really has a replica whose type is &roachpb.ReplicaType(0) rather than nil:

  data_test.go:1891: &roachpb.RangeDescriptor{
            RangeID:          2437,
            StartKey:         {0xc0, 0x89, 0xf7, 0x1, 0x49, 0xf7, 0x90, 0x35, 0x88},
            EndKey:           {0xc0, 0x89, 0xf7, 0x1, 0x60, 0xf7, 0x5c, 0xa8},
            InternalReplicas: {
                {
                    NodeID:    3,
                    StoreID:   3,
                    ReplicaID: 1,
                    Type:      (*roachpb.ReplicaType)(nil),
                },
                {
                    NodeID:    1,
                    StoreID:   1,
                    ReplicaID: 2,
                    Type:      &roachpb.ReplicaType(0),
                },
                {
                    NodeID:    4,
                    StoreID:   4,
                    ReplicaID: 3,
                    Type:      (*roachpb.ReplicaType)(nil),
                },
            },
            NextReplicaID:        4,
            Generation:           &int64(21),
            GenerationComparable: (*bool)(nil),
            StickyBit:            (*hlc.Timestamp)(nil),
        }

On the plus side, the diff in my last post seems to have resolved the assertion crashes.


tbg commented Sep 2, 2019

Going to run 10 instances in which I modified the master binary to use nil instead of &VOTER_FULL (the &0 above). That should do it; let's see.


tbg commented Sep 2, 2019

They're all passing the import now but then immediately going bust with errors such as

            Error: error in orderStatus: select order fail: can't scan into dest[2]: invalid length for int8: 4
                Error:  exit status 1
                : exit status 1

paging @jordanlewis

@jordanlewis

I think this would be caused by the server returning 4 bytes for a 4-byte integer instead of the 8 bytes it used to return.

Does this use fixtures import or fixtures load? What's the data source? My guess is that we just need to update the test to use int4 instead of int8, but I'm not sure yet.
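
A minimal sketch of the kind of length check that produces that client-side error; this is a hypothetical illustration, not lib/pq's actual code:

package main

import (
	"encoding/binary"
	"fmt"
)

// decodeBinaryInt8 expects an 8-byte (int8) binary wire value and refuses a
// 4-byte (int4) one, which is the shape of the failure above.
func decodeBinaryInt8(b []byte) (int64, error) {
	if len(b) != 8 {
		return 0, fmt.Errorf("invalid length for int8: %d", len(b))
	}
	return int64(binary.BigEndian.Uint64(b)), nil
}

func main() {
	_, err := decodeBinaryInt8([]byte{0, 0, 0, 4}) // a 4-byte value from the server
	fmt.Println(err)                               // invalid length for int8: 4
}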

@jordanlewis

Hm, though it looks like all of the schemas say integer/int8. I'm not quite sure how you're reproducing this @tbg - let me know please; I can't tell if I need some non-master build.


tbg commented Sep 3, 2019

Another thing to investigate here: on two out of ten clusters, the imports got stuck:

[two screenshots]


tbg commented Sep 3, 2019

Looking at the second one through a 19.1 ui (n1):

[two screenshots]

Looks like things are pretty grim. This node somehow can't even figure out the replicas in the descriptor?

If I open the same page through n2, it still looks unhappy, but not as broken:

[screenshot]

Predictably, the logs are full of messages related to the fact that the range descriptor has turned to garbage.


tbg commented Sep 3, 2019

./cockroach debug range-descriptors seems to return reasonable data on both the 19.1 and 19.2 nodes:

https://gist.github.com/tbg/bd89db53e7b1e852ffca27f758800567

I can see from http://tobias-1567451338-04-n5cpu16-0002.roachprod.crdb.io:26258/_status/raft that the 19.2 nodes have this populated properly in-memory, but the 19.1 nodes return an empty internalReplicas. Interestingly, they are also missing NextReplicaID, which is puzzling - nothing ever changed about that.

Phew, whatever is going on here, things are majorly broken.


tbg commented Sep 3, 2019

@jordan here's a repro; I extended that cluster to 24h:

./bin/roachprod run tobias-1567451338-07-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=0m25s --duration=2h0m0s {pgurl:2}

Note that this one works:

./bin/roachprod run tobias-1567451338-07-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=perf/stats.json  --ramp=0m25s --duration=2h0m0s {pgurl:1}

This is weird to me because :1 is a 19.1 node, :2 is a 19.2 node. I would've expected things to fail against the old one, not the other way around.

I initially thought that I probably messed up the workload binary, but that doesn't seem to be it; I replaced it with one built fresh from my branch (i.e. master).


tbg commented Sep 3, 2019

I'd hazard a guess that this explains something:

tobias-1567451338-04-n5cpu16-0002> I190902 19:11:43.754115 2703 storage/replica_command.go:1433  [n2,s2,r204/1:/Table/58/1/108/1/-2158/9…] change replicas (add [(n1,s1):4] remove []): existing descriptor r204:/Table/58/1/108/1/-2158/9{-/0} [(n2,s2):1, (n3,s3):2, (n4,s4):3, next=4, gen=19?]
tobias-1567451338-04-n5cpu16-0003> I190902 19:11:43.768641 1996 storage/replica_raft.go:396  [n3,s3,r204/2:/Table/58/1/108/1/-2158/9…] proposing ADD_REPLICA((n1,s1):4): updated=[] next=0

n2 (19.2) runs a change replicas, and it looks like it's actually proposed on n3 (which probably has the lease). n3 is a 19.1 node, and it looks like it botches things pretty badly, as the "updated=[] next=0" indicates (we're supposed to see the new replica set and NextReplicaID there).

I'm seeing this in the other broken cluster too (the -03 one):

tobias-1567451338-03-n5cpu16-0002> I190902 19:25:34.697855 94972 storage/replica_command.go:1433  [n2,s2,r4746/1:/Table/60/1/8{20/495…-41/614…}] change replicas (add [] remove [(n4,s4):3]): existing descriptor r4746:/Table/60/1/8{20/49582-41/61499} [(n2,s2):1, (n3,s3):2, (n4,s4):3, (n1,s1):4, next=5, gen=14?]
tobias-1567451338-03-n5cpu16-0003> I190902 19:25:34.705844 33266 storage/replica_raft.go:396  [n3,s3,r4746/2:/Table/60/1/8{20/495…-41/614…}] proposing REMOVE_REPLICA((n4,s4):3): updated=[] next=0

Now to figure out what this is. Pretty sure I must have broken this; I refactored a number of things about the change replicas trigger recently.


tbg commented Sep 3, 2019

^- OK, I think the problem is actually not of my own doing; it's 52333d1.

That commit "migrated" the UpdatedReplicas field into a Desc field, but did so on the incorrect assumption that the coordinator node of a replication change would also be the evaluator/proposer. 19.2 nodes pass a trigger to txn.Run that is missing the UpdatedReplicas field. If the evaluator/proposer ends up being a 19.1 node, it will just use this empty UpdatedReplicas slice, bake a range descriptor without replicas, and propose that to Raft, and then all of the nodes get to enjoy a corrupted state.

Fix should be easy, I'll add something to my branch.
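
A rough sketch of what such a backward-compatible fix could look like; the types and field names below are illustrative stand-ins, not the actual roachpb definitions:

package main

import "fmt"

// Illustrative stand-ins for the real descriptor/trigger protos.
type replicaDescriptor struct{ NodeID, StoreID, ReplicaID int32 }

type rangeDescriptor struct {
	InternalReplicas []replicaDescriptor
	NextReplicaID    int32
}

type changeReplicasTrigger struct {
	// 19.2 representation: the full post-change descriptor.
	Desc *rangeDescriptor
	// Legacy fields that a 19.1 evaluator reads directly.
	DeprecatedUpdatedReplicas []replicaDescriptor
	DeprecatedNextReplicaID   int32
}

// makeTrigger keeps populating the legacy fields alongside Desc. If only Desc
// were set and a 19.1 node ended up evaluating/proposing the change, it would
// see an empty replica set ("updated=[] next=0") and commit a descriptor
// without replicas.
func makeTrigger(updated *rangeDescriptor) changeReplicasTrigger {
	return changeReplicasTrigger{
		Desc:                      updated,
		DeprecatedUpdatedReplicas: updated.InternalReplicas,
		DeprecatedNextReplicaID:   updated.NextReplicaID,
	}
}

func main() {
	desc := &rangeDescriptor{
		InternalReplicas: []replicaDescriptor{{2, 2, 1}, {3, 3, 2}, {4, 4, 3}, {1, 1, 4}},
		NextReplicaID:    5,
	}
	trig := makeTrigger(desc)
	fmt.Printf("updated=%v next=%d\n", trig.DeprecatedUpdatedReplicas, trig.DeprecatedNextReplicaID)
}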


tbg commented Sep 3, 2019

@jordanlewis it seems to be working now (i.e. no more int4/int8 errors). It's relatively likely that I used some old workload binary initially to restore the stuff, so I'll blame it on that.


tbg commented Sep 3, 2019

Even better, the imports all went through without a hitch. I might get lucky and actually get all these tests to pass, who knows?


tbg commented Sep 3, 2019

They all passed.
