
roachtest: sysbench/oltp_write_only/nodes=3/cpu=32/conc=256 failed [tocommit out of range] #97926

Closed
cockroach-teamcity opened this issue Mar 2, 2023 · 17 comments · Fixed by #98721
Labels
A-kv-replication Relating to Raft, consensus, and coordination. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

@cockroach-teamcity

cockroach-teamcity commented Mar 2, 2023

roachtest.sysbench/oltp_write_only/nodes=3/cpu=32/conc=256 failed with artifacts on master @ 20e2adda3c76c7172dd986c871df0ae9a346918f:

test artifacts and logs in: /artifacts/sysbench/oltp_write_only/nodes=3/cpu=32/conc=256/run_1
(cluster.go:1969).Run: output in run_161729.508110704_n4_sysbench-dbdriverpgs: sysbench \
		--db-driver=pgsql \
		--pgsql-host={pghost:1} \
		--pgsql-port=26257 \
		--pgsql-user=root \
		--pgsql-password= \
		--pgsql-db=sysbench \
		--report-interval=1 \
		--time=600 \
		--threads=256 \
		--tables=10 \
		--table_size=10000000 \
		--auto_inc=false \
		oltp_write_only prepare returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 7)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=32 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-24968

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 2, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 2, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Mar 2, 2023
@renatolabs

Quite a few liveness errors and eventually node 1 crashes with a panic from etcd/raft:

go.etcd.io/raft/v3/log.go:324 ⋮ [T1,n1,s1,r101/1:‹/Table/11{1/1/3781…-2}›] 739 
tocommit(489) is out of range [lastIndex(427)]. Was the raft log corrupted, truncated, or lost?

cc @cockroachdb/replication

@blathers-crl

blathers-crl bot commented Mar 2, 2023

cc @cockroachdb/replication

@renatolabs renatolabs removed the T-testeng TestEng Team label Mar 2, 2023
@erikgrinaker

@pavelkalinnikov This seems like a high-priority problem; we'd appreciate it if you could take an initial look.

@pav-kv pav-kv self-assigned this Mar 6, 2023
@pav-kv

pav-kv commented Mar 6, 2023

Node 1 fell behind and fails on an empty MsgApp while in the probing state:

{
  "span": {
    "start_key": "/Table/111/1/3781091",
    "end_key": "/Table/112"
  },
  "raft_state": {
    "replica_id": 2,
    "hard_state": {
      "term": 6,
      "vote": 2,
      "commit": 821
    },
    "lead": 2,
    "state": "StateLeader",
    "applied": 821,
    "progress": {
      "1": {
        "match": 427,
        "next": 428,
        "state": "StateProbe",
        "paused": true
      },
      "2": {
        "match": 821,
        "next": 822,
        "state": "StateReplicate"
      },
      "3": {
        "match": 821,
        "next": 822,
        "state": "StateReplicate"
      }
    }
  },
  "state": {
    "state": {
      "raft_applied_index": 821,
      "lease_applied_index": 508,

@pav-kv

pav-kv commented Mar 6, 2023

Possibly this is the bug that @tbg mentioned in this comment (see below "Also, isn't there a (pre-existing) bug here?").

@pav-kv pav-kv added A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Mar 6, 2023
@pav-kv

pav-kv commented Mar 6, 2023

Same symptoms in another failure here: #97389 (comment)

@pav-kv

pav-kv commented Mar 6, 2023

The message causing the panic is a MsgApp sent from here:

```go
pb.Message{
	Type:    pb.MsgApp,
	To:      1,
	From:    2,
	Term:    6,
	Index:   489,
	LogTerm: 0,  // <--- Why is this 0?
	Entries: [], // <--- Looks like no entries were found with index >= 489 (or max-inflight is saturated)?
	Commit:  660,
}
```

raftLog.maybeAppend checks matchTerm, but the latter silently accepts this 0 term as matching: https://github.com/etcd-io/raft/blob/d9907d6ac6baaebc3c9fd4e67acaa4154d2b3cd3/log.go#L391-L396. Hence we're entering the branch which asserts.

LogTerm isn't supposed to be 0 in the first place, so we need to understand why it is. It can happen when an entry is not found, but how can an entry not be found in maybeSendAppend, where this value is populated?
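
To make the failure mode concrete, here is a minimal, self-contained Go sketch (a toy model, not the upstream etcd-io/raft code; the names and numbers are illustrative, taken from the dumps above) of the receiver-side path: a zero LogTerm for an index the follower does not have passes the matchTerm check trivially, and the commit bound then trips the same "tocommit out of range" assertion seen in the crash.

```go
package main

import "fmt"

// followerLog keeps only what the illustration needs.
type followerLog struct {
	lastIdx   uint64            // the follower's last log index (427 in this failure)
	terms     map[uint64]uint64 // index -> term for entries the follower has
	committed uint64
}

// term mirrors the behaviour discussed above: an index the follower does not
// have yields term 0 rather than an error.
func (l *followerLog) term(i uint64) uint64 {
	if i > l.lastIdx {
		return 0
	}
	return l.terms[i]
}

// matchTerm compares the message's LogTerm with the local term; a zero
// LogTerm for an out-of-range index matches trivially (0 == 0).
func (l *followerLog) matchTerm(i, term uint64) bool { return l.term(i) == term }

// maybeAppend follows the shape of the receiver-side handling: once matchTerm
// passes, the commit index advances to min(commit, last new index), and an
// assertion fires if that exceeds the local last index.
func (l *followerLog) maybeAppend(index, logTerm uint64, numEntries int, commit uint64) {
	if !l.matchTerm(index, logTerm) {
		fmt.Println("append rejected: term mismatch") // the rejection we should have taken
		return
	}
	lastNew := index + uint64(numEntries)
	toCommit := commit
	if lastNew < toCommit {
		toCommit = lastNew
	}
	if toCommit > l.lastIdx {
		panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]", toCommit, l.lastIdx))
	}
	l.committed = toCommit
}

func main() {
	follower := &followerLog{lastIdx: 427, terms: map[uint64]uint64{427: 6}, committed: 420}
	// The MsgApp quoted above: Index=489, LogTerm=0, no entries, Commit=660.
	// term(489) is 0, so matchTerm(489, 0) passes; min(660, 489) = 489 > 427, so it panics.
	follower.maybeAppend(489, 0, 0, 660)
}
```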

@pav-kv

pav-kv commented Mar 6, 2023

@nvanbenschoten Could something like this be caused by the async log appends introduced in #94165? We're seeing raftLog.term return 0 (i.e. "not found") for an entry index that is already known to be committed.

@pav-kv

pav-kv commented Mar 6, 2023

Possibly index 489 got truncated on the leader by the time raftLog.term was called. There is a line like this:

I230302 16:20:52.380104 83436 kv/kvserver/raft_log_queue.go:723 ⋮ [T1,n3,raftlog,s3,r101/2:‹/Table/11{1/1/3781…-2}›] 160  should truncate: true [truncate 31 entries to first index 510 (chosen via: last index); log too large (16 MiB > 16 MiB); implies 1 Raft snapshot]

@nvanbenschoten

nvanbenschoten commented Mar 6, 2023

raftLog.maybeAppend checks matchTerm, but the latter silently accepts this 0 term as matching: https://github.com/etcd-io/raft/blob/d9907d6ac6baaebc3c9fd4e67acaa4154d2b3cd3/log.go#L391-L396. Hence we're entering the branch which asserts.

I came to the same conclusion. If it is legitimate for a MsgApp to carry a LogTerm of 0, then this receiver-side code is broken. Is this legitimate when sendIfEmpty = true? That depends on the relationship between a raft leader's own log and the Progress.Next of all of its followers.

If it's not legitimate then the receiver-side code could probably be improved, but it's not the root of the problem. Instead, we'll need to look at the leader and understand whether we're hitting the pr.Next < l.firstIndex() case or the pr.Next - 1 > l.lastIndex() case. @pavelkalinnikov have you seen any indication of which case we're hitting here? We're the raft leader, so presumably the first case.

@pav-kv

pav-kv commented Mar 6, 2023

@nvanbenschoten Yes, likely we're hitting the first case, see the log truncation message above.
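
For concreteness, a tiny Go sketch distinguishing the two boundary cases, using illustrative values taken from the Progress dump and the truncation log line quoted above (not values read directly from the crash):

```go
package main

import "fmt"

func main() {
	const (
		firstIndex uint64 = 510 // leader's first retained index after the truncation above
		lastIndex  uint64 = 821 // leader's last index, from the range status above
		next       uint64 = 428 // follower 1's Progress.Next from the leader's status
	)
	switch {
	case next < firstIndex:
		fmt.Println("case 1: pr.Next < firstIndex; the term/entries at pr.Next-1 were compacted away")
	case next-1 > lastIndex:
		fmt.Println("case 2: pr.Next-1 > lastIndex; the follower would be ahead of the leader's log")
	default:
		fmt.Println("pr.Next-1 is within the leader's log; the term lookup should succeed")
	}
}
```

With these numbers the check prints case 1, matching the log-truncation explanation above.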

@nvanbenschoten

Interesting. When we're dealing with some kind of race with log truncation, do we have any reason to expect raftLog.term to reach the call to storage.Term and return raft.ErrCompacted, vs. only reaching the call to storage.FirstIndex and hitting the problematic return 0, nil path?

@pav-kv

pav-kv commented Mar 6, 2023

I'm struggling to see how we would ever hit the ErrCompacted branch here. I think we would always see this return 0.

raftLog.firstIndex will always return an index:

  1. either (rarely) the index of the unstable snapshot in memory about to be applied,
  2. or (most likely) Storage.FirstIndex for a snapshot already applied to storage.

To get an ErrCompacted from Storage.Term, we must first have seen that raftLog.firstIndex()-1 <= i <= lastIndex, which in cases 1 and 2 means:

  1. unstable snapshot index <= i <= lastIndex, but unstable snapshot index is probably >= stored snapshot index, so we won't get ErrCompacted
  2. Storage.FirstIndex()-1 <= i <= lastIndex, which is again a prereq for not getting ErrCompacted.
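
A simplified, self-contained Go sketch of that layering (a paraphrase of the reasoning above, not the upstream etcd-io/raft code; it models only case 2 and uses illustrative numbers): the raftLog-level range check uses the same first index the storage would report, so a compacted index is filtered out and surfaces as (0, nil) before the Storage.Term call is ever reached.

```go
package main

import (
	"errors"
	"fmt"
)

var errCompacted = errors.New("requested index is unavailable due to compaction")

// storage stands in for the stable log storage; everything below firstIdx has
// been truncated away.
type storage struct {
	firstIdx, lastIdx uint64
	terms             map[uint64]uint64
}

// Term can only return errCompacted if the caller's range check lets a
// compacted index through, which the raftLog-level check below never does.
func (s *storage) Term(i uint64) (uint64, error) {
	if i < s.firstIdx-1 {
		return 0, errCompacted
	}
	return s.terms[i], nil
}

// raftLog wraps the storage; in this sketch firstIndex comes straight from
// the storage (case 2 above), ignoring the unstable-snapshot case.
type raftLog struct{ store *storage }

func (l *raftLog) firstIndex() uint64 { return l.store.firstIdx }
func (l *raftLog) lastIndex() uint64  { return l.store.lastIdx }

// term filters out-of-range indexes, including anything already compacted,
// and reports them as (0, nil): the path this issue keeps running into.
func (l *raftLog) term(i uint64) (uint64, error) {
	if i < l.firstIndex()-1 || i > l.lastIndex() {
		return 0, nil
	}
	return l.store.Term(i)
}

func main() {
	s := &storage{firstIdx: 510, lastIdx: 821, terms: map[uint64]uint64{509: 6, 510: 6}}
	l := &raftLog{store: s}

	fmt.Println(l.term(489)) // 0 <nil>: compacted index, but ErrCompacted never surfaces
	fmt.Println(l.term(510)) // 6 <nil>: in-range index resolves normally
}
```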

@erikgrinaker erikgrinaker changed the title roachtest: sysbench/oltp_write_only/nodes=3/cpu=32/conc=256 failed roachtest: sysbench/oltp_write_only/nodes=3/cpu=32/conc=256 failed [tocommit out of range] Mar 7, 2023
@pav-kv

pav-kv commented Mar 7, 2023

Culprit: etcd-io/raft@42419da

I haven't been able to repro this panic yet, but I found another one caused by this commit (bisected to verify). I'm trying to repro this one too; it seems to have the same underlying cause: sending a zero LogTerm after log truncation.

@pav-kv

pav-kv commented Mar 7, 2023

Found a repro; working upstream on a fix: etcd-io/raft#31.

@nvanbenschoten

@pavelkalinnikov nice find! Could you explain why etcd-io/raft@42419da is the culprit? We've primarily been looking at raftLog.term and that seems broken independent of your change. Do we need to fix this return 0, nil path?

@pav-kv

pav-kv commented Mar 7, 2023

@nvanbenschoten In etcd-io/raft#31 there is a test that simulates the behaviour in this issue: a Raft log truncation plus a bit of slowness on a follower, so that truncation on the leader overtakes the append flow to the follower. I bisected, and this test starts panicking (with the same message) right at the culprit commit that I linked.

The reason my change broke it: previously we unconditionally called raftLog.entries, which would return ErrCompacted; after my change there is the Inflights.Full() case in which we skip fetching entries, so we never see the ErrCompacted and proceed to send the zero term.

Yes, broadly speaking we need to fix or work around the return 0, nil path.
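
A self-contained Go toy model of that explanation (again, not the upstream etcd-io/raft code; names and numbers are illustrative): with the entries fetch in place, a compacted Next surfaces as ErrCompacted and the leader falls back to a snapshot; with the fetch skipped when the in-flight window is full, no error surfaces and a MsgApp with LogTerm 0 goes out instead.

```go
package main

import (
	"errors"
	"fmt"
)

var errCompacted = errors.New("requested index is unavailable due to compaction")

// leaderLog holds just enough state for the illustration: everything below
// firstIdx has been truncated away.
type leaderLog struct{ firstIdx, lastIdx uint64 }

// term reports 0 (with no error) for an out-of-range index, as discussed above.
func (l *leaderLog) term(i uint64) (uint64, error) {
	if i < l.firstIdx-1 || i > l.lastIdx {
		return 0, nil
	}
	return 6, nil
}

// entries is where a compacted index used to be caught.
func (l *leaderLog) entries(lo uint64) ([]uint64, error) {
	if lo < l.firstIdx {
		return nil, errCompacted
	}
	return []uint64{lo}, nil
}

// maybeSendAppend sketches the sender-side control flow and returns a
// description of what would be sent.
func maybeSendAppend(l *leaderLog, next uint64, inflightsFull bool) string {
	term, errTerm := l.term(next - 1)

	var ents []uint64
	var errEnts error
	if !inflightsFull { // after the culprit commit, the fetch is skipped when saturated
		ents, errEnts = l.entries(next)
	}
	if errTerm != nil || errEnts != nil {
		return "send snapshot" // the fallback that used to catch the compacted case
	}
	return fmt.Sprintf("send MsgApp{Index: %d, LogTerm: %d, Entries: %d}", next-1, term, len(ents))
}

func main() {
	// Numbers loosely follow the issue: log truncated to first index 510, the
	// follower's Next still pointing below that.
	l := &leaderLog{firstIdx: 510, lastIdx: 821}
	fmt.Println(maybeSendAppend(l, 490, false)) // entries fetch runs -> ErrCompacted -> snapshot
	fmt.Println(maybeSendAppend(l, 490, true))  // fetch skipped -> MsgApp{Index: 489, LogTerm: 0}
}
```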

craig bot pushed a commit that referenced this issue Mar 16, 2023
98574: sql: support tenant configuration templates r=stevendanna,ecwall a=knz

Fixes #98573.
Epic: CRDB-23559
First commit from #98726.

This change introduces the LIKE clause to CREATE TENANT, which makes CREATE TENANT copy the parameters (but not the storage keyspace) from the tenant selected by LIKE.

Also, if LIKE is not specified but the (new) cluster setting `sql.create_tenant.default_template` is not empty, the value of the cluster setting is used implicitly as the LIKE clause.

A proposed use of this is cluster-to-cluster replication, considering
cutover as well. On the target (sink) cluster, the operator would do:

```
CREATE TENANT application LIKE app_template
  FROM REPLICATION OF application ON ....
```

And then cutover would look something like the following, if they wanted the tenant to still be named "application":

```
ALTER TENANT application CUTOVER TO LATEST;
DROP TENANT application; -- if there's one already
ALTER TENANT application START SERVICE SHARED;
```

Release note: None

98721: go.mod: bump etcd-io/raft to 5fe1c31 r=tbg a=pavelkalinnikov

Fixes #97926
Epic: none
Release note (bug fix): fixed a rare panic in upstream etcd-io/raft when message appends race with log compaction

98747: kvserver: deflake TestReplicaProbeRequest r=pavelkalinnikov a=tbg

When we ignored an ambiguous result but the probe didn't actually happen,
a later condition in the test would fail.

Retry the probe on ambiguous results instead; the test already only
expects the probe to happen "at least once", so we don't introduce
any new issues should a successful probe end up being retried.

Fixes #97136.

Epic: none
Release note: None


Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@craig craig bot closed this as completed in 6e4dd83 Mar 16, 2023