kv/kvserver: TestReplicateQueueRebalance failed [non-zero HardState.Commit on uninitialized] #77030
Comments
@sumeerbhola @tbg This seems like it could be related to the recent Raft changes.
Here are the log statements pertaining to range r1 on n1,s1. The failure happens due to
@tbg do you have any thoughts?
Thanks for having a look @sumeerbhola, we'll pick this up.
Got a couple of repros with
Gonna also go through the logs to see if I can understand any of this. Edit: trimmed logs
First of all, why don't we get all stack traces? Is GOTRACEBACK not set correctly? I don't see it here, but I'm also not sure how it works now with Bazel. cc @rickystewart
I think the history makes sense, with the exception of the crash. r1 is on n1 initially (makes sense), then it gets moved off and the replica is destroyed (makes sense). It then gets re-added as a learner (why not), and now things get weird because it receives a
Bisecting is a bit iffy since it sometimes passes for hundreds of runs. But I'll keep at it.
Bisected to 1bb58e5.
@tbg, I see raft.newRaft calling raft.Storage.InitialState, which would make sense. Wouldn't that propagate into the Ready.HardState? Which is why my earlier guess was that we hadn't properly cleaned up the previous |
Good point, let me take a quick look before I head out. It is suspicious that we're seeing cockroach/pkg/kv/kvserver/store_remove_replica.go Lines 148 to 150 in d35cf75
but not the logging inside of cockroach/pkg/kv/kvserver/store_remove_replica.go Lines 177 to 182 in d35cf75
I was hoping to find a hole, where we unlink cockroach/pkg/kv/kvserver/store_create_replica.go Lines 82 to 83 in d35cf75
if it were to see the |
This is a temporary change, until we can find the root cause of cockroachdb#77030. Release justification: Temporary bug workaround. Release note: None
It isn't HardState from the old replica based on running
77254: sql/importer: use desc's PK ID to sort KVs by index r=dt a=dt
Release note: none.
Fixes: #76196.
Release justification: low-risk bug fix.
77278: kvserver: set kv.raft_log.loosely_coupled_truncation.enabled to false r=erikgrinaker a=sumeerbhola
This is a temporary change, until we can find the root cause of #77030.
Release justification: Temporary bug workaround.
Release note: None
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
The instrumentation is in sumeerbhola@7db8067. Since there is a log statement for every call to StateLoader.SetHardState, we are not setting any HardState between the last
Hmm, I don't think the following code in
sumeerbhola@7edb8f4 has been running for ~5000 runs on gceworker without failure. Now at ~19,000 successful runs over ~8 hours.
Uh, yeah, that missing
So with the mutex gap the following could happen:
I still have to verify 6., but it seems like an obvious problem if there is a bogus truncated state on an uninitialized replica.
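The kind of mutex gap described here is a check-then-act race. Below is a minimal Go sketch of the pattern, not the actual CockroachDB code: the `replica` type, its `destroyed` and `truncatedIndex` fields, and both truncate methods are hypothetical names made up for illustration. The first variant checks liveness, drops the lock, and only then writes, so a removal that lands in between still gets a truncated state written afterwards. The second variant holds the lock across both steps.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// replica is a hypothetical stand-in for a replica; the field names are
// illustrative and are not taken from CockroachDB.
type replica struct {
	mu             sync.Mutex
	destroyed      bool
	truncatedIndex uint64
}

var errDestroyed = errors.New("replica destroyed")

// truncateWithGap checks liveness, releases the lock, and then writes.
// The inGap hook runs between the check and the write to simulate a
// concurrent removal.
func (r *replica) truncateWithGap(index uint64, inGap func()) error {
	r.mu.Lock()
	dead := r.destroyed
	r.mu.Unlock()
	if dead {
		return errDestroyed
	}
	if inGap != nil {
		inGap() // the replica may be destroyed here
	}
	r.mu.Lock()
	r.truncatedIndex = index // written even if the replica is now destroyed
	r.mu.Unlock()
	return nil
}

// truncateAtomically holds the lock across the check and the write.
func (r *replica) truncateAtomically(index uint64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.destroyed {
		return errDestroyed
	}
	r.truncatedIndex = index
	return nil
}

// destroy marks the replica as removed and clears its state.
func (r *replica) destroy() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.destroyed = true
	r.truncatedIndex = 0
}

func main() {
	r := &replica{}
	err := r.truncateWithGap(41, r.destroy)
	fmt.Println(err, r.destroyed, r.truncatedIndex)

	s := &replica{}
	s.destroy()
	fmt.Println(s.truncateAtomically(41), s.truncatedIndex)
}
```

In the gap variant the destroyed replica ends up with a leftover truncated index, which is the sort of stale state a later replica for the same range could pick up.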
When RawNode is initialized, it sets which essentially equals
https://github.dev/etcd-io/etcd/blob/8ac44ffa5fcccc7928876be4682c07f50b5e3b7e/raft/raft.go#L323
https://github.dev/etcd-io/etcd/blob/8ac44ffa5fcccc7928876be4682c07f50b5e3b7e/raft/log.go#L65-L76
and cockroach/pkg/kv/kvserver/replica_raftstorage.go Lines 62 to 71 in ac7e0b6
Now when the leaseholder probes us, we are going to respond saying that our logs start at index 41. The leader will thus send us an append(idx=[42, 43, ..., 100], committed=100) (or something like that), i.e. entries we shouldn't be receiving yet. I just think that the moving
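To make the effect of a bogus truncated state concrete, here is a small Go model of the initialization. It is a sketch under assumptions, not the etcd/raft or CockroachDB code: the `storage` and `raftLog` types are hypothetical, and it assumes (as I read the linked log setup) that the commit pointer starts at `firstIndex - 1` and that an empty log reports the truncated index as its last index. The value 41 is borrowed from the example above.

```go
package main

import "fmt"

// storage is a hypothetical, stripped-down stand-in for a raft storage that
// holds no log entries and is described only by its truncated index.
type storage struct {
	truncatedIndex uint64
}

// firstIndex is the first entry that could still be present in the log.
func (s storage) firstIndex() uint64 { return s.truncatedIndex + 1 }

// lastIndex falls back to the truncated index because the log is empty.
func (s storage) lastIndex() uint64 { return s.truncatedIndex }

// raftLog keeps only the two pointers relevant to this discussion.
type raftLog struct {
	committed uint64
	lastIndex uint64
}

// newLog models the assumed initialization: the commit pointer starts at
// the last compaction point, i.e. firstIndex-1.
func newLog(s storage) raftLog {
	return raftLog{committed: s.firstIndex() - 1, lastIndex: s.lastIndex()}
}

func main() {
	clean := newLog(storage{})                   // truly uninitialized replica
	bogus := newLog(storage{truncatedIndex: 41}) // leftover truncated state
	fmt.Println(clean.committed, clean.lastIndex)
	fmt.Println(bogus.committed, bogus.lastIndex)
}
```

Under these assumptions, a clean uninitialized replica starts with a zero commit index, while one with a leftover truncated state starts out claiming a committed prefix it never received, which matches the non-zero HardState.Commit in the issue title.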
…77328 #77335
75751: sql: Add DateStyle/IntervalStyle visitor r=e-mbrown a=e-mbrown
The DateStyle visitor allows for cast expressions with string to interval and date/interval types to string cast to be rewritten. These stable cast cause issues with DateStyle/IntervalStyle formatting so they need to be wrapped in builtins containing their immutable version.
Release note: None
Release justification: Low risk update to new functionality
76705: backupccl: add prototype metadata.sst r=rhu713 a=rhu713
This adds writing of an additional file to the completion of BACKUP. This new file is an sstable that contains the same metadata currently stored in the BACKUP_MANIFEST file and statistics files, but organizes that data differently.
The current BACKUP_MANIFEST file contains a single binary-encoded protobuf message of type BackupManifest, that in turn has several fields some of which are repeated to contain e.g. the TableDescriptor for every table backed up, or every revision to every table descriptor backed up. This can result in these manifests being quite large in some cases, which is potentially concerning because as a single protobuf message, one has to read and unmarshal the entire struct into memory to read any field(s) of it.
Organizing this metadata into an SSTable where repeated fields are instead stored as separate messages under separate keys should instead allow reading it incrementally: one can seek to a particular key or key prefix and then scan, acting on whatever data is found as it is read, without loading the entire file at once (when opened using the same seek-ing remote SST reader we use to read backup data ssts).
This initial prototype adds only the writer -- RESTORE does not rely on, or even open, this new file at this time.
Release note: none.
77018: release: automate orchestration version update r=celiala a=rail
Previously, as a part of the release process we had to bump the orchestration versions using `sed` with some error-prone regexes. This patch adds `set-orchestration-version` subcommand to the release tool. It uses templates in order to generate the orchestration files.
Release note: None
77055: sql: change index backfill merger to use batch api r=rhu713 a=rhu713
Use Batch API instead of txn.Scan() in order to limit the number of bytes per batch response in the index backfill merger.
Fixes #76685.
Release note: None
77065: bazel: use test sharding more liberally r=rail a=rickystewart
Closes #76376.
Release note: None
77109: ccl/sqlproxyccl: add helpers related to connection migration r=JeffSwenson,andy-kimball a=jaylim-crl
#### ccl/sqlproxyccl: add helpers related to connection migration
Informs #76000. Extracted from #76805.
This commit adds helpers related to connection migration. This includes support for retrieving the transfer state through SHOW TRANSFER STATE, as well as deserializing the session through crdb_internal.deserialize_session.
Release note: None
Release justification: Helpers added in this commit are needed for the connection migration work. Connection migration is currently not being used in production, and CockroachCloud is the only user of sqlproxy.
#### ccl/sqlproxyccl: fix math for defaultBufferSize in interceptors
Previously, we incorrectly defined defaultBufferSize as 16K bytes. Note that 2 << 13 is 16K bytes. This commit fixes that behavior to match the original intention of 8K bytes.
Release note: None
Release justification: This fixes an unintentional buglet within the sqlproxy code that was introduced with the interceptors back then. Not having this in means we're using double the memory for each connection within the sqlproxy.
77307: sql: add cluster setting to limit max size of serialized session r=otan,jaylim-crl a=rafiss
fixes #77302
The sql.session_transfer.max_session_size cluster setting can be used to limit the max size of a session that is serialized using crdb_internal.serialize_session.
No release note since this is not a public setting.
Release justification: high priority fix for new functionality.
Release note: None
77318: roachpb: extract keysbase to break some dependencies r=yuzefovich a=yuzefovich
This commit extracts a couple of things out of `roachpb` into new `keysbase` package in order to break the dependency of `util/json` and `sql/inverted` on `roachpb` (which is a part of the effort to clean up the dependencies of `execgen`).
Addresses: #77234.
Release note: None
Release justification: low risk change to clean up the dependencies.
77319: sessiondatapb: move one enum definition into lex package r=yuzefovich a=yuzefovich
This commit moves the definition of `BytesEncodeFormat` enum from `sessiondatapb` to `lex`. This is done in order to make `lex` not depend on a lot of stuff (eventually on `roachpb`) and is a part of the effort to clean up the dependencies of `execgen`. Note that the proto package name is not changed, so this change is backwards-compatible.
Informs: #77234.
Release note: None
Release justification: low risk change to clean up the dependencies.
77328: roachtest: log stdout and stderr in sstable corruption test r=itsbilal a=nicktrav
To aid in debugging #77321, log the contents stdout and stderr if the manifest dump command fails.
Release justification: Tests only.
Release note: None.
77335: kvserver: fix race that caused truncator to truncate non-alive replica r=tbg,erikgrinaker a=sumeerbhola
This was causing truncated state to be written to such a replica, which would then get picked up as the HardState.Commit value when a different replica was later added back for the same range. See #77030 (comment) for the detailed explanation.
Also restore the default value of kv.raft_log.loosely_coupled_truncation.enabled to true.
Fixes #77030
Release justification: Bug fix.
Release note: None
Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
kv/kvserver.TestReplicateQueueRebalance failed with artifacts on master @ 94e64ce989a374e796cfaad95cb34b3702b9a6e2:
Jira issue: CRDB-13375