roachtest: cdc/crdb-chaos/rangefeed=true failed [skipped] #35974
SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog
"descriptor not found" must have come out of the
But this is only called when we get a kv that has changed and it's called with the table ID and mvcc timestamp of the kv, so it's hard to imagine that we're actually missing that table descriptor. This is happening when a node is going down, so it's not hard to imagine that it's another missing retry. |
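To make the shape of that call concrete, here is a minimal Go sketch. All names here are hypothetical stand-ins, not the actual changefeed code; the point is that the lookup receives a table ID and MVCC timestamp taken from a kv that was just read, so a miss should be impossible unless a transient condition (like a node going down) isn't being retried.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errDescriptorNotFound mirrors the error text from the test logs.
var errDescriptorNotFound = errors.New("descriptor not found")

// tableHistory is a hypothetical stand-in for the changefeed's cache of
// table descriptors over time.
type tableHistory struct {
	descs map[int64]string // tableID -> descriptor name (simplified)
}

// descriptorAt is called with the table ID and MVCC timestamp of a kv
// that was just observed to change, which is why a miss here is
// surprising: the kv's existence implies the descriptor existed at ts.
func (h *tableHistory) descriptorAt(
	ctx context.Context, tableID int64, ts int64,
) (string, error) {
	desc, ok := h.descs[tableID]
	if !ok {
		return "", errDescriptorNotFound
	}
	return desc, nil
}

func main() {
	h := &tableHistory{descs: map[int64]string{53: "t"}}
	_, err := h.descriptorAt(context.Background(), 54, 0)
	fmt.Println(err) // descriptor not found
}
```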
SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/a200cea4368ec90aaee12337d7ab5f9ca555108f
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191939&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/6cac063ae1cb578130afbafb2abf4035268a10c9
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194308&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/5a746073c3f8ede851f37dd895cf1a91d6dcc3cf
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195714&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/c59f5347d5424edb90575fb0fd50bad677953752
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1195732&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/7bc9ea5fbe0c0082fdcfd408245a79c62b00edd4
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1197065&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/ec89b45cea7e8a6dd92a9bfd60e0cc06842e06d8
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1197083&tab=buildLog
changefeedccl: switch high-level retry marker from whitelist to blacklist

In the roachtests for crdb-chaos and sink-chaos we're seeing changefeeds fail with surprising errors:

[NotLeaseHolderError] r681: replica (n1,s1):1 not lease holder; replica (n2,s2):2 is

descriptor not found

We'd like to avoid failing a changefeed unnecessarily, so when an error bubbles up to the top level, we'd like to retry the distributed flow if possible.

We initially tried to whitelist which errors should cause the changefeed to retry, but this turns out to be brittle, so this commit switches to a blacklist. Any error that is expected to be permanent is now marked with `MarkTerminalError` by the time it comes out of `distChangefeedFlow`. Everything else should be logged loudly and retried.

Touches cockroachdb#35974
Touches cockroachdb#36019

Release note: None
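To make the mechanics concrete, here is a minimal sketch of the blacklist pattern. `MarkTerminalError` and `distChangefeedFlow` are named in the commit, but the wrapper type and retry loop below are illustrative guesses, not the actual changefeedccl implementation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// terminalError marks errors that should permanently fail the changefeed.
// Under a blacklist, everything *not* marked this way is retried.
type terminalError struct{ wrapped error }

func (e *terminalError) Error() string { return e.wrapped.Error() }
func (e *terminalError) Unwrap() error { return e.wrapped }

// MarkTerminalError is the marker named in the commit message; this
// wrapper-based implementation is a guess at the mechanics.
func MarkTerminalError(err error) error { return &terminalError{wrapped: err} }

func isTerminalError(err error) bool {
	var t *terminalError
	return errors.As(err, &t)
}

// runWithRetries stands in for the loop around distChangefeedFlow:
// terminal errors fail the feed, everything else is logged loudly and
// the distributed flow is restarted.
func runWithRetries(flow func() error) error {
	for {
		err := flow()
		if err == nil || isTerminalError(err) {
			return err
		}
		fmt.Printf("changefeed flow failed, retrying: %v\n", err)
		time.Sleep(time.Second) // the real loop would use backoff
	}
}

func main() {
	attempts := 0
	err := runWithRetries(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("r681: replica (n1,s1):1 not lease holder")
		}
		return MarkTerminalError(errors.New("changefeed dropped"))
	})
	fmt.Println(err) // terminal after two retries
}
```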
SHA: https://github.com/cockroachdb/cockroach/commits/25398c010b2af75b11fed189680ea6b9645f0cf5
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199659&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/7f8a0969e8e9eb7e9fc0d2fe96e03849d30dd561
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199677&tab=buildLog
36132: changefeedccl: switch high-level retry marker from whitelist to blacklist r=nvanbenschoten a=danhhz

In the roachtests for crdb-chaos and sink-chaos we're seeing changefeeds fail with surprising errors:

[NotLeaseHolderError] r681: replica (n1,s1):1 not lease holder; replica (n2,s2):2 is

descriptor not found

We'd like to avoid failing a changefeed unnecessarily, so when an error bubbles up to the top level, we'd like to retry the distributed flow if possible.

We initially tried to whitelist which errors should cause the changefeed to retry, but this turns out to be brittle, so this commit switches to a blacklist. Any error that is expected to be permanent is now marked with `MarkTerminalError` by the time it comes out of `distChangefeedFlow`. Everything else should be logged loudly and retried.

Touches #35974
Touches #36019

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
SHA: https://github.com/cockroachdb/cockroach/commits/17565100d1e7c66341e6db3e39bb66202958cb81
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204567&tab=buildLog
It does seem odd that the latency of this latest failure was so high, but I'm not entirely surprised that we don't consistently handle recovery from crdb nodes crashing in a prompt way. Definitely worth looking into what happened and whether we can make it more predictable, but the fact that it recovered at all is encouraging.
SHA: https://github.com/cockroachdb/cockroach/commits/3aadd20bbf0940ef65f8b2cdcda498401ba5d9c6
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1206925&tab=buildLog
This last one failed due to the newly added dead node detection. @tbg, should we skip that check for chaos tests? Or should I just throw a …
The chaos runner makes sure that nodes are always restarted; are you using something homegrown here? (See cockroach/pkg/cmd/roachtest/chaos.go, lines 97 to 100 at 875e7ff.)
If you ask me, the dead node detection is just collateral: the test failed because the SQL workload failed, which cancelled the context, and so the chaos runner didn't bother restarting the node. I'm out for today, but if you wanted to make a change to make sure that it always does so, I would appreciate it (should be a two-liner).
Ah, you're right. I saw the "dead node detection" line on a crdb-chaos test and jumped to conclusions. The changefeed failure (which is what failed the test; the workload is run with --tolerate-errors in this test) is yet another example of why we switched to a blacklist for terminal errors. This test does use something homegrown for crdb chaos, which doesn't guarantee the chaos'd node is restarted when the test shuts down. Given that we have a common one, it sounds like switching to it is the answer to my question above.
(Switching SGTM -- but that runner will also exit with a down node if the ctx cancels, so while you're there, just make sure that cockroach/pkg/cmd/roachtest/chaos.go, line 102 at 875e7ff also restarts the node.)
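A minimal sketch of the suggested fix, with stand-in types: `Chaos`, `cluster`, and their methods below are modeled loosely on roachtest's chaos.go, not its exact API. The point is that even when the context is cancelled, the runner restarts the downed node before returning:

```go
package main

import (
	"context"
	"time"
)

// cluster is a stand-in for roachtest's cluster type.
type cluster struct{}

func (c *cluster) Stop(ctx context.Context, node int)  { /* roachprod stop */ }
func (c *cluster) Start(ctx context.Context, node int) { /* roachprod start */ }

type Chaos struct {
	c        *cluster
	Target   func() int
	DownTime time.Duration
}

// Runner kills a node, waits, and restarts it, in a loop. The fix
// discussed above: even when ctx is cancelled (e.g. because the SQL
// workload failed), restart the downed node before returning so the
// dead node detection doesn't fire spuriously at test shutdown.
func (ch *Chaos) Runner(ctx context.Context) error {
	for {
		target := ch.Target()
		ch.c.Stop(ctx, target)

		select {
		case <-ctx.Done():
			ch.c.Start(ctx, target) // the suggested "two-liner"
			return ctx.Err()
		case <-time.After(ch.DownTime):
		}

		ch.c.Start(ctx, target)
	}
}

func main() {}
```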
👍 Will do! (though I doubt I'll get to it today)
SHA: https://github.com/cockroachdb/cockroach/commits/d03a34e92d2ee558fb6aedb0709b733a1fab97f4
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1207666&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/23f9707873abbd2de91a42055535529d7ff296ce
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1209900&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/5921cf0dcc76548931cc85500c0fa2186a82142f
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212185&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/a039a93a5cc6eb3f395ceb6f7dc8030985dccc29
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1212269&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/2851c7d56ee4966109691b5c48c73ec8d4cc9847
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1215354&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/877ebd1ece299b9ee621aa0d091657621593d844
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1215372&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/1cbf3680129e47bd310640bf32b665662f30faa9
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217781&tab=buildLog
changefeedccl: switch retryable errors back to a whitelist

For a while, the cdc/crdb-chaos and cdc/sink-chaos roachtests have been failing because an error that should be marked as retryable wasn't. As a result of the discussion in cockroachdb#35974, I tried switching from a whitelist (retryable error) to a blacklist (terminal error) in cockroachdb#36132, but on reflection this doesn't seem like a great idea. We added a safety net to prevent false negatives from retrying indefinitely, but it was immediately apparent that this meant we needed to tune the retry loop parameters. Better is to just do the due diligence of investigating the errors that should be retried and retrying them.

The commit is intended for backport into 19.1 once it's baked for a bit.

Closes cockroachdb#35974
Closes cockroachdb#36018
Closes cockroachdb#36019
Closes cockroachdb#36432

Release note (bug fix): `CHANGEFEED`s now retry instead of erroring in more situations.
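The whitelist direction inverts the earlier marker: only errors positively identified as transient are wrapped and retried, and anything unrecognized fails the feed. A sketch with illustrative names (`MarkRetryableError` and the string matching are hypothetical, not the actual changefeedccl code):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// retryableError marks errors positively identified as transient.
// Under a whitelist, anything *not* marked this way fails the feed.
type retryableError struct{ wrapped error }

func (e *retryableError) Error() string { return e.wrapped.Error() }
func (e *retryableError) Unwrap() error { return e.wrapped }

// MarkRetryableError is a hypothetical name for the whitelist marker,
// the inverse of MarkTerminalError from the earlier commit.
func MarkRetryableError(err error) error { return &retryableError{wrapped: err} }

func isRetryableError(err error) bool {
	var r *retryableError
	return errors.As(err, &r)
}

// classify is the due-diligence step the commit describes: inspect each
// observed error and whitelist the ones known to be transient. Matching
// on error strings is a simplification for the sketch.
func classify(err error) error {
	msg := err.Error()
	if strings.Contains(msg, "not lease holder") ||
		strings.Contains(msg, "descriptor not found") {
		return MarkRetryableError(err)
	}
	return err
}

func main() {
	err := classify(errors.New("r681: replica (n1,s1):1 not lease holder"))
	fmt.Println(isRetryableError(err)) // true: retry instead of erroring
}
```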
36804: sql/sem/pretty: use left alignment for column names in CREATE r=knz a=knz

Before:

```
CREATE TABLE t (
	name STRING,
	id INT8 NOT NULL PRIMARY KEY
)
```

After:

```
CREATE TABLE t (
	name STRING,
	id INT8 NOT NULL PRIMARY KEY
)
```

[the alignment whitespace that distinguished the two blocks was lost in extraction]

36852: changefeedccl: switch retryable errors back to a whitelist r=nvanbenschoten a=danhhz

For a while, the cdc/crdb-chaos and cdc/sink-chaos roachtests have been failing because an error that should be marked as retryable wasn't. As a result of the discussion in #35974, I tried switching from a whitelist (retryable error) to a blacklist (terminal error) in #36132, but on reflection this doesn't seem like a great idea. We added a safety net to prevent false negatives from retrying indefinitely, but it was immediately apparent that this meant we needed to tune the retry loop parameters. Better is to just do the due diligence of investigating the errors that should be retried and retrying them.

The commit is intended for backport into 19.1 once it's baked for a bit.

Closes #35974
Closes #36018
Closes #36019
Closes #36432

Release note (bug fix): `CHANGEFEED`s now retry instead of erroring in more situations.

36872: coldata: fix Slice when slicing up to batch.Length() r=yuzefovich a=asubiotto

A panic occurred because we weren't treating the end slice index as exclusive, resulting in an out-of-bounds panic when attempting to slice the nulls slice.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
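The coldata fix in that batch is a classic inclusive-vs-exclusive end index bug. In Go terms, roughly (simplified stand-in, not the actual coldata code):

```go
package main

import "fmt"

// sliceNulls slices a nulls slice for rows [start, end).
// The bug: treating end as inclusive (nulls[start : end+1]) panics with
// index out of range when end == len(nulls), i.e. when slicing up to
// batch.Length(). Treating end as exclusive is correct.
func sliceNulls(nulls []bool, start, end int) []bool {
	return nulls[start:end] // exclusive end: safe for end == len(nulls)
}

func main() {
	nulls := make([]bool, 4) // batch.Length() == 4
	fmt.Println(len(sliceNulls(nulls, 2, 4))) // 2
	// The buggy version, nulls[2 : 4+1], would panic here.
}
```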
SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog