Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to crash crdb on DDL using cluster_logical_timestamp() #98269

Closed
sean- opened this issue Mar 9, 2023 · 6 comments · Fixed by #98696
Closed

Ability to crash crdb on DDL using cluster_logical_timestamp() #98269

sean- opened this issue Mar 9, 2023 · 6 comments · Fixed by #98696
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@sean-
Copy link
Collaborator

sean- commented Mar 9, 2023

Describe the problem

Possible to crash crdb using cluster_logical_timestamp() as a DEFAULT expression.

To Reproduce

CREATE TABLE t(i int);
INSERT INTO t VALUES(1);
ALTER TABLE t ADD COLUMN ts decimal NOT NULL DEFAULT cluster_logical_timestamp();

After the node comes back up, the schema appears to be in an inconsistent state.

root@localhost:26257/defaultdb> SELECT * FROM t;
  i
-----
  1
(1 row)
root@localhost:26257/defaultdb> ALTER TABLE t ADD COLUMN ts decimal NOT NULL DEFAULT cluster_logical_timestamp();
ERROR: column "ts" of relation "t" already exists
SQLSTATE: 42701

Environment:

  • CockroachDB version CockroachDB CCL v22.2.5 (aarch64-apple-darwin21.2, built 2023/02/16 16:37:38, go1.19.4)
  • Server OS: Darwin 22.3.0 Darwin Kernel Version 22.3.0: Mon Jan 30 20:38:37 PST 2023; root:xnu-8792.81.3~2/RELEASE_ARM64_T6000 arm64

Jira issue: CRDB-25159

@sean- sean- added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 9, 2023
@sean- sean- changed the title Ability to crash crdb on DDL... Ability to crash crdb on DDL using cluster_logical_timestamp() Mar 9, 2023
@blathers-crl blathers-crl bot added T-sql-schema-deprecated Use T-sql-foundations instead T-sql-queries SQL Queries Team labels Mar 9, 2023
@ajwerner
Copy link
Contributor

The bigger deal is the uncaught panic which crashes the server. Related to #58164. The backfill code itself could catch some panics, but it's orchestrated through distsql, and it seems better to always catch the panics there.

@ajwerner ajwerner added the S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. label Mar 13, 2023
@ajwerner
Copy link
Contributor

ts := ec.Txn.CommitTimestamp()
if ts.IsEmpty() {
panic(errors.AssertionFailedf("zero cluster timestamp in txn"))
}

This code can panic in two places!

@ajwerner
Copy link
Contributor

@Xiang-Gu can you have a look at fixing the proximate problem (nil check in the above code) and then have the caller check for a nil value?

@sean-
Copy link
Collaborator Author

sean- commented Mar 13, 2023

After the panic, I'm unable to perform DDL on the table. For instance:

root@localhost:26257/defaultdb> ALTER TABLE t add COLUMN foo INT;

This just hangs. I suspect there are a few things going on, in addition to the panic.

@ajwerner
Copy link
Contributor

What's actually going on is that the schema change that crashed is still in the running state, but we have exponential backoff to deal with failures like this so it doesn't just keep killing nodes. You should be able to cancel the original job.

@ajwerner
Copy link
Contributor

I think pausing the job before canceling it will help clear the backoff.

Xiang-Gu added a commit to Xiang-Gu/cockroach that referenced this issue Mar 15, 2023
…backfilling

Previously, `ADD COLUMN ... DEFAULT cluster_logical_timestamp()` would
crash the node and leave the table in a corrupt state. The root cause
is a nil pointer dereference. This commit fixed it by returning an
unimplemented error and hence disallow using this builtin function as
default value when backfilling.

Release note (bug fix): fixed a bug as detailed in cockroachdb#98269.
Xiang-Gu added a commit to Xiang-Gu/cockroach that referenced this issue Mar 17, 2023
…backfilling

Previously, `ADD COLUMN ... DEFAULT cluster_logical_timestamp()` would
crash the node and leave the table in a corrupt state. The root cause
is a nil pointer dereference. This commit fixed it by returning an
unimplemented error and hence disallow using this builtin function as
default value when backfilling.

Release note (bug fix): fixed a bug as detailed in cockroachdb#98269.
@craig craig bot closed this as completed in ea389ff Mar 20, 2023
@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Mar 20, 2023
Xiang-Gu added a commit to Xiang-Gu/cockroach that referenced this issue Mar 27, 2023
…backfilling

Previously, `ADD COLUMN ... DEFAULT cluster_logical_timestamp()` would
crash the node and leave the table in a corrupt state. The root cause
is a nil pointer dereference. This commit fixed it by returning an
unimplemented error and hence disallow using this builtin function as
default value when backfilling.

Release note (bug fix): fixed a bug as detailed in cockroachdb#98269.
Release justification: Fixed a bug that can crash node.
Xiang-Gu added a commit to Xiang-Gu/cockroach that referenced this issue Mar 27, 2023
…backfilling

Previously, `ADD COLUMN ... DEFAULT cluster_logical_timestamp()` would
crash the node and leave the table in a corrupt state. The root cause
is a nil pointer dereference. This commit fixed it by returning an
unimplemented error and hence disallow using this builtin function as
default value when backfilling.

Release note (bug fix): fixed a bug as detailed in cockroachdb#98269.
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants