sql: cluster fails to reboot, "no allowed privileges found for system object" after restart with wrong binary #40702

Closed
knz opened this issue Sep 12, 2019 · 4 comments

Comments

@knz
Contributor

knz commented Sep 12, 2019

tldr: if I initialize a cluster with the latest master code, then try to restart with the master code from a few days ago, the cluster becomes hosed.

I think this indicates improper versioning somewhere. There should be guardrails that protect users from making this mistake.
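To make "guardrails" concrete, here is a minimal sketch of a startup check that refuses to run a binary older than the one that last wrote the store. It assumes the store persists a version marker and that the marker is bumped whenever persisted state or the migration set changes; the names `Version`, `IncompatibleBinaryError` and `check_binary_compatible` are hypothetical and are not CockroachDB APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """Hypothetical version triple; field order drives comparison."""
    major: int
    minor: int
    unstable: int = 0  # bumped for every change to persisted state


class IncompatibleBinaryError(RuntimeError):
    """Raised when a binary is older than the state it is asked to open."""


def check_binary_compatible(store_version: Version, binary_version: Version) -> None:
    """Refuse to start if the on-disk state was written by a newer binary."""
    if binary_version < store_version:
        raise IncompatibleBinaryError(
            f"store was last written at {store_version}, "
            f"but this binary only supports up to {binary_version}"
        )


if __name__ == "__main__":
    # Made-up version numbers, for illustration only.
    check_binary_compatible(Version(19, 1, 5), Version(19, 1, 6))  # newer binary: ok
    try:
        check_binary_compatible(Version(19, 1, 6), Version(19, 1, 5))
    except IncompatibleBinaryError as err:
        print("refusing to start:", err)
```

A check like this only helps if the version marker is fine-grained enough: in this report both binaries are builds of master a few days apart, so a guard keyed on the release number alone would not tell them apart.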

Observations

This is a freshly created cluster, created with roachprod:

$ roachprod create knz-aws-medium-h -u knz -c aws --aws-machine-type-ssd=c5d.4xlarge --local-ssd-no-ext4-barrier -n 7
# this installs binary from revision 99bed2718a9b5d1c6b94ce0d65dc40d1e746de16
$ roachprod stage knz-aws-medium-h cockroach 
$ roachprod start knz-aws-medium-h:1-6 --tag node
# (at this point the nodes are started OK)
$ roachprod stop knz-aws-medium-h:1-6 --tag node --sig 15 # notice the graceful stop
# staging a binary from revision 47bb2a58c87fc1259291ec9dde78de3e54bd8a3d
$ roachprod put knz-aws-medium-h:1-6 <my-binary> ./cockroach
$ roachprod start knz-aws-medium-h:1-6 --tag node

At this point some of the nodes fail to start, for example:

$ roachprod status knz-aws-medium-h:1-6
knz-aws-medium-h: status 6/6
   1: not running
   2: not running
   3: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3879
   4: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3920
   5: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3871
   6: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3818
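For scripting around this, a small sketch that picks the down nodes out of the status text shown above. It assumes the output layout is exactly as pasted here (one `N: <status>` line per node); `nodes_not_running` is a hypothetical helper, not part of roachprod.

```python
import re

# Matches per-node lines such as "   1: not running"; the cluster header
# line starts with a name rather than a digit, so it is skipped.
STATUS_LINE = re.compile(r"^\s*(\d+):\s*(.+?)\s*$")

SAMPLE_STATUS = """\
knz-aws-medium-h: status 6/6
   1: not running
   2: not running
   3: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3879
   4: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3920
   5: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3871
   6: cockroach-v19.2.0-alpha.00000000-2955-gda20f35 3818
"""


def nodes_not_running(status_output: str) -> list:
    """Return the node numbers whose status is 'not running'."""
    down = []
    for line in status_output.splitlines():
        match = STATUS_LINE.match(line)
        if match and match.group(2) == "not running":
            down.append(int(match.group(1)))
    return down


if __name__ == "__main__":
    print(nodes_not_running(SAMPLE_STATUS))  # [1, 2]
```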

Relevant log lines on node 1:

F190912 11:03:56.693471 57 server/server.go:1559  [n1] no allowed privileges found for system object with ID=25
failed to run migration "repeat: ensure admin role privileges in all descriptors"
github.com/cockroachdb/cockroach/pkg/sqlmigrations.(*Manager).EnsureMigrations
        /go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:503
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1553
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3.2
        /go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:698
github.com/cockroachdb/cockroach/pkg/cli.runStart.func3
        /go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:813
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1337
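To find the same failure in the other nodes' logs, a sketch that splits a log header line into fields and pulls out the object ID. The field layout (severity letter, date, time, goroutine ID, file:line, bracketed tags, message) is inferred from the single fatal line above and may not cover every line in a real log; `parse_log_line` and `descriptor_id` are hypothetical helpers.

```python
import re

LOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{6}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<goroutine>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\s+"
    r"(?:\[(?P<tags>[^\]]*)\] )?"
    r"(?P<message>.*)$"
)

FATAL = (
    "F190912 11:03:56.693471 57 server/server.go:1559  [n1] "
    "no allowed privileges found for system object with ID=25"
)


def parse_log_line(line: str):
    """Return the header fields as a dict, or None for non-header lines
    such as stack-trace frames."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def descriptor_id(message: str):
    """Extract the integer after 'ID=' from a message, or None if absent."""
    match = re.search(r"\bID=(\d+)", message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    entry = parse_log_line(FATAL)
    print(entry["severity"], entry["tags"], descriptor_id(entry["message"]))
```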

Attached logs for the entire cluster.
init-fail.zip

I also have the data directories if relevant.

After I shut down the remaining nodes, if I try to restart the cluster, other nodes fail to come up, with the same error.

@knz
Contributor Author

knz commented Sep 12, 2019

cc @dt for triage due to the error encountered

@knz
Contributor Author

knz commented Sep 12, 2019

For reference, here's the list of commits between the two revisions:
99bed27 (origin/master, origin/HEAD) Merge #40686
1765f24 Merge #40692
270caf9 Add Céline to AUTHORS
751476c Merge #40685
b531642 Merge #40658 #40673
8f439c4 Merge #40684
f24615a Merge #40644
a37fe87 opt: inline single-use WithExprs
8190ae3 exec: add nulls injection test to runTests harness
acdabf6 Merge #40539
b42bfc7 roachtest: stop skipping jepsen/multi-register
10340b8 roachpb: include sticky bit in RangeDescriptor.String output
bce673e Merge #40583 #40669
c2aa390 roachtest: fix import/tpch roachtest for release-19.1
a8e1ddd storage: don't use caller's context in {Maybe,Add}Async
63b31df opt: fix execbuilder output column count for lookup joins
3593a11 Merge #40615
4146821 Merge #40645
37b76c1 exec: add unit test for AND operator
10193ad exec: make sure that AND operator pays attention to whether a value is null
fc9eb68 Merge #40650
832b2bd Merge #40647
65c5e37 sql: backward compat for ALTER PARTITION OF TABLE
bdf41f7 Merge #40651
5bfc009 Merge #40646
38853eb exec: fix a bug with ON expression support for joiners
7bb6277 rocksdb/table: Always check key ordering when inserting to an SST
5e9c959 ui: update wording of tooltip for transactions chart
4963b60 storage: don't panic on absent store in StorePool.getStoreListFromIDs
c3e6ff4 opt: fix memory corruption causing stack overflow
0f9c8da Merge #40620
99711d9 coldata: bugfix to coldata.Copy with NULL and SelOnDest
78599a1 Merge #40630
10f81dd distsqlrun: avoid flow.Waiting if called after a panic
a41355d Merge #40616
3ad8bda Merge #40626
bdb9b87 sql: fix getPlanColumns for hookFnNode
404cbf9 Merge #40609 #40617 #40619
271afaa Merge #40450
4bcebaf cliccl: Change license acquisition URL
7af579c sql: Fixing show partitions and show ranges to work on not current database
9ccece9 Merge #40439
a46de66 jobs: ensure that async index dropping is needed
ba30aa9 Merge #40625
91bb588 storage: introduce a couple of replication reports
fb6a397 Merge #40593
ae1e1e5 exec: add UnsetNulls to ResetInternalBatch
2f9f441 exec: avoid calling ResetInternalBatch twice
e4da4b3 exec: return a copy of results in aggregator when results overflow
de08130 sql/sem/builtins: fix width_bucket for 0-length arrays
1b75c93 Merge #40607
9834a49 Merge #40429
4de4455 Merge #40618
fa9e621 Merge #40340
8c60e97 exec: minor clean up
9c77823 distsqlrun: make router output respect memory limit setting
a454a44 distsqlrun: make windower respect the memory limits
3256e02 Merge #40595
8a1878a exec: don't template asc vs desc in mj
d4a730c opt: fix scalar building error handling
71f88c3 settings: allow overriding the default for settings
76d6324 settings: fix funky test
55294cf Merge #40610
50de16a Merge #40558
9f20b5c Merge #40523 #40601
7a382a2 exec: explicitly check for nulls in selBoolOp
0f736d4 Merge #40606
67a729b Merge #40464
b93a4ee storage: kv.atomic_replication_changes=true
a533375 storage: don't write nonzero sticky bit before 19.2
3d3f722 roachtest: update 19.1 hibernate blacklist
fac112b storage: manually increment clock in TestTxnRecordLifecycleTransitions
b943d81 storage: fix flake in TestTxnRecordLifecycleTransitions
0d357b7 coldata: respect SelOnDest flag when setting nulls
c3b82b0 Merge #40600 #40603
d20419d storage/engine: return WriteIntentError for intents in uncertainty intervals
df9963b sql: fix wrong representation of SHOW JOBS
300ae45 make: pass TESTFLAGS to roachprod-stress, not GOFLAGS
e82388f opt: fix functional deps and stats for WithScanExpr
746d213 Merge #40582
3db4006 Merge #40193
e0b2f34 exec: add inbox shutdown race test
71f14f3 sqlmigrations: remove ensureMaxPrivileges migration
b3ef24c roachtest: run fewer TPHC imports, increase timeout
de16457 sql: fix a funky zone config in a test
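Since the failing migration lives in pkg/sqlmigrations, one quick triage step is to narrow the list to non-merge commits whose subject carries that area prefix. The sketch below assumes the one-line `<hash> <area>: <subject>` layout used in the list; `commits_by_area` is a hypothetical helper, and a match is only a candidate to inspect, not a confirmed cause.

```python
SAMPLE_LOG = """\
99bed27 (origin/master, origin/HEAD) Merge #40686
65c5e37 sql: backward compat for ALTER PARTITION OF TABLE
bdf41f7 Merge #40651
71f14f3 sqlmigrations: remove ensureMaxPrivileges migration
de16457 sql: fix a funky zone config in a test
"""


def commits_by_area(log_text: str, area: str) -> list:
    """Return (hash, subject) pairs whose subject prefix is `area`
    or a sub-area such as `area/sub`; merge commits are skipped."""
    matches = []
    for line in log_text.splitlines():
        sha, _, subject = line.strip().partition(" ")
        if not sha or subject.startswith("Merge "):
            continue
        prefix = subject.split(":", 1)[0]
        if prefix == area or prefix.startswith(area + "/"):
            matches.append((sha, subject))
    return matches


if __name__ == "__main__":
    for sha, subject in commits_by_area(SAMPLE_LOG, "sqlmigrations"):
        print(sha, subject)
```

Run against the full list above, the `sqlmigrations` prefix matches a single commit, 71f14f3.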

@knz
Contributor Author

knz commented Sep 12, 2019

cc @lucy-zhang as dt is on vacay

@ajwerner
Contributor

I'm closing this as stale.
