-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upgrade: don't blindly run permanent migrations #93071
upgrade: don't blindly run permanent migrations #93071
Conversation
cc @andyyang890 |
169e3c8
to
8ed6a98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 10 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/upgrade/doc.go
line 13 at r1 (raw file):
package upgrade // Package upgrade contains infrastructure for migrating a cluster from a
nit: this reads like a doc comment, why not put it above package upgrade
?
This patch makes the "primordial" permanent upgrades not run in clusters being upgraded from 22.2 or lower. Startup migrations were recently replaced with "permanent upgrades", in cockroachdb#91627. The permanent new upgrades were running on node startup regardless of whether it was a "new cluster" or an existing one - i.e. regardless of whether the previous startupmigrations had also run, on the argument that startupmigrations and upgrades are supposed to be idempotent anyway. But, alas, there's idempotency and then there's idempotency: startupmigrations/upgrades are supposed to be OK with running multiple times at the same BinaryVersion in a "new cluster", but not necessarily across BinaryVersions and in a cluster that has been in use. There are different things that can go wrong when one of these migrations runs multiple times in these general conditions, because the migration might makes implicit assumptions about the schema and data of the cluster it's running in: 1. A migration might assume that all the data in a system table was created before a CRDB version changed some semantics. For example, the migration deleted in cockroachdb#92597 was assuming that `CREATEROLE` role option had a certain meaning for all the users who have it and was doing a backfill across all these users. Running this migration again at an arbitrary later point would affect unintended new users. 2. A migration assumes the schema of system tables it operates on. As this schema (which is part of the bootstrap image for new cluster) changes with new versions, the code of the upgrade changes also. This means that the upgrade is no longer suitable for running against clusters that don't have the upgraded schema - i.e. clusters that have not yet been upgraded to the latest schema. This is a real problem because we're adding a column to system.users, and the addRootUser upgrade can no longer run against 22.2 clusters. This patch guards all upgrades on checking that the old corresponding startupmigration has not run, thus enabling the mentioned system schema change in 23.1. Release note: None Epic: None
Before this patch, if a node encountered a job corresponding to an upgrade for a version that does not have a registered upgrade, it would declare the job to be successful. This is bad; we can't just declare that an upgrade has run when, in fact, it hasn't. This patch turns this condition into a job error. This situation should not arise; it indicates an incompatibility between the binary versions: one node has an upgrade, another one doesn't, and yet both binaries are running at a cluster version above the respective upgrade. Release note: None
8ed6a98
to
90dcb5c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/upgrade/doc.go
line 13 at r1 (raw file):
Previously, ajwerner wrote…
nit: this reads like a doc comment, why not put it above
package upgrade
?
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this! Just had one small style comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @andreimatei)
pkg/upgrade/doc.go
line 73 at r3 (raw file):
bootstrap schema that has "foo". */
nit: there should be no newline between the package comment and the declaration
Build succeeded: |
This patch makes the "primordial" permanent upgrades not run in clusters
being upgraded from 22.2 or lower.
Startup migrations were recently replaced with "permanent upgrades", in #91627.
The permanent new upgrades were running on node startup regardless of
whether it was a "new cluster" or an existing one - i.e. regardless of
whether the previous startupmigrations had also run, on the argument
that startupmigrations and upgrades are supposed to be idempotent
anyway. But, alas, there's idempotency and then there's idempotency:
startupmigrations/upgrades are supposed to be OK with running multiple
times at the same BinaryVersion in a "new cluster", but not necessarily
across BinaryVersions and in a cluster that has been in use. There are
different things that can go wrong when one of these migrations runs
multiple times in these general conditions, because the migration
might makes implicit assumptions about the schema and data of the
cluster it's running in:
created before a CRDB version changed some semantics. For example,
the migration deleted in upgrades: remove upgrade granting CREATELOGIN role opt #92597 was assuming that
CREATEROLE
roleoption had a certain meaning for all the users who have it and was doing
a backfill across all these users. Running this migration again at an
arbitrary later point would affect unintended new users.
this schema (which is part of the bootstrap image for new cluster)
changes with new versions, the code of the upgrade changes also. This
means that the upgrade is no longer suitable for running against
clusters that don't have the upgraded schema - i.e. clusters that have
not yet been upgraded to the latest schema. This is a real problem
because we're adding a column to system.users, and the addRootUser
upgrade can no longer run against 22.2 clusters.
This patch guards all upgrades on checking that the old corresponding
startupmigration has not run, thus enabling the mentioned system schema
change in 23.1.
Release note: None
Epic: None