Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade: don't blindly run permanent migrations #93071

Merged

Conversation

andreimatei
Copy link
Contributor

This patch makes the "primordial" permanent upgrades not run in clusters
being upgraded from 22.2 or lower.

Startup migrations were recently replaced with "permanent upgrades", in #91627.
The permanent new upgrades were running on node startup regardless of
whether it was a "new cluster" or an existing one - i.e. regardless of
whether the previous startupmigrations had also run, on the argument
that startupmigrations and upgrades are supposed to be idempotent
anyway. But, alas, there's idempotency and then there's idempotency:
startupmigrations/upgrades are supposed to be OK with running multiple
times at the same BinaryVersion in a "new cluster", but not necessarily
across BinaryVersions and in a cluster that has been in use. There are
different things that can go wrong when one of these migrations runs
multiple times in these general conditions, because the migration
might makes implicit assumptions about the schema and data of the
cluster it's running in:

  1. A migration might assume that all the data in a system table was
    created before a CRDB version changed some semantics. For example,
    the migration deleted in upgrades: remove upgrade granting CREATELOGIN role opt #92597 was assuming that CREATEROLE role
    option had a certain meaning for all the users who have it and was doing
    a backfill across all these users. Running this migration again at an
    arbitrary later point would affect unintended new users.
  2. A migration assumes the schema of system tables it operates on. As
    this schema (which is part of the bootstrap image for new cluster)
    changes with new versions, the code of the upgrade changes also. This
    means that the upgrade is no longer suitable for running against
    clusters that don't have the upgraded schema - i.e. clusters that have
    not yet been upgraded to the latest schema. This is a real problem
    because we're adding a column to system.users, and the addRootUser
    upgrade can no longer run against 22.2 clusters.

This patch guards all upgrades on checking that the old corresponding
startupmigration has not run, thus enabling the mentioned system schema
change in 23.1.

Release note: None
Epic: None

@andreimatei andreimatei requested review from ajwerner and a team December 5, 2022 20:34
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

cc @andyyang890

@andreimatei andreimatei force-pushed the upgrades.check-startupmigrations branch from 169e3c8 to 8ed6a98 Compare December 8, 2022 20:34
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 10 of 10 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/upgrade/doc.go line 13 at r1 (raw file):

package upgrade

// Package upgrade contains infrastructure for migrating a cluster from a

nit: this reads like a doc comment, why not put it above package upgrade?

This patch makes the "primordial" permanent upgrades not run in clusters
being upgraded from 22.2 or lower.

Startup migrations were recently replaced with "permanent upgrades", in cockroachdb#91627.
The permanent new upgrades were running on node startup regardless of
whether it was a "new cluster" or an existing one - i.e. regardless of
whether the previous startupmigrations had also run, on the argument
that startupmigrations and upgrades are supposed to be idempotent
anyway. But, alas, there's idempotency and then there's idempotency:
startupmigrations/upgrades are supposed to be OK with running multiple
times at the same BinaryVersion in a "new cluster", but not necessarily
across BinaryVersions and in a cluster that has been in use. There are
different things that can go wrong when one of these migrations runs
multiple times in these general conditions, because the migration
might makes implicit assumptions about the schema and data of the
cluster it's running in:
1. A migration might assume that all the data in a system table was
   created before a CRDB version changed some semantics. For example,
   the migration deleted in cockroachdb#92597 was assuming that `CREATEROLE` role
   option had a certain meaning for all the users who have it and was doing
   a backfill across all these users. Running this migration again at an
   arbitrary later point would affect unintended new users.
2. A migration assumes the schema of system tables it operates on. As
   this schema (which is part of the bootstrap image for new cluster)
   changes with new versions, the code of the upgrade changes also. This
   means that the upgrade is no longer suitable for running against
   clusters that don't have the upgraded schema - i.e. clusters that have
   not yet been upgraded to the latest schema. This is a real problem
   because we're adding a column to system.users, and the addRootUser
   upgrade can no longer run against 22.2 clusters.

This patch guards all upgrades on checking that the old corresponding
startupmigration has not run, thus enabling the mentioned system schema
change in 23.1.

Release note: None
Epic: None
Before this patch, if a node encountered a job corresponding to an
upgrade for a version that does not have a registered upgrade, it would
declare the job to be successful. This is bad; we can't just declare
that an upgrade has run when, in fact, it hasn't. This patch turns this
condition into a job error. This situation should not arise; it
indicates an incompatibility between the binary versions: one node has
an upgrade, another one doesn't, and yet both binaries are running at
a cluster version above the respective upgrade.

Release note: None
@andreimatei andreimatei force-pushed the upgrades.check-startupmigrations branch from 8ed6a98 to 90dcb5c Compare December 8, 2022 21:34
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/upgrade/doc.go line 13 at r1 (raw file):

Previously, ajwerner wrote…

nit: this reads like a doc comment, why not put it above package upgrade?

done

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! Just had one small style comment.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @andreimatei)


pkg/upgrade/doc.go line 73 at r3 (raw file):

bootstrap schema that has "foo".
*/

nit: there should be no newline between the package comment and the declaration

@craig
Copy link
Contributor

craig bot commented Dec 8, 2022

Build succeeded:

@craig craig bot merged commit 223f580 into cockroachdb:master Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants