upgrade: don't blindly run permanent migrations #93071

Merged
4 changes: 4 additions & 0 deletions pkg/clusterversion/cockroach_versions.go
@@ -170,6 +170,10 @@ const (
VPrimordial4
VPrimordial5
VPrimordial6
// NOTE(andrei): Do not introduce new upgrades corresponding to VPrimordial
// versions. Old-version nodes might try to run the jobs created for such
// upgrades, but they won't know about the respective upgrade, causing the job
// to succeed without actually performing the update.
VPrimordialMax

// V22_1 is CockroachDB v22.1. It's used for all v22.1.x patch releases.
1 change: 1 addition & 0 deletions pkg/upgrade/BUILD.bazel
@@ -4,6 +4,7 @@ load("@io_bazel_rules_go//go:def.bzl", "go_library")
go_library(
name = "upgrade",
srcs = [
"doc.go",
"helpers.go",
"system_upgrade.go",
"tenant_upgrade.go",
74 changes: 74 additions & 0 deletions pkg/upgrade/doc.go
@@ -0,0 +1,74 @@
// Copyright 2022 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package upgrade contains infrastructure for migrating a cluster from one
version to another. It contains "upgrades" associated with different cluster
versions that are run when the cluster is upgraded: the upgrade for version v
runs when the cluster version is bumped from a version < v to one >= v. The
code for the upgrade corresponding to version v can be deleted once
compatibility with version v is no longer needed.

There is also a special category of upgrades called "permanent upgrades". A
permanent upgrade is associated with a cluster version v (just like the
non-permanent ones) but, in addition to running when the cluster version is
bumped above v, it also runs when the cluster is created at a version >= v.
These upgrades are called "permanent" because their code cannot be deleted as
soon as compatibility with version v is no longer needed. Permanent upgrades
can be used to perform setup that the cluster needs but that cannot be
included in the cluster's bootstrap image. For example, say we have upgrades
associated with the following cluster versions:

v0.0.2 - permanent
v0.0.3 - permanent
v23.1.1
v23.1.5 - permanent
v23.1.7
v23.2.0
v23.2.1 - permanent
v23.2.2
v23.2.3
v24.1.0

and let's further say that we're creating the cluster at version v23.2.0. On
cluster creation, we run all permanent upgrades <= v23.2.0, and none of the
non-permanent ones. Then, say all the nodes are upgraded to
BinaryVersion=v24.1.0 binaries (with automatic version upgrades turned off);
the cluster's logical version remains v23.2.0 and no upgrades run. Once
automatic upgrades are turned on and `SET CLUSTER VERSION 24.1.0` runs in the
background, all permanent and non-permanent upgrades > v23.2.0 and <= v24.1.0
run, in order.

Upgrades are run inside jobs, with one job per upgrade. Only a single job is
ever created for a particular upgrade across the cluster - i.e. nodes
coordinate with one another when creating these jobs. Once an upgrade
finishes, a row is left in system.migrations.

Upgrades need to be idempotent: they might be run multiple times, as jobs can
error out or nodes can crash in the middle of running one of them. However, if
an upgrade has been run successfully by a binary with BinaryVersion=b, it is
not run again by a binary with a different BinaryVersion. This is a useful
guarantee for permanent upgrades, as it allows the code for an upgrade to
change between versions (for example, in response to an updated bootstrap
schema) without needing to worry about making the upgrade work for clusters
being upgraded from older versions. Consider the following example: say v23.1
adds column "foo" to table system.users. This is done by adding the column to
the bootstrap schema, and by creating an upgrade corresponding to version
23.1.<foo> that does the `ALTER TABLE` for old clusters, and probably a
backfill of the table rows. Say a permanent upgrade like `addRootUser()` has
code inserting into system.users. If it were not for this guarantee, the
`addRootUser()` code could not be changed to do:
INSERT INTO system.users (..., foo) ....
If this query were run in a cluster being upgraded from 22.2 to 23.1, it would
fail because the column does not exist yet (the upgrade adding it has not yet
run). The guarantee means that, if addRootUser() runs, it can expect a
bootstrap schema that includes "foo".
*/
package upgrade
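
To make the selection rule in the package comment concrete, here is a minimal,
runnable sketch. The types, string-based version comparison, and function names
are simplifications invented for illustration (the real manager compares
roachpb.Version values and runs upgrades via jobs): on cluster creation only
permanent upgrades at or below the bootstrap version run, while a version bump
runs every upgrade, permanent or not, in the interval (old version, new version].

package main

import "fmt"

// upgradeInfo is a simplified stand-in for upgradebase.Upgrade.
type upgradeInfo struct {
	version   string // e.g. "v23.2.1"; the real code uses roachpb.Version
	permanent bool
}

// less compares two version strings. Plain string comparison is only valid
// because the example versions all have the same digit widths; the real code
// compares the numeric fields of roachpb.Version.
func less(a, b string) bool { return a < b }

// toRun returns the upgrades that fire when the cluster version moves from
// 'from' to 'to'. For cluster creation at version 'to', clusterCreation is
// true and only permanent upgrades <= 'to' are selected.
func toRun(all []upgradeInfo, from, to string, clusterCreation bool) []upgradeInfo {
	var out []upgradeInfo
	for _, u := range all {
		if clusterCreation {
			if u.permanent && !less(to, u.version) {
				out = append(out, u)
			}
			continue
		}
		// On a version bump, every upgrade in (from, to] runs.
		if less(from, u.version) && !less(to, u.version) {
			out = append(out, u)
		}
	}
	return out
}

func main() {
	all := []upgradeInfo{
		{"v23.1.1", false}, {"v23.1.5", true}, {"v23.1.7", false},
		{"v23.2.0", false}, {"v23.2.1", true}, {"v24.1.0", false},
	}
	// Creation at v23.2.0: only the permanent upgrade v23.1.5 is selected.
	fmt.Println(toRun(all, "", "v23.2.0", true))
	// Later bump v23.2.0 -> v24.1.0: v23.2.1 and v24.1.0, permanent or not.
	fmt.Println(toRun(all, "v23.2.0", "v24.1.0", false))
}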
44 changes: 33 additions & 11 deletions pkg/upgrade/migrationstable/migrations_table.go
@@ -27,22 +27,44 @@ import (
func MarkMigrationCompleted(
ctx context.Context, ie sqlutil.InternalExecutor, v roachpb.Version,
) error {
return markMigrationCompletedInner(ctx, ie, v, false /* ignoreExisting */)
}

// MarkMigrationCompletedIdempotent is like MarkMigrationCompleted, except it
// can be called multiple times (or by multiple nodes, concurrently) without
// failing if the respective row already exists. If the row already exists, it
// is not changed.
func MarkMigrationCompletedIdempotent(
ctx context.Context, ie sqlutil.InternalExecutor, v roachpb.Version,
) error {
return markMigrationCompletedInner(ctx, ie, v, true /* ignoreExisting */)
}

func markMigrationCompletedInner(
ctx context.Context, ie sqlutil.InternalExecutor, v roachpb.Version, ignoreExisting bool,
) error {
query := `
INSERT
INTO system.migrations
(
major,
minor,
patch,
internal,
completed_at
)
VALUES ($1, $2, $3, $4, $5)
`
if ignoreExisting {
query += "ON CONFLICT DO NOTHING"
}

_, err := ie.ExecEx(
ctx,
"migration-job-mark-job-succeeded",
nil, /* txn */
sessiondata.NodeUserSessionDataOverride,
`
INSERT
INTO system.migrations
(
major,
minor,
patch,
internal,
completed_at
)
VALUES ($1, $2, $3, $4, $5)`,
query,
v.Major,
v.Minor,
v.Patch,
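
A usage-oriented sketch of the pattern introduced above: appending ON CONFLICT
DO NOTHING is what lets MarkMigrationCompletedIdempotent be called repeatedly,
or concurrently from several nodes. The version below uses the standard
database/sql package against a hypothetical connection and is an illustration
under those assumptions, not the actual implementation (the real code goes
through the InternalExecutor with session-data overrides):

package main

import (
	"context"
	"database/sql"
)

// markMigrationCompleted inserts a completion row for the given version parts.
// With ignoreExisting=true the statement gains ON CONFLICT DO NOTHING, so a
// retry - or a concurrent call from another node - succeeds and leaves the
// existing row unchanged instead of failing on the duplicate key.
func markMigrationCompleted(
	ctx context.Context, db *sql.DB, major, minor, patch, internal int64, ignoreExisting bool,
) error {
	query := `
INSERT INTO system.migrations (major, minor, patch, internal, completed_at)
VALUES ($1, $2, $3, $4, now())`
	if ignoreExisting {
		query += " ON CONFLICT DO NOTHING"
	}
	_, err := db.ExecContext(ctx, query, major, minor, patch, internal)
	return err
}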
9 changes: 5 additions & 4 deletions pkg/upgrade/system_upgrade.go
@@ -140,13 +140,14 @@ func NewSystemUpgrade(description string, v roachpb.Version, fn SystemUpgradeFun
// "permanent": an upgrade that will run regardless of the cluster's bootstrap
// version.
func NewPermanentSystemUpgrade(
description string, v roachpb.Version, fn SystemUpgradeFunc,
description string, v roachpb.Version, fn SystemUpgradeFunc, v22_2StartupMigrationName string,
) *SystemUpgrade {
return &SystemUpgrade{
upgrade: upgrade{
description: description,
v: v,
permanent: true,
description: description,
v: v,
permanent: true,
v22_2StartupMigrationName: v22_2StartupMigrationName,
},
fn: fn,
}
9 changes: 5 additions & 4 deletions pkg/upgrade/tenant_upgrade.go
@@ -107,13 +107,14 @@ func NewTenantUpgrade(
// NewPermanentTenantUpgrade constructs a TenantUpgrade marked as "permanent":
// an upgrade that will run regardless of the cluster's bootstrap version.
func NewPermanentTenantUpgrade(
description string, v roachpb.Version, fn TenantUpgradeFunc,
description string, v roachpb.Version, fn TenantUpgradeFunc, v22_2StartupMigrationName string,
) *TenantUpgrade {
m := &TenantUpgrade{
upgrade: upgrade{
description: description,
v: v,
permanent: true,
description: description,
v: v,
permanent: true,
v22_2StartupMigrationName: v22_2StartupMigrationName,
},
fn: fn,
precondition: nil,
13 changes: 13 additions & 0 deletions pkg/upgrade/upgrade.go
@@ -50,6 +50,11 @@ type upgrade struct {
// baked into the bootstrap image and need to be run on new clusters
// regardless of the cluster's bootstrap version.
permanent bool
// v22_2StartupMigrationName, if set, is the name of the corresponding
// startupmigration in 22.2. In 23.1, we've turned these startupmigrations
// into permanent upgrades. We don't want to run the upgrade if the
// corresponding startupmigration has already run.
v22_2StartupMigrationName string
}

// Version is part of the upgradebase.Upgrade interface.
@@ -66,3 +71,11 @@ func (m *upgrade) Permanent() bool {
func (m *upgrade) Name() string {
return fmt.Sprintf("Upgrade to %s: %q", m.v.String(), m.description)
}

// V22_2StartupMigrationName is part of the upgradebase.Upgrade interface.
func (m *upgrade) V22_2StartupMigrationName() string {
if !m.permanent {
panic("V22_2StartupMigrationName() called on non-permanent upgrade.")
}
return m.v22_2StartupMigrationName
}
4 changes: 4 additions & 0 deletions pkg/upgrade/upgradebase/upgrade.go
@@ -52,4 +52,8 @@ type Upgrade interface {
// not baked into the bootstrap image and need to be run on new cluster
// regardless of the cluster's bootstrap version.
Permanent() bool
// V22_2StartupMigrationName represents the name of the old pre-23.1
// startupmigration corresponding to this permanent upgrade. Calling this on
// non-permanent upgrades panics.
V22_2StartupMigrationName() string
}
8 changes: 1 addition & 7 deletions pkg/upgrade/upgradejob/upgrade_job.go
@@ -77,13 +77,7 @@ func (r resumer) Resume(ctx context.Context, execCtxI interface{}) error {
mc := execCtx.MigrationJobDeps()
m, ok := mc.GetUpgrade(v)
if !ok {
// TODO(ajwerner): Consider treating this as an assertion failure. Jobs
// should only be created for a cluster version if there is an associated
// upgrade. It seems possible that an upgrade job could be launched by
// a node running a older version where an upgrade then runs on a job
// with a newer version where the upgrade has been re-ordered to be later.
// This should only happen between alphas but is theoretically not illegal.
return nil
return errors.AssertionFailedf("found job for unknown upgrade at version: %s", v)
}
switch m := m.(type) {
case *upgrade.SystemUpgrade:
60 changes: 60 additions & 0 deletions pkg/upgrade/upgrademanager/manager.go
@@ -216,10 +216,53 @@ func (m *Manager) RunPermanentUpgrades(ctx context.Context, upToVersion roachpb.
return err
}
if lastUpgradeCompleted {
log.Infof(ctx,
"detected the last permanent upgrade (v%s) to have already completed; no permanent upgrades will run",
lastVer)
return nil
} else {
// The stale read said that the upgrades had not run. Let's try a
// consistent read too, since the stale read might be out of date.
log.Infof(ctx,
"the last last permanent upgrade (v%s) does not appear to have completed; attempting to run all upgrades",
lastVer)
}

for _, u := range permanentUpgrades {
// Check whether a 22.2 or older node has run the old startupmigration
// corresponding to this upgrade. If it has, we don't want to run this
// upgrade, for two reasons:
// 1. Creating a job for this upgrade would be dubious. If the job were to
// be adopted by a 22.2 node, that node would fail to find an upgrade for
// the respective version and would declare the job successful even though
// the upgrade didn't run. Even though that would technically be OK, since
// the existence of a 22.2 node running the job implies that the
// startupmigration had been run, it's confusing at the very least.
// 2. The upgrade can fail if the corresponding startupmigration had already
// run. This is because the upgrade assumes that the cluster was
// bootstrapped at the current binary version, with the current schema for
// system tables. See the discussion in upgrade/doc.go for why that is.
//
// TODO(andrei): Get rid of this once compatibility with 22.2 is not necessary.
startupMigrationAlreadyRan, err := checkOldStartupMigrationRan(
ctx, u.V22_2StartupMigrationName(), m.deps.DB, m.codec)
if err != nil {
return err
}
if startupMigrationAlreadyRan {
log.Infof(ctx,
"skipping permanent upgrade for v%s because the corresponding startupmigration "+
"was already run by a v22.2 or older node",
u.Version())
// Mark the upgrade as completed so that we can get rid of this logic when
// compatibility with 22.2 is no longer necessary.
if err := migrationstable.MarkMigrationCompletedIdempotent(ctx, m.ie, u.Version()); err != nil {
return err
}
continue
}

log.Infof(ctx, "running permanent upgrade for version %s", u.Version())
if err := m.runMigration(ctx, u, user, u.Version(), !m.knobs.DontUseJobs); err != nil {
return err
@@ -228,6 +271,23 @@ func (m *Manager) RunPermanentUpgrades(ctx context.Context, upToVersion roachpb.
return nil
}

// checkOldStartupMigrationRan checks whether this is a cluster upgraded from
// a pre-23.1 version in which the old startupmigration with the given name
// has already run. If it has, the corresponding upgrade should not run.
func checkOldStartupMigrationRan(
ctx context.Context, migrationName string, db *kv.DB, codec keys.SQLCodec,
) (bool, error) {
if migrationName == "" {
return false, nil
}
migrationKey := append(codec.StartupMigrationKeyPrefix(), roachpb.RKey(migrationName)...)
kv, err := db.Get(ctx, migrationKey)
if err != nil {
return false, err
}
return kv.Exists(), nil
}

// runPermanentMigrationsWithoutJobsForTests runs all permanent migrations up to
// VPrimordialMax. They are run without jobs, in order to minimize the side
// effects left on the cluster.
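
For reference, a self-contained sketch of the contract of
checkOldStartupMigrationRan, with a map standing in for the KV store and a
made-up "/migrations/" prefix (both are assumptions for illustration; the real
helper issues a db.Get on codec.StartupMigrationKeyPrefix() followed by the
migration name):

package main

import "fmt"

// fakeKV stands in for the startupmigrations keyspace: in 22.2, each
// completed startupmigration left a key named "<prefix><migration name>".
type fakeKV map[string]bool

// checkOldStartupMigrationRan mirrors the real helper's contract: an empty
// name means the upgrade has no 22.2 startupmigration counterpart, so the
// answer is false without consulting the store.
func checkOldStartupMigrationRan(kv fakeKV, prefix, migrationName string) bool {
	if migrationName == "" {
		return false
	}
	return kv[prefix+migrationName]
}

func main() {
	kv := fakeKV{"/migrations/enable diagnostics reporting": true}
	fmt.Println(checkOldStartupMigrationRan(kv, "/migrations/", "enable diagnostics reporting")) // true: skip the upgrade
	fmt.Println(checkOldStartupMigrationRan(kv, "/migrations/", "create default databases"))     // false: run the upgrade
	fmt.Println(checkOldStartupMigrationRan(kv, "/migrations/", ""))                             // false: no 22.2 counterpart
}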
19 changes: 0 additions & 19 deletions pkg/upgrade/upgrades/permanent_upgrades.go
@@ -66,25 +66,6 @@ func optInToDiagnosticsStatReporting(
if cluster.TelemetryOptOut() {
return nil
}

// Check whether this is a cluster upgraded from a pre-23.1 version and the
// old startupmigration dealing with enabling diagnostics has already run. If
// it did, there's nothing for us to do - in particular, we don't want to
// enable diagnostics if the cluster was created with the TelemetryOptOut()
// env var.
{
codec := deps.Codec
oldMigrationName := "enable diagnostics reporting"
migrationKey := append(codec.StartupMigrationKeyPrefix(), roachpb.RKey(oldMigrationName)...)
kv, err := deps.DB.Get(ctx, migrationKey)
if err != nil {
return err
}
if kv.Exists() {
return nil
}
}

_, err := deps.InternalExecutor.Exec(
ctx, "optInToDiagnosticsStatReporting", nil, /* txn */
`SET CLUSTER SETTING diagnostics.reporting.enabled = true`)
8 changes: 8 additions & 0 deletions pkg/upgrade/upgrades/upgrades.go
@@ -57,33 +57,41 @@ var upgrades = []upgradebase.Upgrade{
"add users and roles",
toCV(clusterversion.VPrimordial1),
addRootUser,
// v22_2StartupMigrationName - this upgrade corresponds to 3 old
// startupmigrations, of which "make root a member..." is the last one.
"make root a member of the admin role",
),
upgrade.NewPermanentTenantUpgrade(
"enable diagnostics reporting",
toCV(clusterversion.VPrimordial2),
optInToDiagnosticsStatReporting,
"enable diagnostics reporting", // v22_2StartupMigrationName
),
upgrade.NewPermanentSystemUpgrade(
"populate initial version cluster setting table entry",
toCV(clusterversion.VPrimordial3),
populateVersionSetting,
"populate initial version cluster setting table entry", // v22_2StartupMigrationName
),
upgrade.NewPermanentTenantUpgrade(
"initialize the cluster.secret setting",
toCV(clusterversion.VPrimordial4),
initializeClusterSecret,
"initialize cluster.secret", // v22_2StartupMigrationName
),
upgrade.NewPermanentTenantUpgrade(
"update system.locations with default location data",
toCV(clusterversion.VPrimordial5),
updateSystemLocationData,
"update system.locations with default location data", // v22_2StartupMigrationName
),
// Introduced in v2.1.
// TODO(knz): bake this migration into v19.1.
upgrade.NewPermanentTenantUpgrade(
"create default databases",
toCV(clusterversion.VPrimordial6),
createDefaultDbs,
"create default databases", // v22_2StartupMigrationName
),
upgrade.NewTenantUpgrade(
"ensure preconditions are met before starting upgrading to v22.2",