-
Notifications
You must be signed in to change notification settings - Fork 18
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not yet completed the review - only skimmed over the first part of changes. I'll continue tomorrow.
# This list needs to include every version of etcd that we can upgrade from + latest | ||
ETCD_VER := v2.3.8 v3.3.4 | ||
# This is the version of etcd we should upgrade to (from the version list) | ||
ETCD_LATEST_VER := v3.3.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd swap the variable definitions to make it less error-prone:
# This is the version of etcd we should upgrade to (from the version list)
ETCD_LATEST_VER := v3.3.4
# ETCD Versions to include in the release
# This list needs to include every version of etcd that we can upgrade from + latest
ETCD_VER := v2.3.8 $(ETCD_LATEST_VER)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I would embed $(ETCD_LATEST_VER) in ETCD_VER, to me it adds some amount of risks, that if someone say bumps ETCD to v3.3.5, that they may forget to add it to the ETCD_VER list, everything would work but upgrades from that version may not.
The way it is now, if someone bumps ETCD_LATEST_VER to v3.3.5, but forgets to update ETCD_VER, the new release will clearly not work.
@@ -3,7 +3,7 @@ FROM planet/base | |||
ENV GOPATH /gopath | |||
ENV GOROOT /opt/go | |||
ENV PATH $PATH:$GOPATH/bin:$GOROOT/bin | |||
ENV GOVERSION 1.8.3 | |||
ENV GOVERSION 1.10.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need go1.10
for anything specifically or is it just a timely update?
Also, do you know if go1.10
is an umbrella binary for 1.10.x
or it is the 1.10.0
? If they could release from release-branch.go1.10
as capturing the latest patch (i.e. 1.10.2
atm) but not sure if they actually do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Etcd requires go 1.9, but I bumped to 1.10 since I was updating the version anyways: etcd-io/etcd#8548 (comment)
As for an umbrella binary, I don't know, I was sticking with the full version, as I suspect that's less likely for an upstream change to affect us.
ln -sf /dev/null $(ROOTFS)/etc/systemd/system/etcd-upgrade.service | ||
|
||
# Write to the release file to indicate the latest release | ||
echo PLANET_ETCD_VERSION=$(ETCD_LATEST_VER) >> $(ROOTFS)/etc/planet-release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is appending to planet-release
intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, right now it's only for etcd, but strictly speaking, if we were to embed version information for other dependencies, we could also include additional version information.
build.assets/makefiles/etcd/etcd.mk
Outdated
|
||
.PHONY: $(ETCD_VER) | ||
$(ETCD_VER): | ||
@echo "\n---> $@ - Downloading etcd\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
echo -e
to properly interpret escape sequences
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that applies to Makefile builtin @echo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's a builtin and what make does is it runs the echo
command with the shell. The shell is /bin/sh
by default but can be overridden with the SHELL
envvar.
Now that I'm thinking about it, a better solution would be, for instance, to default SHELL
to /bin/bash
and use -e
to avoid inconsistencies across environments:
With this makefile as an example:
.PHONY: all
all:
@echo "\n --> I'm a bad boy\n"
@echo -e "\n --> I'm a good boy\n"
you get either:
10:18 $ make
--> I'm a bad boy
-e
--> I'm a good boy
or
10:18 $ make
\n --> I'm a bad boy\n
--> I'm a good boy
(the last being also the result of using /bin/bash
as a shell since bash
only interprets escape sequence provided -e
).
The results are different when I make locally or on jenkins (on jenkins, /bin/sh
has been symlinked to /bin/bash
)
tool/planet/cfg.go
Outdated
@@ -70,6 +70,8 @@ type Config struct { | |||
// EtcdInitialCluster configures the value of ETCD_INITIAL_CLUSTER environment variable | |||
// inside the container | |||
EtcdInitialCluster string | |||
//EtcdGatewayList is a list of etcd endpoints that the etcd gateway can use to reach the cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// EtcdGatewayList
(a whitespace between commant and the sentence start)
tool/planet/etcd.go
Outdated
return nil | ||
} | ||
|
||
// etcdDisable disabled etcd on this machine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disables
tool/planet/etcd.go
Outdated
|
||
// etcdEnable enables a disabled etcd node | ||
func etcdEnable(upgradeService bool) error { | ||
if upgradeService { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to test for reversed condition to keep code outdented:
func etcdEnable(upgradeService bool) error {
if !upgradeService {
return trace.Wrap(enableService(ETCDServiceName))
}
// don't actually enable the service if this is a proxy
env, err := box.ReadEnvironment(ContainerEnvironmentFile)
if err != nil {
return trace.Wrap(err)
}
if env.Get(EnvEtcdProxy) == EtcdProxyOn {
log.Infof("etcd is in proxy mode, nothing to do")
return nil
}
return trace.Wrap(enableService(ETCDUpgradeServiceName))
}
tool/planet/etcd.go
Outdated
} | ||
|
||
if env.Get(EnvEtcdProxy) == EtcdProxyOn { | ||
log.Infof("etcd is in proxy mode, nothing to do") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for formatting version Infof
w/o placeholders
tool/planet/etcd.go
Outdated
// this only happens if the service is already running | ||
status, err := getServiceStatus(APIServerServiceName) | ||
if err != nil { | ||
return trace.Wrap(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to fail the upgrade if you failed to query the status of the API server - it does not look like it's critical to the upgrade operation?
Otherwise you might need to undo a few things to be able to retry (though I'm not sure yet this is possible).
tool/planet/etcd.go
Outdated
} | ||
|
||
func etcdRestore(file string) error { | ||
ctx := context.TODO() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here: I'd accept the context from the outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm not completely sure I agree with you here in this specific case. etcdBackup / etcdRestore come from the dispatch from main based on the kingpin function, where we may not want to use the same context based on which cli command we're in.
However, this comes more from the philosophy of cobra, where the library does the dispatch for you, and the function for the command is really the entrypoint. Example: https://github.com/gravitational/etcd-backup/blob/master/cmd/etcd-backup/backup.go
# instead of the etcd service | ||
|
||
[Service] | ||
ExecStart= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra ExecStart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to reset the previous value (here is a faint hint at resetting the values of a list attribute before setting a new one - look at the end of the document).
tool/planet/etcd.go
Outdated
func readEtcdVersion(path string) (string, error) { | ||
inFile, err := os.Open(path) | ||
if err != nil { | ||
return "", trace.Wrap(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trace.ConvertSystemError
tool/planet/etcd.go
Outdated
func writeEtcdEnvironment(path string, version string) error { | ||
err := os.MkdirAll(filepath.Dir(path), 644) | ||
if err != nil { | ||
return trace.Wrap(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trace.ConvertSystemError
tool/planet/etcd.go
Outdated
|
||
f, err := os.Create(path) | ||
if err != nil { | ||
return trace.Wrap(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trace.ConvertSystemError
tool/planet/etcd.go
Outdated
} | ||
|
||
func writeEtcdEnvironment(path string, version string) error { | ||
err := os.MkdirAll(filepath.Dir(path), 644) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a constant for 644
?
tool/planet/etcd.go
Outdated
out, err := exec.CommandContext(ctx, "/bin/systemctl", "--no-block", operation, service).CombinedOutput() | ||
log.Infof("%v %v: %v", operation, service, string(out)) | ||
if err != nil { | ||
return trace.Wrap(err, fmt.Sprintf("failed to %v %v: %v", operation, service, string(out))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need fmt.Sprintf
here.
tool/planet/etcd.go
Outdated
from = DefaultEtcdStoreBackup | ||
to = DefaultEtcdStoreCurrent | ||
} | ||
err = os.RemoveAll(to) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's dangerous. I think this logic needs a better protection against accidental screw-ups - otherwise if "upgrade" somehow runs twice without "rollback", it will nuke the only backup on the second run.
Maybe it can append some ID (e.g. upgrade operation ID) to the backup dir to make it per-operation unique and instead of happily deleting it, refuse to proceed if it's not empty (thus, mandating rollback first).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, if you run the step twice, it will early exit before this is hit, due to CurrentVersion == Desired version.
However, if it fails part way through the step, I think the filesystem could be inconsistent that would allow it to continue and delete the data. I'm not sure appending an ID really helps here, I think it would be susceptible to similar problems or would just leave a bunch of data lying around (which some customers are having trouble with as is). That said, I think it should be possible to make this code re-entrant, so that if it's run multiple times it will skip steps already completed and continue where it left off.
// symlink /usr/bin/etcd to the version we expect to be running | ||
for _, path := range []string{"/usr/bin/etcd", "/usr/bin/etcdctl"} { | ||
// ignore the error from os.Remove, since we don't care if it fails | ||
_ = os.Remove(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get rid of _ =
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought using _ = was a pattern for telling linters (and other devs) I'm intentionally swallowing the error.
tool/planet/etcd.go
Outdated
if _, err := os.Stat(path); err == nil { | ||
return readEtcdVersion(path) | ||
} | ||
return AssumeEtcdVersion, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking at how this function is used and TBH the fact that it falls back to this "assumed" version makes the logic a bit harder to follow (for example, in etcdInit()
). Also, IIUC it will still return 2.3.8 initially for new installs even though they're running 3.x? I would just have it return NotFound and handle on the caller side.
tool/planet/etcd.go
Outdated
} | ||
log.Info("Desired etcd version: ", desiredVersion) | ||
|
||
if currentVersion == AssumeEtcdVersion { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I was talking about in the currentEtcdVersion()
method comment (see somewhere below). It returned 2.3.8 while in fact it's not.
tool/planet/etcd_test.go
Outdated
func (*EtcdSuite) TestEtcdAssumedVersion(c *check.C) { | ||
ver, err := currentEtcdVersion("/this/file/doesnt/exist") | ||
if err != nil { | ||
c.Error(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a longer version of:
c.Assert(err, IsNil)
tool/planet/etcd_test.go
Outdated
func (*EtcdSuite) TestEtcdParseFile(c *check.C) { | ||
file, err := ioutil.TempFile(os.TempDir(), "prefix") | ||
if err != nil { | ||
c.Fatal(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c.Assert(err, IsNil)
? This will also mark the test as failed.
tool/planet/etcd_test.go
Outdated
|
||
// reading an empty file should return an error | ||
ver, err := currentEtcdVersion(file.Name()) | ||
c.Assert(err, check.NotNil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for specific error expectation (type or a message).
tool/planet/etcd_test.go
Outdated
// reading a missing file should return an error | ||
ver, err = readEtcdVersion("/this/file/doesnt/exist") | ||
c.Assert(err, check.NotNil) | ||
c.Assert(ver, check.Equals, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's worth testing.
tool/planet/etcd_test.go
Outdated
|
||
// write a version file then check it | ||
version := "1.1.1" | ||
file.WriteString(fmt.Sprintf("%v=%v", EnvEtcdVersion, version)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not ignore errors here.
There's also a helper in fmt
:
fmt.Fprintf(file, "%v=%v", EnvEtcdVersion, version)
tool/planet/main.go
Outdated
@@ -161,6 +161,23 @@ func run() error { | |||
cetcdPromoteInitialCluster = cetcdPromote.Flag("initial-cluster", "Initial cluster, as output by 'member add' command").Required().String() | |||
cetcdPromoteInitialClusterState = cetcdPromote.Flag("initial-cluster-state", "Initial cluster state, as output by 'member add' command").Required().String() | |||
|
|||
cetcdInit = cetcd.Command("init", "setup etcd to run the correct version").Hidden() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setup etcd... - Capitalized as in other descriptions.
tool/planet/main.go
Outdated
cetcdEnable = cetcd.Command("enable", "Enable etcd on this node") | ||
cetcdEnableUpgrade = cetcdEnable.Flag("upgrade", "enable the upgrade service").Bool() | ||
|
||
cetcdUpgrade = cetcd.Command("upgrade", "Upgrade etcd to latest available in this container") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... latest version ...
tool/planet/main.go
Outdated
cetcdRollback = cetcd.Command("rollback", "Rollback etcd to the previous release") | ||
|
||
cetcdRestore = cetcd.Command("restore", "Restore etcd backup as part of the upgrade") | ||
cetcdRestoreFile = cetcdRestore.Arg("file", "A previously taken backup file to use during upgrade").Required().String() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use kingpin's ExistingFile
mixin to enfocre this:
cetcdRestoreFile = cetcdRestore.Arg("file", "A previously taken backup file to use during upgrade").Required().ExistingFile()
tool/planet/start.go
Outdated
@@ -430,6 +437,35 @@ func addEtcdOptions(config *Config) { | |||
} | |||
} | |||
|
|||
// setupEtcd runs setup tasks for etcd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Use dots to end sentences.
tool/planet/start.go
Outdated
return trace.Wrap(err) | ||
} | ||
} else { | ||
err := os.Remove(dropinPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test for etcdproxy != "on"
instead and don't use the else
block as it is redundant:
func setupEtcd(config *Config) error {
dropinPath := path.Join(config.Rootfs, ETCDGatewayDropinPath)
if strings.ToLower(config.EtcdProxy) != "on" {
err := os.Remove(dropinPath)
if err != nil && !os.IsNotExist(err) {
return trace.Wrap(err)
}
return nil
}
err := os.MkdirAll(path.Join(config.Rootfs, "etc/systemd/system/etcd.service.d/"), 0755)
if err != nil {
return trace.Wrap(err)
}
err = os.Symlink(
"/lib/systemd/system/etcd-gateway.dropin",
dropinPath,
)
if err != nil && !os.IsExist(err) {
return trace.Wrap(err)
}
return nil
}
for _, service := range services { | ||
status, err := getServiceStatus(service) | ||
if err != nil { | ||
log.Warnf("Failed to query status of service %v. Continuing upgrade. Error: %v", service, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected scenario? When a service status can't be queried?
tool/planet/etcd.go
Outdated
} | ||
} | ||
} else { | ||
// in order to upgrade, write the new version to to disk with the current version as backup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to to" -> "to"
if err != nil { | ||
return trace.Wrap(err) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we also need to clean up data that can possibly be left after an upgrade attempt on rollback?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens on the next upgrade attempt:
Lines 315 to 320 in 182dff9
// wipe data directory of any previous upgrade attempt | |
path := path.Join(getBaseEtcdDir(desiredVersion), "member") | |
err = os.RemoveAll(path) | |
if err != nil && !os.IsNotExist(err) { | |
return trace.ConvertSystemError(err) | |
} |
tool/planet/etcd.go
Outdated
// wipe data directory of any previous upgrade attempt | ||
path := path.Join(getBaseEtcdDir(desiredVersion), "member") | ||
err = os.RemoveAll(path) | ||
if err != nil && !os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsNotExist check is redundant. From the docs: If the path does not exist, RemoveAll returns nil (no error).
.
|
||
// wipe old backups leftover from previous upgrades | ||
// Note: if this fails, but previous steps were successfull, the backups won't get cleaned up | ||
if backupVersion != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is a version of a previous backup, not backup for this upgrade?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's correct.
} | ||
|
||
func writeEtcdEnvironment(path string, version string, prevVersion string) error { | ||
err := os.MkdirAll(filepath.Dir(path), 644) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
644 -> constants.SharedReadMask
tool/planet/etcd.go
Outdated
// the problem is, when shutting down etcd, systemd will respond when the process has been told to shutdown | ||
// but this leaves a race, where we might be continuing while etcd is still cleanly shutting down | ||
func waitForEtcdStopped(ctx context.Context) error { | ||
tick := time.Tick(WaitInterval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ticker := time.NewTicker
instead and defer ticker.Close()
it. From the docs:
While Tick is useful for clients that have no need to shut down the Ticker, be aware that without a way to shut it down the underlying Ticker cannot be recovered by the garbage collector; it "leaks".
} | ||
|
||
func etcdRestore(file string) error { | ||
ctx := context.TODO() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have all etcd*
functions accept context as a parameter, even if it'd be passed as context.TODO()
from main.go for now. This way it will be much easier to, for example, add a "--timeout" flag to all these commands so client can control it. Right now these internal context are pretty much useless.
tool/planet/etcd.go
Outdated
} | ||
|
||
log.Infof("removing %v", ETCDProxyDir) | ||
if err := os.RemoveAll(ETCDProxyDir); err != nil { | ||
if err := os.RemoveAll(ETCDProxyDir); err != nil && !os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsNotExist check is redundant.
tool/planet/etcd.go
Outdated
CAFile: DefaultEtcdctlCAFile, | ||
}, | ||
Prefix: []string{"/"}, // Restore all etcd data | ||
MigratePrefix: []string{"/registry"}, // migrate kubernetes data to etcd3 datastore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a constant for /registry
as far as I can see.
Issue: https://github.com/gravitational/gravity/issues/3121
This is the planet logic to implemented etcd upgrades. Let's chat about the approach before you spend too much time looking through the PR's.