Skip to content
This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

etcd3 #286

Merged
merged 68 commits into from
May 23, 2018
Merged

etcd3 #286

merged 68 commits into from
May 23, 2018

Conversation

knisbet
Copy link
Contributor

@knisbet knisbet commented May 7, 2018

Issue: https://github.com/gravitational/gravity/issues/3121

This is the planet logic to implemented etcd upgrades. Let's chat about the approach before you spend too much time looking through the PR's.

Kevin Nisbet added 30 commits April 10, 2018 15:43
@knisbet knisbet requested review from r0mant and a-palchikov May 7, 2018 14:36
Copy link
Contributor

@a-palchikov a-palchikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not yet completed the review - only skimmed over the first part of changes. I'll continue tomorrow.

# This list needs to include every version of etcd that we can upgrade from + latest
ETCD_VER := v2.3.8 v3.3.4
# This is the version of etcd we should upgrade to (from the version list)
ETCD_LATEST_VER := v3.3.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd swap the variable definitions to make it less error-prone:

# This is the version of etcd we should upgrade to (from the version list)
ETCD_LATEST_VER := v3.3.4
# ETCD Versions to include in the release
# This list needs to include every version of etcd that we can upgrade from + latest
ETCD_VER := v2.3.8 $(ETCD_LATEST_VER)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I would embed $(ETCD_LATEST_VER) in ETCD_VER, to me it adds some amount of risks, that if someone say bumps ETCD to v3.3.5, that they may forget to add it to the ETCD_VER list, everything would work but upgrades from that version may not.

The way it is now, if someone bumps ETCD_LATEST_VER to v3.3.5, but forgets to update ETCD_VER, the new release will clearly not work.

@@ -3,7 +3,7 @@ FROM planet/base
ENV GOPATH /gopath
ENV GOROOT /opt/go
ENV PATH $PATH:$GOPATH/bin:$GOROOT/bin
ENV GOVERSION 1.8.3
ENV GOVERSION 1.10.1
Copy link
Contributor

@a-palchikov a-palchikov May 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need go1.10 for anything specifically or is it just a timely update?

Also, do you know if go1.10 is an umbrella binary for 1.10.x or it is the 1.10.0? If they could release from release-branch.go1.10 as capturing the latest patch (i.e. 1.10.2 atm) but not sure if they actually do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Etcd requires go 1.9, but I bumped to 1.10 since I was updating the version anyways: etcd-io/etcd#8548 (comment)

As for an umbrella binary, I don't know, I was sticking with the full version, as I suspect that's less likely for an upstream change to affect us.

ln -sf /dev/null $(ROOTFS)/etc/systemd/system/etcd-upgrade.service

# Write to the release file to indicate the latest release
echo PLANET_ETCD_VERSION=$(ETCD_LATEST_VER) >> $(ROOTFS)/etc/planet-release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is appending to planet-release intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, right now it's only for etcd, but strictly speaking, if we were to embed version information for other dependencies, we could also include additional version information.


.PHONY: $(ETCD_VER)
$(ETCD_VER):
@echo "\n---> $@ - Downloading etcd\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

echo -e to properly interpret escape sequences

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that applies to Makefile builtin @echo?

Copy link
Contributor

@a-palchikov a-palchikov May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a builtin and what make does is it runs the echo command with the shell. The shell is /bin/sh by default but can be overridden with the SHELL envvar.
Now that I'm thinking about it, a better solution would be, for instance, to default SHELL to /bin/bash and use -e to avoid inconsistencies across environments:

With this makefile as an example:

.PHONY: all
all:
	@echo "\n --> I'm a bad boy\n"
	@echo -e "\n --> I'm a good boy\n"

you get either:

10:18 $ make

 --> I'm a bad boy

-e 
 --> I'm a good boy

or

10:18 $ make
\n --> I'm a bad boy\n

 --> I'm a good boy

(the last being also the result of using /bin/bash as a shell since bash only interprets escape sequence provided -e).
The results are different when I make locally or on jenkins (on jenkins, /bin/sh has been symlinked to /bin/bash)

@@ -70,6 +70,8 @@ type Config struct {
// EtcdInitialCluster configures the value of ETCD_INITIAL_CLUSTER environment variable
// inside the container
EtcdInitialCluster string
//EtcdGatewayList is a list of etcd endpoints that the etcd gateway can use to reach the cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// EtcdGatewayList (a whitespace between commant and the sentence start)

return nil
}

// etcdDisable disabled etcd on this machine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disables


// etcdEnable enables a disabled etcd node
func etcdEnable(upgradeService bool) error {
if upgradeService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to test for reversed condition to keep code outdented:

func etcdEnable(upgradeService bool) error {
	if !upgradeService {
		return trace.Wrap(enableService(ETCDServiceName))
	}
	// don't actually enable the service if this is a proxy
	env, err := box.ReadEnvironment(ContainerEnvironmentFile)
	if err != nil {
		return trace.Wrap(err)
	}

	if env.Get(EnvEtcdProxy) == EtcdProxyOn {
		log.Infof("etcd is in proxy mode, nothing to do")
		return nil
	}
	return trace.Wrap(enableService(ETCDUpgradeServiceName))
}

}

if env.Get(EnvEtcdProxy) == EtcdProxyOn {
log.Infof("etcd is in proxy mode, nothing to do")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for formatting version Infof w/o placeholders

// this only happens if the service is already running
status, err := getServiceStatus(APIServerServiceName)
if err != nil {
return trace.Wrap(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to fail the upgrade if you failed to query the status of the API server - it does not look like it's critical to the upgrade operation?
Otherwise you might need to undo a few things to be able to retry (though I'm not sure yet this is possible).

}

func etcdRestore(file string) error {
ctx := context.TODO()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here: I'd accept the context from the outside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm not completely sure I agree with you here in this specific case. etcdBackup / etcdRestore come from the dispatch from main based on the kingpin function, where we may not want to use the same context based on which cli command we're in.

However, this comes more from the philosophy of cobra, where the library does the dispatch for you, and the function for the command is really the entrypoint. Example: https://github.com/gravitational/etcd-backup/blob/master/cmd/etcd-backup/backup.go

# instead of the etcd service

[Service]
ExecStart=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra ExecStart?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to reset the previous value (here is a faint hint at resetting the values of a list attribute before setting a new one - look at the end of the document).

func readEtcdVersion(path string) (string, error) {
inFile, err := os.Open(path)
if err != nil {
return "", trace.Wrap(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace.ConvertSystemError

func writeEtcdEnvironment(path string, version string) error {
err := os.MkdirAll(filepath.Dir(path), 644)
if err != nil {
return trace.Wrap(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace.ConvertSystemError


f, err := os.Create(path)
if err != nil {
return trace.Wrap(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace.ConvertSystemError

}

func writeEtcdEnvironment(path string, version string) error {
err := os.MkdirAll(filepath.Dir(path), 644)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a constant for 644?

out, err := exec.CommandContext(ctx, "/bin/systemctl", "--no-block", operation, service).CombinedOutput()
log.Infof("%v %v: %v", operation, service, string(out))
if err != nil {
return trace.Wrap(err, fmt.Sprintf("failed to %v %v: %v", operation, service, string(out)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need fmt.Sprintf here.

from = DefaultEtcdStoreBackup
to = DefaultEtcdStoreCurrent
}
err = os.RemoveAll(to)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's dangerous. I think this logic needs a better protection against accidental screw-ups - otherwise if "upgrade" somehow runs twice without "rollback", it will nuke the only backup on the second run.

Maybe it can append some ID (e.g. upgrade operation ID) to the backup dir to make it per-operation unique and instead of happily deleting it, refuse to proceed if it's not empty (thus, mandating rollback first).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if you run the step twice, it will early exit before this is hit, due to CurrentVersion == Desired version.

However, if it fails part way through the step, I think the filesystem could be inconsistent that would allow it to continue and delete the data. I'm not sure appending an ID really helps here, I think it would be susceptible to similar problems or would just leave a bunch of data lying around (which some customers are having trouble with as is). That said, I think it should be possible to make this code re-entrant, so that if it's run multiple times it will skip steps already completed and continue where it left off.

// symlink /usr/bin/etcd to the version we expect to be running
for _, path := range []string{"/usr/bin/etcd", "/usr/bin/etcdctl"} {
// ignore the error from os.Remove, since we don't care if it fails
_ = os.Remove(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get rid of _ = then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought using _ = was a pattern for telling linters (and other devs) I'm intentionally swallowing the error.

if _, err := os.Stat(path); err == nil {
return readEtcdVersion(path)
}
return AssumeEtcdVersion, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at how this function is used and TBH the fact that it falls back to this "assumed" version makes the logic a bit harder to follow (for example, in etcdInit()). Also, IIUC it will still return 2.3.8 initially for new installs even though they're running 3.x? I would just have it return NotFound and handle on the caller side.

}
log.Info("Desired etcd version: ", desiredVersion)

if currentVersion == AssumeEtcdVersion {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I was talking about in the currentEtcdVersion() method comment (see somewhere below). It returned 2.3.8 while in fact it's not.

func (*EtcdSuite) TestEtcdAssumedVersion(c *check.C) {
ver, err := currentEtcdVersion("/this/file/doesnt/exist")
if err != nil {
c.Error(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a longer version of:

c.Assert(err, IsNil)

func (*EtcdSuite) TestEtcdParseFile(c *check.C) {
file, err := ioutil.TempFile(os.TempDir(), "prefix")
if err != nil {
c.Fatal(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.Assert(err, IsNil)? This will also mark the test as failed.


// reading an empty file should return an error
ver, err := currentEtcdVersion(file.Name())
c.Assert(err, check.NotNil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for specific error expectation (type or a message).

// reading a missing file should return an error
ver, err = readEtcdVersion("/this/file/doesnt/exist")
c.Assert(err, check.NotNil)
c.Assert(ver, check.Equals, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it's worth testing.


// write a version file then check it
version := "1.1.1"
file.WriteString(fmt.Sprintf("%v=%v", EnvEtcdVersion, version))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not ignore errors here.
There's also a helper in fmt:

fmt.Fprintf(file, "%v=%v", EnvEtcdVersion, version)

@@ -161,6 +161,23 @@ func run() error {
cetcdPromoteInitialCluster = cetcdPromote.Flag("initial-cluster", "Initial cluster, as output by 'member add' command").Required().String()
cetcdPromoteInitialClusterState = cetcdPromote.Flag("initial-cluster-state", "Initial cluster state, as output by 'member add' command").Required().String()

cetcdInit = cetcd.Command("init", "setup etcd to run the correct version").Hidden()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setup etcd... - Capitalized as in other descriptions.

cetcdEnable = cetcd.Command("enable", "Enable etcd on this node")
cetcdEnableUpgrade = cetcdEnable.Flag("upgrade", "enable the upgrade service").Bool()

cetcdUpgrade = cetcd.Command("upgrade", "Upgrade etcd to latest available in this container")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... latest version ...

cetcdRollback = cetcd.Command("rollback", "Rollback etcd to the previous release")

cetcdRestore = cetcd.Command("restore", "Restore etcd backup as part of the upgrade")
cetcdRestoreFile = cetcdRestore.Arg("file", "A previously taken backup file to use during upgrade").Required().String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use kingpin's ExistingFile mixin to enfocre this:

cetcdRestoreFile = cetcdRestore.Arg("file", "A previously taken backup file to use during upgrade").Required().ExistingFile()

@@ -430,6 +437,35 @@ func addEtcdOptions(config *Config) {
}
}

// setupEtcd runs setup tasks for etcd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Use dots to end sentences.

return trace.Wrap(err)
}
} else {
err := os.Remove(dropinPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for etcdproxy != "on" instead and don't use the else block as it is redundant:

func setupEtcd(config *Config) error {
	dropinPath := path.Join(config.Rootfs, ETCDGatewayDropinPath)

	if strings.ToLower(config.EtcdProxy) != "on" {
		err := os.Remove(dropinPath)
		if err != nil && !os.IsNotExist(err) {
			return trace.Wrap(err)
		}
		return nil
	}
	err := os.MkdirAll(path.Join(config.Rootfs, "etc/systemd/system/etcd.service.d/"), 0755)
	if err != nil {
		return trace.Wrap(err)
	}

	err = os.Symlink(
		"/lib/systemd/system/etcd-gateway.dropin",
		dropinPath,
	)
	if err != nil && !os.IsExist(err) {
		return trace.Wrap(err)
	}
	return nil
}

for _, service := range services {
status, err := getServiceStatus(service)
if err != nil {
log.Warnf("Failed to query status of service %v. Continuing upgrade. Error: %v", service, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected scenario? When a service status can't be queried?

}
}
} else {
// in order to upgrade, write the new version to to disk with the current version as backup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to to" -> "to"

if err != nil {
return trace.Wrap(err)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also need to clean up data that can possibly be left after an upgrade attempt on rollback?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens on the next upgrade attempt:

planet/tool/planet/etcd.go

Lines 315 to 320 in 182dff9

// wipe data directory of any previous upgrade attempt
path := path.Join(getBaseEtcdDir(desiredVersion), "member")
err = os.RemoveAll(path)
if err != nil && !os.IsNotExist(err) {
return trace.ConvertSystemError(err)
}

// wipe data directory of any previous upgrade attempt
path := path.Join(getBaseEtcdDir(desiredVersion), "member")
err = os.RemoveAll(path)
if err != nil && !os.IsNotExist(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsNotExist check is redundant. From the docs: If the path does not exist, RemoveAll returns nil (no error)..


// wipe old backups leftover from previous upgrades
// Note: if this fails, but previous steps were successfull, the backups won't get cleaned up
if backupVersion != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is a version of a previous backup, not backup for this upgrade?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct.

}

func writeEtcdEnvironment(path string, version string, prevVersion string) error {
err := os.MkdirAll(filepath.Dir(path), 644)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

644 -> constants.SharedReadMask

// the problem is, when shutting down etcd, systemd will respond when the process has been told to shutdown
// but this leaves a race, where we might be continuing while etcd is still cleanly shutting down
func waitForEtcdStopped(ctx context.Context) error {
tick := time.Tick(WaitInterval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ticker := time.NewTicker instead and defer ticker.Close() it. From the docs:

While Tick is useful for clients that have no need to shut down the Ticker, be aware that without a way to shut it down the underlying Ticker cannot be recovered by the garbage collector; it "leaks".

}

func etcdRestore(file string) error {
ctx := context.TODO()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have all etcd* functions accept context as a parameter, even if it'd be passed as context.TODO() from main.go for now. This way it will be much easier to, for example, add a "--timeout" flag to all these commands so client can control it. Right now these internal context are pretty much useless.

}

log.Infof("removing %v", ETCDProxyDir)
if err := os.RemoveAll(ETCDProxyDir); err != nil {
if err := os.RemoveAll(ETCDProxyDir); err != nil && !os.IsNotExist(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsNotExist check is redundant.

CAFile: DefaultEtcdctlCAFile,
},
Prefix: []string{"/"}, // Restore all etcd data
MigratePrefix: []string{"/registry"}, // migrate kubernetes data to etcd3 datastore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a constant for /registry as far as I can see.

@knisbet knisbet merged commit 5376dcb into master May 23, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants