Allow pausing clusters #301

kragniz · 2018-03-27T11:46:11Z

What this PR does / why we need it: this allows users to 'pause' a cluster, making Navigator skip syncing it. This allows for manual intervention in case something goes wrong.

Release note:

Allow pausing clusters

munnerz · 2018-03-27T12:55:54Z

pkg/apis/navigator/types.go

@@ -32,6 +32,7 @@ type CassandraClusterSpec struct {
 	NodePools []CassandraClusterNodePool
 	Version   version.Version
 	Image     *ImageSpec
+	Paused    *bool


Does this need to be a pointer? (i.e. is there a 3rd 'unset' state that we need to support?).

IMO paused: false and paused: true are expressive enough?

I did this so that omitempty hides the field when it's not set, but I don't think that's actually necessary. I'll change it to bool
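To illustrate the trade-off discussed above, here is a minimal sketch of how `omitempty` behaves for a `*bool` versus a plain `bool`. The struct names `specWithPointer` and `specWithBool` are hypothetical and exist only for this example; they are not types from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specWithPointer mirrors the original proposal: a nil pointer is a third, "unset" state.
type specWithPointer struct {
	Paused *bool `json:"paused,omitempty"`
}

// specWithBool mirrors the suggested change: only true and false exist.
type specWithBool struct {
	Paused bool `json:"paused,omitempty"`
}

// mustJSON marshals v and panics on error, which cannot happen for these types.
func mustJSON(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	f := false
	fmt.Println(mustJSON(specWithPointer{}))           // nil pointer: field omitted
	fmt.Println(mustJSON(specWithPointer{Paused: &f})) // explicit false is kept
	fmt.Println(mustJSON(specWithBool{}))              // false is the zero value: field omitted
	fmt.Println(mustJSON(specWithBool{Paused: true}))  // true is always emitted
}
```

With a plain `bool`, `omitempty` still hides the field when it is false, so the pointer is only needed if "unset" must be distinguishable from "false".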

munnerz · 2018-03-27T12:56:36Z

pkg/apis/navigator/v1alpha1/types.go

@@ -39,6 +39,9 @@ type CassandraClusterSpec struct {

 	// The version of the database to be used for nodes in the cluster.
 	Version version.Version `json:"version"`
+
+	// If set to true, no actions will take place on this cluster.
+	Paused *bool `json:"paused,omitempty"`


Can this be a part of the NavigatorClusterConfig structure?

munnerz · 2018-03-27T12:57:26Z

pkg/controllers/cassandra/cluster_control.go

@@ -73,6 +73,12 @@ func NewControl(
 }

 func (e *defaultCassandraClusterControl) Sync(c *v1alpha1.CassandraCluster) error {
+	if c.Spec.Paused != nil && *c.Spec.Paused == true {
+		glog.V(4).Infof("defaultCassandraClusterControl.Sync skipped, since cluster is paused")
+		e.recorder.Eventf(c, apiv1.EventTypeNormal, "spec.paused", "Cluster paused, not syncing")


The event name here (spec.paused) should be pulled into a const, and formatted similar to the other events.

(we also need to move them into pkg/apis/navigator/v1alpha1 at some point)

munnerz · 2018-03-27T17:47:45Z

pkg/controllers/elasticsearch/cluster_control.go

@@ -29,13 +29,16 @@ import (
 const (
 	errorSync = "ErrSync"

+	pauseField = "spec.paused"


Can you update this to be like the others? (i.e. caps case) e.g. Paused?

I was trying to be consistent with event messages like this one:

cass-example2-ringnodes-0.151fd039c8fa1dd1 Pod spec.initContainers{install-pilot} Normal Started kubelet, minikube Started container

But I'll change it if Paused is the preferred way

munnerz · 2018-04-03T11:22:35Z

Interestingly, the deployment controller does not pause scale actions when spec.paused is set. It also does not log an event about it, but instead sets a DeploymentProgressing condition (in this function: https://github.com/kubernetes/kubernetes/blob/1102fd0dcbc4a408045e8d1bc42f056909e72322/pkg/controller/deployment/sync.go#L75).

Given the nature of a scale event in an ES or C* cluster, I think we should also pause scale events in the case of our pause feature, but maybe we want to borrow the behaviour around setting a condition/logging an event (i.e. don't log an event, instead set a condition on the resource).

We've not really utilised conditions up until now, but they do provide a nice and clear way to provide insight into the state of a resource (and save hammering the events API with duplicate events). An event probably shouldn't be used to report the state of objects, but instead used to report information on an action (perhaps not necessarily just 'Actions' as in our definition of an action) that has been taken (whether it has succeeded, or conversely if it errored then reporting some information here too). There may be some information that makes sense to duplicate as both a condition and an event, but I'm not sure. I guess it depends on the nature of the action.
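As a rough sketch of why a condition avoids duplicate reports, the helper below only registers a change when the stored state actually differs, so repeated syncs of an unchanged cluster produce nothing new. The `condition` type, the `setCondition` helper and the `"ClusterPaused"` reason string are assumptions made for this example, not the types used in Navigator or Kubernetes.

```go
package main

import (
	"fmt"
	"time"
)

type conditionStatus string

const (
	conditionTrue  conditionStatus = "True"
	conditionFalse conditionStatus = "False"
)

// condition is a hypothetical, simplified stand-in for a cluster status condition.
type condition struct {
	Type               string
	Status             conditionStatus
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition inserts or updates the condition of the given type and reports
// whether anything changed, so callers can skip a redundant status update.
func setCondition(conds []condition, c condition) ([]condition, bool) {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status == c.Status && conds[i].Reason == c.Reason && conds[i].Message == c.Message {
			return conds, false // same state: nothing to write
		}
		if conds[i].Status == c.Status {
			// Only the reason or message changed: keep the old transition time.
			c.LastTransitionTime = conds[i].LastTransitionTime
		}
		conds[i] = c
		return conds, true
	}
	return append(conds, c), true
}

func main() {
	var conds []condition
	paused := condition{
		Type:               "Progressing",
		Status:             conditionFalse,
		Reason:             "ClusterPaused", // hypothetical reason string
		Message:            "Cluster is paused",
		LastTransitionTime: time.Now(),
	}
	for i := 0; i < 3; i++ { // three sync loops observe the same paused state
		var changed bool
		conds, changed = setCondition(conds, paused)
		fmt.Printf("sync %d: changed=%v, conditions=%d\n", i, changed, len(conds))
	}
}
```

An event recorded on every sync would instead be sent to the events API each time, which is the "hammering" concern raised above.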

wallrj

Maybe add spec.paused: false to the quick-start manifests (perhaps commented out) to make it clear that the option is available.
What about the Pilot control loop? Should it stop when the cluster is paused? Perhaps a scale down has been started by mistake and you want to quickly prevent the pilot from commencing decommissioning.

wallrj · 2018-04-03T16:39:25Z

Some interesting thoughts here on PauseCondition: kubernetes/kubernetes#58465 (comment)

munnerz · 2018-04-11T11:44:44Z

pkg/apis/navigator/types.go

+	ClusterProgressing ClusterConditionType = "Progressing"
+	// ReplicaFailure is added in a cluster when one of its pods fails to be created
+	// or deleted.
+	ClusterReplicaFailure ClusterConditionType = "ReplicaFailure"


I can't see where we actually use/set these conditions?

Also, update comments (assuming they've come from Deployment defs) and remove conditions we don't currently implement.

Actually implemented now! (I should have added wip back to the title)

kragniz · 2018-04-11T15:26:52Z

/retest

wallrj

Thanks @kragniz

I spotted a few places where errors aren't being handled and suggested some further documentation of the conditions.

Now I see it in action, the conditions do seem a bit confusing to me, and I think I'd probably opt for a simple CassandraCluster.Status.Paused flag... but we've already had that discussion :-) This is great for now.

Please answer or address the comments below.

wallrj · 2018-04-11T16:39:03Z

docs/paused-clusters.rst

+----------------
+
+A cluster can be paused by setting ``spec.paused`` to ``true``.
+When this is set, navigator will not perform any actions on the cluster, allowing for manual intervention.


typo "navigator" > "Navigator"

I think it'd be worth adding a paragraph about the conditions you can expect to find when the cluster is paused and when it has resumed.

wallrj · 2018-04-12T08:20:38Z

pkg/controllers/cassandra/cluster_control.go

+// checkPausedConditions checks if the given cluster is paused or not and adds an appropriate condition.
+func (e *defaultCassandraClusterControl) checkPausedConditions(c *v1alpha1.CassandraCluster) error {
+	cond := c.Status.GetStatusCondition(v1alpha1.ClusterConditionProgressing)
+	pausedCondExists := cond != nil && cond.Reason == v1alpha1.PausedClusterReason


Should we also look for cond.Status == True here?

wallrj · 2018-04-12T08:21:54Z

pkg/controllers/cassandra/cluster_control.go

+	if c.Spec.Paused && !pausedCondExists {
+		c.Status.UpdateStatusCondition(
+			v1alpha1.ClusterConditionProgressing,
+			v1alpha1.ConditionUnknown,


Shouldn't this be ConditionFalse, i.e. the cluster is not progressing?

wallrj · 2018-04-12T08:22:46Z

pkg/controllers/cassandra/cluster_control.go

+	} else if !c.Spec.Paused && pausedCondExists {
+		c.Status.UpdateStatusCondition(
+			v1alpha1.ClusterConditionProgressing,
+			v1alpha1.ConditionUnknown,


And should this be ConditionTrue?
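The pause and resume transitions under review can be sketched as a small pure function. This follows the reviewer's suggestion (False when paused, True when resumed) and is only an illustration: the `progressing` type, the constant names and the reason strings are hypothetical stand-ins for the v1alpha1 identifiers in the diff, and the "Cluster is paused" message is assumed.

```go
package main

import "fmt"

// Hypothetical names standing in for the v1alpha1 constants in the diff.
const (
	statusTrue    = "True"
	statusFalse   = "False"
	reasonPaused  = "ClusterPaused"
	reasonResumed = "ClusterResumed"
)

// progressing is a simplified stand-in for the Progressing condition.
type progressing struct {
	Status  string
	Reason  string
	Message string
}

// syncPausedCondition returns the condition that should be stored after a sync.
// cur is nil when no Progressing condition has been set yet.
func syncPausedCondition(paused bool, cur *progressing) *progressing {
	pausedCondExists := cur != nil && cur.Reason == reasonPaused
	switch {
	case paused && !pausedCondExists:
		// Newly paused: the cluster is not progressing.
		return &progressing{Status: statusFalse, Reason: reasonPaused, Message: "Cluster is paused"}
	case !paused && pausedCondExists:
		// Newly resumed: the cluster may progress again.
		return &progressing{Status: statusTrue, Reason: reasonResumed, Message: "Cluster is resumed"}
	default:
		// No transition: leave whatever is there untouched.
		return cur
	}
}

func main() {
	var cond *progressing
	for _, paused := range []bool{false, true, true, false} {
		cond = syncPausedCondition(paused, cond)
		if cond == nil {
			fmt.Printf("paused=%v -> no condition\n", paused)
			continue
		}
		fmt.Printf("paused=%v -> %s (%s)\n", paused, cond.Status, cond.Reason)
	}
}
```

Note that a cluster that has never been paused gets no condition at all in this sketch, which is the gap a later review comment asks about.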

wallrj · 2018-04-12T08:23:51Z

pkg/controllers/cassandra/cluster_control.go

+	}
+
+	var err error
+	c, err = e.state.NavigatorClientset.NavigatorV1alpha1().CassandraClusters(c.Namespace).UpdateStatus(c)


nit: c is unused.

wallrj · 2018-04-12T08:25:28Z

pkg/controllers/cassandra/cluster_control.go

+	c = c.DeepCopy()
+	var err error
+
+	e.checkPausedConditions(c)


Check and return the error here so that the sync and checkPausedConditions will be retried in case of e.g. conflict errors.

wallrj · 2018-04-12T08:25:56Z

pkg/controllers/elasticsearch/cluster_control.go

 func (e *defaultElasticsearchClusterControl) Sync(c *v1alpha1.ElasticsearchCluster) (v1alpha1.ElasticsearchClusterStatus, error) {
 	c = c.DeepCopy()
 	var err error

+	e.checkPausedConditions(c)


And check and return error here too.

wallrj · 2018-04-12T08:30:32Z

pkg/controllers/elasticsearch/cluster_control.go

+	c, err = e.navigatorClient.NavigatorV1alpha1().ElasticsearchClusters(c.Namespace).UpdateStatus(c)
+	return err
+}
+


Maybe this method could instead mutate the Cluster Status in place. If there was also an interface for cluster condition operations, it could be made into a generic function shared between cassandra and elasticsearch.
The status is already updated by Controller.sync lower in the stack, so we could avoid a separate API call.

Happy if you prefer to leave this for a followup branch.
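One possible shape for that shared function is sketched below. Everything here is hypothetical: the `conditionSetter` interface, its method names and the `cluster` type are invented for illustration, and the reason strings are assumptions rather than the constants from the PR.

```go
package main

import "fmt"

// conditionSetter is a hypothetical interface that both cluster types could satisfy.
type conditionSetter interface {
	IsPaused() bool
	ProgressingReason() string
	SetProgressing(status, reason, message string)
}

// cluster is a stand-in for CassandraCluster or ElasticsearchCluster.
type cluster struct {
	Paused  bool
	Status  string
	Reason  string
	Message string
}

func (c *cluster) IsPaused() bool            { return c.Paused }
func (c *cluster) ProgressingReason() string { return c.Reason }
func (c *cluster) SetProgressing(status, reason, message string) {
	c.Status, c.Reason, c.Message = status, reason, message
}

// syncPausedConditions mutates the status in place and makes no API call;
// the caller is assumed to persist the status once, later in its own sync.
func syncPausedConditions(c conditionSetter) {
	pausedCondExists := c.ProgressingReason() == "ClusterPaused"
	if c.IsPaused() && !pausedCondExists {
		c.SetProgressing("False", "ClusterPaused", "Cluster is paused")
	} else if !c.IsPaused() && pausedCondExists {
		c.SetProgressing("True", "ClusterResumed", "Cluster is resumed")
	}
}

func main() {
	cass := &cluster{Paused: true}
	es := &cluster{Paused: false}
	syncPausedConditions(cass)
	syncPausedConditions(es)
	fmt.Printf("cassandra: %+v\nelasticsearch: %+v\n", *cass, *es)
}
```

Because the function has nothing to fail on, it needs no error return, and the single status write stays with the controller.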

wallrj · 2018-04-12T09:22:22Z

The E2E test failures all look like unrelated flakes: Elasticsearch pod didn't become ready, Cassandra node didn't become ready... maybe because the test infrastructure was overloaded.

As discussed, it'd be worth adding some E2E tests for cluster pausing though.

munnerz · 2018-04-12T09:25:30Z

3/3 failed tests, even if they are on a flake, isn't a healthy state to be in. FWIW, our test infrastructure is quite over-provisioned at the moment so I doubt it's down to overload.

Perhaps we've got some other component here that's flaky?

jetstack-bot · 2018-05-08T15:53:28Z

@kragniz: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
navigator-e2e-v1-10	`c5c798e`	link	`/test e2e v1.10`


wallrj

Some things to consider before merging:

Add E2E tests for this feature.
Add documentation, describing the conditions.
Looks like we should add API validation to prevent other Spec attributes from being updated by default, except paused, version and nodepool map items.

Create followup issues if you prefer to tackle those later.
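As a rough idea of what such validation could look like, the sketch below blanks the fields that are allowed to change and compares the remainder. The `clusterSpec` and `nodePool` types are reduced, hypothetical versions of the real spec, and treating the whole node pool list as mutable is a simplifying assumption.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// nodePool and clusterSpec are hypothetical, reduced versions of the real spec types.
type nodePool struct {
	Name     string
	Replicas int
}

type clusterSpec struct {
	Paused    bool
	Version   string
	Image     string
	NodePools []nodePool
}

// validateSpecUpdate rejects an update if any field outside the allowed set changed.
// Both arguments are copies, so blanking the mutable fields does not affect the caller.
func validateSpecUpdate(old, updated clusterSpec) error {
	old.Paused, updated.Paused = false, false
	old.Version, updated.Version = "", ""
	old.NodePools, updated.NodePools = nil, nil
	if !reflect.DeepEqual(old, updated) {
		return errors.New("spec is immutable except for paused, version and nodePools")
	}
	return nil
}

func main() {
	base := clusterSpec{Version: "3.11.2", Image: "example/cassandra"}

	pause := base
	pause.Paused = true
	fmt.Println("pause:", validateSpecUpdate(base, pause))

	reimage := base
	reimage.Image = "example/other"
	fmt.Println("change image:", validateSpecUpdate(base, reimage))
}
```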

wallrj · 2018-05-09T10:53:24Z

pkg/controllers/cassandra/cluster_control.go

+		)
+	}
+
+	return nil


Looks like this function never returns an error, so maybe don't return anything.

wallrj · 2018-05-09T10:53:49Z

pkg/controllers/cassandra/cluster_control.go

+	err = e.syncPausedConditions(c)
+	if err != nil {
+		return err
+	}


Consider removing the returned err and err check here.

wallrj · 2018-05-09T10:55:13Z

pkg/controllers/cassandra/cluster_control.go

+	}
+
+	if c.Spec.Paused == true {
+		glog.V(4).Infof("defaultCassandraClusterControl.Sync skipped, since cluster is paused")


Consider making this a warning. In production, the V(4) (verbosity level 4) logs might be discarded.

I've changed it to glog.Infof

wallrj · 2018-05-09T10:58:28Z

pkg/controllers/cassandra/cluster_control.go

+			v1alpha1.ResumedClusterReason,
+			"Cluster is resumed",
+		)
+	}


Should there be a final condition here to set the v1alpha1.ClusterConditionProgressing condition for a cluster that has never been paused?

I think this should be done by other components that progress the cluster.

For example, the deployment controller sets conditions such as:

- lastTransitionTime: 2018-05-08T14:14:46Z lastUpdateTime: 2018-05-08T14:14:56Z message: ReplicaSet "navigator-navigator-apiserver-86c554b4f" has successfully progressed. reason: NewReplicaSetAvailable status: "True" type: Progressing

maybe we should add similar conditions for each action performed on a cluster?

wallrj · 2018-05-09T10:58:49Z

pkg/controllers/elasticsearch/cluster_control.go

+	}
+
+	if c.Spec.Paused == true {
+		glog.V(4).Infof("defaultElasticsearchClusterControl.Sync skipped, since cluster is paused")


Maybe a warning instead (see above)

wallrj · 2018-05-15T13:27:00Z

/lgtm
/approve

jetstack-bot · 2018-05-15T13:27:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wallrj

Needs approval from an approver in each of these files:

~~OWNERS~~ [wallrj]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jetstack-bot added release-note size/M labels Mar 27, 2018

jetstack-bot requested a review from munnerz March 27, 2018 11:46

munnerz reviewed Mar 27, 2018

kragniz force-pushed the pause-clusters branch from c5c798e to 9fb3a4e Compare March 27, 2018 15:25

munnerz reviewed Mar 27, 2018

munnerz added this to the v0.1 milestone Mar 27, 2018

munnerz assigned kragniz and munnerz and unassigned kragniz Mar 27, 2018

wallrj reviewed Apr 3, 2018

kragniz force-pushed the pause-clusters branch from 9fb3a4e to 8e39550 Compare April 10, 2018 10:36

jetstack-bot added size/L and removed size/M labels Apr 10, 2018

munnerz reviewed Apr 11, 2018

kragniz force-pushed the pause-clusters branch from a922f46 to 44c0dd2 Compare April 11, 2018 16:05

wallrj suggested changes Apr 12, 2018

kragniz force-pushed the pause-clusters branch from 44c0dd2 to ab482f6 Compare April 16, 2018 15:07

jetstack-bot added the needs-rebase label Apr 18, 2018

kragniz added 4 commits May 8, 2018 11:39

Add paused fields to cluster specs

f7e9a04

Skip syncing cassandra cluster if paused

ac27e8c

Skip syncing es cluster if paused

8217464

Add docs

653a680

kragniz added 7 commits May 8, 2018 12:06

Add Condition to condition type names

c0708e2

Add condition helper functions

74f0f97

Update cassandra condition on pause

2a6ab30

Update es condition on pause

50478aa

DeepCopy

088e626

Remove old event messages

8ae0646

Fix typo

e61de02

kragniz force-pushed the pause-clusters branch from ab482f6 to e61de02 Compare May 8, 2018 11:06

jetstack-bot removed the needs-rebase label May 8, 2018

kragniz added 5 commits May 8, 2018 14:09

Set conditions to true and false

af6fbde

Remove unused variable

52f11c4

Check return values for checkPausedConditions

6795e07

Set conditions to true and false

88f5f74

Remove extra status sync

be8157a

Remove extra DeepCopy

c9fd39c

wallrj approved these changes May 9, 2018

kragniz added 2 commits May 9, 2018 14:15

Remove extra error return

ff86c0d

Add doc about paused conditions

c3fded2

jetstack-bot assigned wallrj May 15, 2018

jetstack-bot added the lgtm label May 15, 2018

jetstack-bot added the approved label May 15, 2018

This was referenced May 15, 2018

Cluster Pausing feature is not tested #356

Open

The API server should forbid changes to cluster spec by default #357

Open

jetstack-bot merged commit 80ade4c into jetstack:master May 15, 2018

wallrj modified the milestones: v0.1, v0.2 May 15, 2018

wallrj added the kind/feature label May 15, 2018
