Skip to content
This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Allow pausing clusters #301

Merged
merged 25 commits into from
May 15, 2018
Merged

Allow pausing clusters #301

merged 25 commits into from
May 15, 2018

Conversation

kragniz
Copy link
Contributor

@kragniz kragniz commented Mar 27, 2018

What this PR does / why we need it: this allows users to 'pause' a cluster, making navigator skip syncing. This allows for manual intervention in case something goes wrong.

Release note:

Allow pausing clusters

@@ -32,6 +32,7 @@ type CassandraClusterSpec struct {
NodePools []CassandraClusterNodePool
Version version.Version
Image *ImageSpec
Paused *bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a pointer? (i.e. is there a 3rd 'unset' state that we need to support?).

IMO paused: false and paused: true are expressive enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this so that omitempty hides the field when it's not set, but I don't think that's actually necessary. I'll change it to bool

@@ -39,6 +39,9 @@ type CassandraClusterSpec struct {

// The version of the database to be used for nodes in the cluster.
Version version.Version `json:"version"`

// If set to true, no actions will take place on this cluster.
Paused *bool `json:"paused,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a part of the NavigatorClusterConfig structure?

@@ -73,6 +73,12 @@ func NewControl(
}

func (e *defaultCassandraClusterControl) Sync(c *v1alpha1.CassandraCluster) error {
if c.Spec.Paused != nil && *c.Spec.Paused == true {
glog.V(4).Infof("defaultCassandraClusterControl.Sync skipped, since cluster is paused")
e.recorder.Eventf(c, apiv1.EventTypeNormal, "spec.paused", "Cluster paused, not syncing")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The event name here (spec.paused) should be pulled into a const, and formatted similar to the other events.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we also need to move them into pkg/apis/navigator/v1alpha1 at some point)

@@ -29,13 +29,16 @@ import (
const (
errorSync = "ErrSync"

pauseField = "spec.paused"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this to be like the others? (i.e. caps case) e.g. Paused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to be consistent with event messages like this one:

cass-example2-ringnodes-0.151fd039c8fa1dd1   Pod       spec.initContainers{install-pilot}   Normal    Started   kubelet, minikube   Started container

But I'll change it if Paused is the preferred way

@munnerz munnerz added this to the v0.1 milestone Mar 27, 2018
@munnerz munnerz assigned kragniz and munnerz and unassigned kragniz Mar 27, 2018
@munnerz
Copy link
Contributor

munnerz commented Apr 3, 2018

Interestingly, the deployment controller does not pause scale actions when spec.paused is set. It also does not log an event about it, but instead sets a DeploymentProgressing condition (in this function: https://github.com/kubernetes/kubernetes/blob/1102fd0dcbc4a408045e8d1bc42f056909e72322/pkg/controller/deployment/sync.go#L75).

Given the nature of a scale event in an ES or C* cluster, I think we should also pause scale events in the case of our pause feature, but maybe we want to borrow the behaviour around setting a condition/logging an event (i.e. don't log an event, instead set a condition on the resource).

We've not really utilised conditions up until now, but they do provide a nice and clear way to provide insight into the state of a resource (and save hammering the events API with duplicate events). An event probably shouldn't be used to report the state of objects, but instead used to report information on an action (perhaps not necessarily just 'Actions' as in our definition of an action) that has been taken (whether it has succeeded, or conversely if it errored then reporting some information here too). There may be some information that makes sense to duplicate as both a condition and an event, but I'm not sure. I guess it depends on the nature of the action.

Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Maybe add the spec.paused: false the quick-start manifests to make it clear that the option is available.
    (maybe commented out)
  • What about Pilot control loop? Should it stop when the cluster is paused? Perhaps a scale down has been started, by mistake but you want to quickly prevent the pilot from commencing Decommissioning.

@wallrj
Copy link
Member

wallrj commented Apr 3, 2018

Some interesting thoughts here on PauseCondition: kubernetes/kubernetes#58465 (comment)

ClusterProgressing ClusterConditionType = "Progressing"
// ReplicaFailure is added in a cluster when one of its pods fails to be created
// or deleted.
ClusterReplicaFailure ClusterConditionType = "ReplicaFailure"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see where we actually use/set these conditions?

Also, update comments (assuming they've come from Deployment defs) and remove conditions we don't currently implement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually implemented now! (I should have added wip back to the title)

@kragniz
Copy link
Contributor Author

kragniz commented Apr 11, 2018

/retest

Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kragniz

I spotted a few places where errors aren't being handled and suggested some further documentation of the conditions.

Now I see it in action, the conditions do seem a bit confusing to me, and I think I'd probably opt for a simple CassandraCluster.Status.Paused flag ....but we've already had that discussion :-)....this is great for now.

Please answer or address the comments below.

----------------

A cluster can be paused by setting ``spec.paused`` to ``true``.
When this is set, navigator will not perform any actions on the cluster, allowing for manual intervention.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "navigator" > "Navigator"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be worth adding a paragraph about the conditions you can expect to find when the cluster is paused and when it has resumed.

// checkPausedConditions checks if the given cluster is paused or not and adds an appropriate condition.
func (e *defaultCassandraClusterControl) checkPausedConditions(c *v1alpha1.CassandraCluster) error {
cond := c.Status.GetStatusCondition(v1alpha1.ClusterConditionProgressing)
pausedCondExists := cond != nil && cond.Reason == v1alpha1.PausedClusterReason
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also look for cond.Status == True here?

if c.Spec.Paused && !pausedCondExists {
c.Status.UpdateStatusCondition(
v1alpha1.ClusterConditionProgressing,
v1alpha1.ConditionUnknown,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be ConditionFalse i.e. The cluster is not progressing

} else if !c.Spec.Paused && pausedCondExists {
c.Status.UpdateStatusCondition(
v1alpha1.ClusterConditionProgressing,
v1alpha1.ConditionUnknown,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And should this be ConditionTrue ?

}

var err error
c, err = e.state.NavigatorClientset.NavigatorV1alpha1().CassandraClusters(c.Namespace).UpdateStatus(c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit c is unused.

c = c.DeepCopy()
var err error

e.checkPausedConditions(c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check and return the error here so that the sync and checkPausedConditions will be retried in case of e.g. conflict errors.

func (e *defaultElasticsearchClusterControl) Sync(c *v1alpha1.ElasticsearchCluster) (v1alpha1.ElasticsearchClusterStatus, error) {
c = c.DeepCopy()
var err error

e.checkPausedConditions(c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And check and return error here too.

c, err = e.navigatorClient.NavigatorV1alpha1().ElasticsearchClusters(c.Namespace).UpdateStatus(c)
return err
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if this method instead mutated the Cluster Status in place and if there was an interface for cluster condition operations, it could be made into a generic function shared between cassandra and elasticsearch.
And the status is updated lower by Controller.sync lower in the stack, so we could avoid a separate API call.

Happy if you prefer to leave this for a followup branch.

@wallrj
Copy link
Member

wallrj commented Apr 12, 2018

The E2E test failures all look like unrelated flakes: ElasticSearch pod didn't become ready, Cassandra node didn't become ready....maybe because the test infrastructure was overloaded.

As discussed, it'd be worth adding some E2E tests for cluster pausing though.

@munnerz
Copy link
Contributor

munnerz commented Apr 12, 2018

3/3 failed tests, even if they are on a flake, isn't a healthy state to be in. FWIW, our test infrastructure is quite over-provisioned at the moment so I doubt it's down to overload.

Perhaps we've got some other component here that's flakey?

@jetstack-bot
Copy link
Collaborator

jetstack-bot commented May 8, 2018

@kragniz: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
navigator-e2e-v1-10 c5c798e link /test e2e v1.10

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Member

@wallrj wallrj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things to consider before merging:

  • Add E2E tests for this feature.
  • Add documentation, describing the conditions.
  • Looks like we should add API validation to prevent other Spec attributes from being updated by default....except paused, version and nodepool map items.

Create followup issues if you prefer to tackle those later.

)
}

return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this function never returns an error, so maybe don't return anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

err = e.syncPausedConditions(c)
if err != nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider removing the returned err and err check here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

if c.Spec.Paused == true {
glog.V(4).Infof("defaultCassandraClusterControl.Sync skipped, since cluster is paused")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this a warning. In production, the L4 logs might be discarded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to glog.Infof

v1alpha1.ResumedClusterReason,
"Cluster is resumed",
)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a final condition here to set the v1alpha1.ClusterConditionProgressing condition for a cluster that has never been paused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be done by other components that progress the cluster.

For example, the deployment controller sets conditions such as:

  - lastTransitionTime: 2018-05-08T14:14:46Z
    lastUpdateTime: 2018-05-08T14:14:56Z
    message: ReplicaSet "navigator-navigator-apiserver-86c554b4f" has successfully
      progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing

maybe we should add similar conditions for each action performed on a cluster?

}

if c.Spec.Paused == true {
glog.V(4).Infof("defaultElasticsearchClusterControl.Sync skipped, since cluster is paused")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a warning instead (see above)

@wallrj
Copy link
Member

wallrj commented May 15, 2018

/lgtm
/approve

@jetstack-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wallrj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants