pkg/apis/navigator/types.go
@@ -32,6 +32,7 @@ type CassandraClusterSpec struct {
	NodePools []CassandraClusterNodePool
	Version version.Version
	Image *ImageSpec
	Paused *bool
Does this need to be a pointer? (i.e. is there a 3rd 'unset' state that we need to support?) IMO `paused: false` and `paused: true` are expressive enough.
I did this so that `omitempty` hides the field when it's not set, but I don't think that's actually necessary. I'll change it to `bool`.
pkg/apis/navigator/v1alpha1/types.go
@@ -39,6 +39,9 @@ type CassandraClusterSpec struct {

	// The version of the database to be used for nodes in the cluster.
	Version version.Version `json:"version"`

	// If set to true, no actions will take place on this cluster.
	Paused *bool `json:"paused,omitempty"`
Can this be a part of the `NavigatorClusterConfig` structure?
@@ -73,6 +73,12 @@ func NewControl(
}

func (e *defaultCassandraClusterControl) Sync(c *v1alpha1.CassandraCluster) error {
	if c.Spec.Paused != nil && *c.Spec.Paused == true {
		glog.V(4).Infof("defaultCassandraClusterControl.Sync skipped, since cluster is paused")
		e.recorder.Eventf(c, apiv1.EventTypeNormal, "spec.paused", "Cluster paused, not syncing")
The event name here (`spec.paused`) should be pulled into a const, and formatted similar to the other events.
(we also need to move them into `pkg/apis/navigator/v1alpha1` at some point)
@@ -29,13 +29,16 @@ import (
const (
	errorSync = "ErrSync"

	pauseField = "spec.paused"
Can you update this to be like the others? (i.e. caps case) e.g. `Paused`?
I was trying to be consistent with event messages like this one:

cass-example2-ringnodes-0.151fd039c8fa1dd1 Pod spec.initContainers{install-pilot} Normal Started kubelet, minikube Started container

But I'll change it if `Paused` is the preferred way.
Interestingly, the deployment controller does not pause scale actions when paused. Given the nature of a scale event in an ES or C* cluster, I think we should also pause scale events in the case of our resources.

We've not really utilised conditions up until now, but they do provide a nice and clear way to provide insight into the state of a resource (and save hammering the events API with duplicate events). An event probably shouldn't be used to report the state of objects, but instead used to report information on an action (perhaps not necessarily just 'Actions' as in our definition of an action) that has been taken: whether it has succeeded, or conversely, if it errored, then reporting some information here too. There may be some information that makes sense to duplicate as both a condition and an event, but I'm not sure. I guess it depends on the nature of the action.
- Maybe add `spec.paused: false` to the quick-start manifests to make it clear that the option is available (maybe commented out).
- What about the Pilot control loop? Should it stop when the cluster is paused? Perhaps a scale down has been started by mistake, but you want to quickly prevent the pilot from commencing decommissioning.
Some interesting thoughts here on …
pkg/apis/navigator/types.go
	ClusterProgressing ClusterConditionType = "Progressing"
	// ReplicaFailure is added in a cluster when one of its pods fails to be created
	// or deleted.
	ClusterReplicaFailure ClusterConditionType = "ReplicaFailure"
I can't see where we actually use/set these conditions?
Also, update comments (assuming they've come from Deployment defs) and remove conditions we don't currently implement.
Actually implemented now! (I should have added wip back to the title)
/retest
Thanks @kragniz
I spotted a few places where errors aren't being handled and suggested some further documentation of the conditions.
Now I see it in action, the conditions do seem a bit confusing to me, and I think I'd probably opt for a simple `CassandraCluster.Status.Paused` flag... but we've already had that discussion :-) ...this is great for now.
Please answer or address the comments below.
docs/paused-clusters.rst
----------------

A cluster can be paused by setting ``spec.paused`` to ``true``.
When this is set, navigator will not perform any actions on the cluster, allowing for manual intervention.
typo "navigator" > "Navigator"
I think it'd be worth adding a paragraph about the conditions you can expect to find when the cluster is paused and when it has resumed.
// checkPausedConditions checks if the given cluster is paused or not and adds an appropriate condition.
func (e *defaultCassandraClusterControl) checkPausedConditions(c *v1alpha1.CassandraCluster) error {
	cond := c.Status.GetStatusCondition(v1alpha1.ClusterConditionProgressing)
	pausedCondExists := cond != nil && cond.Reason == v1alpha1.PausedClusterReason
Should we also look for `cond.Status == True` here?
	if c.Spec.Paused && !pausedCondExists {
		c.Status.UpdateStatusCondition(
			v1alpha1.ClusterConditionProgressing,
			v1alpha1.ConditionUnknown,
Shouldn't this be `ConditionFalse`, i.e. the cluster is not progressing?
	} else if !c.Spec.Paused && pausedCondExists {
		c.Status.UpdateStatusCondition(
			v1alpha1.ClusterConditionProgressing,
			v1alpha1.ConditionUnknown,
And should this be `ConditionTrue`?
	}

	var err error
	c, err = e.state.NavigatorClientset.NavigatorV1alpha1().CassandraClusters(c.Namespace).UpdateStatus(c)
nit: `c` is unused.
	c = c.DeepCopy()
	var err error

	e.checkPausedConditions(c)
Check and return the error here so that the sync and checkPausedConditions will be retried in case of e.g. conflict errors.
func (e *defaultElasticsearchClusterControl) Sync(c *v1alpha1.ElasticsearchCluster) (v1alpha1.ElasticsearchClusterStatus, error) {
	c = c.DeepCopy()
	var err error

	e.checkPausedConditions(c)
And check and return error here too.
	c, err = e.navigatorClient.NavigatorV1alpha1().ElasticsearchClusters(c.Namespace).UpdateStatus(c)
	return err
}
Maybe if this method instead mutated the cluster Status in place, and if there was an interface for cluster condition operations, it could be made into a generic function shared between cassandra and elasticsearch. The status is then updated by `Controller.sync` lower in the stack, so we could avoid a separate API call.

Happy if you prefer to leave this for a followup branch.
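A self-contained sketch of that idea, with the condition operations behind an interface so one `checkPausedConditions` can serve both controllers. All types here are simplified stand-ins (the real ones live in `pkg/apis/navigator/v1alpha1`), and it also folds in the earlier review points: checking `cond.Status`, and using `ConditionFalse`/`ConditionTrue` rather than `ConditionUnknown`:

```go
package main

import "fmt"

// Simplified stand-ins for the v1alpha1 condition types.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

const (
	ClusterConditionProgressing = "Progressing"
	PausedClusterReason         = "ClusterPaused"
	ResumedClusterReason        = "ClusterResumed"
)

type Condition struct {
	Type   string
	Status ConditionStatus
	Reason string
}

// conditionHolder is the hypothetical shared interface; both cluster
// status types would implement it.
type conditionHolder interface {
	IsPaused() bool
	GetStatusCondition(condType string) *Condition
	UpdateStatusCondition(condType string, status ConditionStatus, reason, message string)
}

// checkPausedConditions mutates the status in place; the caller
// persists it lower in the stack, avoiding a separate UpdateStatus
// API call.
func checkPausedConditions(c conditionHolder) {
	cond := c.GetStatusCondition(ClusterConditionProgressing)
	// Also require Status == ConditionFalse, per the comment above.
	pausedCond := cond != nil && cond.Reason == PausedClusterReason && cond.Status == ConditionFalse
	switch {
	case c.IsPaused() && !pausedCond:
		c.UpdateStatusCondition(ClusterConditionProgressing, ConditionFalse, PausedClusterReason, "Cluster is paused")
	case !c.IsPaused() && pausedCond:
		c.UpdateStatusCondition(ClusterConditionProgressing, ConditionTrue, ResumedClusterReason, "Cluster is resumed")
	}
}

// fakeCluster is a toy implementation for demonstration (the message
// argument is ignored here for brevity).
type fakeCluster struct {
	Paused     bool
	Conditions []Condition
}

func (f *fakeCluster) IsPaused() bool { return f.Paused }

func (f *fakeCluster) GetStatusCondition(t string) *Condition {
	for i := range f.Conditions {
		if f.Conditions[i].Type == t {
			return &f.Conditions[i]
		}
	}
	return nil
}

func (f *fakeCluster) UpdateStatusCondition(t string, s ConditionStatus, reason, _ string) {
	if c := f.GetStatusCondition(t); c != nil {
		c.Status, c.Reason = s, reason
		return
	}
	f.Conditions = append(f.Conditions, Condition{Type: t, Status: s, Reason: reason})
}

func main() {
	c := &fakeCluster{Paused: true}
	checkPausedConditions(c)
	fmt.Println(c.Conditions[0].Status, c.Conditions[0].Reason) // False ClusterPaused

	c.Paused = false
	checkPausedConditions(c)
	fmt.Println(c.Conditions[0].Status, c.Conditions[0].Reason) // True ClusterResumed
}
```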
The E2E test failures all look like unrelated flakes: ElasticSearch pod didn't become ready, Cassandra node didn't become ready... maybe because the test infrastructure was overloaded. As discussed, it'd be worth adding some E2E tests for cluster pausing though.
3/3 failed tests, even if they are on a flake, isn't a healthy state to be in. FWIW, our test infrastructure is quite over-provisioned at the moment, so I doubt it's down to overload. Perhaps we've got some other component here that's flakey?
@kragniz: The following test failed:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Some things to consider before merging:
- Add E2E tests for this feature.
- Add documentation, describing the conditions.
- Looks like we should add API validation to prevent other Spec attributes from being updated by default... except paused, version and nodepool map items.
Create followup issues if you prefer to tackle those later.
		)
	}

	return nil
Looks like this function never returns an error, so maybe don't return anything.
done
	err = e.syncPausedConditions(c)
	if err != nil {
		return err
	}
Consider removing the returned err and err check here.
done
	}

	if c.Spec.Paused == true {
		glog.V(4).Infof("defaultCassandraClusterControl.Sync skipped, since cluster is paused")
Consider making this a warning. In production, the L4 logs might be discarded.
I've changed it to `glog.Infof`
			v1alpha1.ResumedClusterReason,
			"Cluster is resumed",
		)
	}
Should there be a final condition here to set the `v1alpha1.ClusterConditionProgressing` condition for a cluster that has never been paused?
I think this should be done by other components that progress the cluster.
For example, the deployment controller sets conditions such as:
- lastTransitionTime: 2018-05-08T14:14:46Z
  lastUpdateTime: 2018-05-08T14:14:56Z
  message: ReplicaSet "navigator-navigator-apiserver-86c554b4f" has successfully progressed.
  reason: NewReplicaSetAvailable
  status: "True"
  type: Progressing
maybe we should add similar conditions for each action performed on a cluster?
	}

	if c.Spec.Paused == true {
		glog.V(4).Infof("defaultElasticsearchClusterControl.Sync skipped, since cluster is paused")
Maybe a warning instead (see above)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wallrj

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
What this PR does / why we need it: this allows users to 'pause' a cluster, making Navigator skip syncing. This allows for manual intervention in case something goes wrong.
Release note: