This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Cassandra scale in action #289

Open · wants to merge 9 commits into master

Conversation

@kragniz (Contributor) commented Mar 16, 2018

What this PR does / why we need it: add an action for scaling in Cassandra clusters

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #285

Special notes for your reviewer: this is based on top of #256, will be rebased on master when that gets merged

Allow scaling in Cassandra clusters

@jetstack-bot (Collaborator):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: munnerz

Assign the PR to them by writing /assign @munnerz in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kragniz changed the title from "WIP: Cassandra scale in action" to "Cassandra scale in action" on May 9, 2018
@wallrj (Member) left a comment


Thanks @kragniz

return fmt.Errorf(
"Not enough pilots to scale down: %d",
len(pilots),
)

Maybe log an error here and return nil.
The controller interprets an error to mean that the action should be retried.
But retrying will never succeed in this situation... I think.
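
For illustration, the body of that branch might become something like the following (a sketch only, assuming a glog-style logger is available in this file):

// Sketch: inside the existing "not enough pilots" check, log and return nil
// so the controller does not requeue an action that can never succeed.
glog.Errorf("not enough pilots to scale down: %d", len(pilots))
return nil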

corev1.EventTypeNormal,
a.Name(),
"Marked cassandra pilot %s for decommission", p.Name,
)

Log the namespace and name of the decommissioned pilot here.
And maybe the event should be recorded against the cluster object, so that an administrator can see all the events for a particular cluster.
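
A rough sketch of that suggestion, assuming the action has access to an events recorder and the cluster object (`recorder` and `a.Cluster` are assumed names, not necessarily what this PR uses):

// Sketch: record the event on the cluster rather than the pilot, and include
// the pilot's namespace and name in the message.
recorder.Eventf(
	a.Cluster,
	corev1.EventTypeNormal,
	a.Name(),
	"Marked cassandra pilot %s/%s for decommission",
	p.Namespace, p.Name,
)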

"is greater than the existing StatefulSet.Replicas value (%d)",
a.NodePool.Replicas, *ss.Spec.Replicas,
)
}

  • Might be neater to start the function with checks for all the bad inputs and exit early. Then end with the happy path.
  • And I think this error should be logged rather than returned, to prevent the controller retrying this doomed action.
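
Sketched out, that guard-clause shape might look like this (names are taken from the snippet above, types are assumptions, and the explicit int64 conversions are only there because the real field types are not shown here):

// Sketch: validate inputs first and bail out early; log (rather than return)
// errors that a retry can never fix, then end with the happy path.
if ss.Spec.Replicas == nil {
	glog.Errorf("StatefulSet %s/%s has no replicas value set", ss.Namespace, ss.Name)
	return nil
}
if int64(a.NodePool.Replicas) > int64(*ss.Spec.Replicas) {
	glog.Errorf(
		"desired replicas (%d) is greater than the existing StatefulSet.Replicas value (%d)",
		a.NodePool.Replicas, *ss.Spec.Replicas,
	)
	return nil
}
// ...happy path: scale the StatefulSet down to a.NodePool.Replicas...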

}
}
errs = append(errs, fmt.Errorf("pilot %q not found", pilotName))
}

I couldn't figure out why we need this inner loop.
I imagined we'd get a list of all pilots for the nodepool and decommission all the ones with an index higher than the desired replica count.
Maybe this can be simplified? Or else add a function doc explaining the algorithm.
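
For discussion, a hedged sketch of that flattened approach (PilotIndex and markPilotForDecommission are placeholders, not helpers from this PR):

// Sketch: decommission every pilot in the node pool whose ordinal is at or
// above the desired replica count, rather than matching names in a nested loop.
for _, p := range pilots {
	idx, err := PilotIndex(p) // placeholder: ordinal parsed from the pilot name
	if err != nil {
		errs = append(errs, err)
		continue
	}
	if int64(idx) >= int64(a.NodePool.Replicas) {
		if err := markPilotForDecommission(p); err != nil {
			errs = append(errs, err)
		}
	}
}
return utilerror.NewAggregate(errs)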

Message: utilerror.NewAggregate(errs).Error(),
Reason: metav1.StatusReasonNotFound,
},
}

Never seen k8sErrors.StatusError used before. What's the purpose? I looked to see if we check for this error elsewhere, but couldn't see any such checks.
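
For context (this may or may not be the intent here): a k8sErrors.StatusError carries a metav1.Status, so a caller can match on the reason with the helpers in k8s.io/apimachinery/pkg/api/errors, e.g.:

// Sketch of the kind of check a StatusError enables; `err` is whatever the
// action returned, and no such check currently exists in this PR.
if k8sErrors.IsNotFound(err) {
	// the missing-pilot case could be handled specially here
}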

return &actions.ScaleIn{
Cluster: c,
NodePool: &np,
}, nil

I'm keen to keep this function unchanged :-)
I think it should be possible to instead check nps.ReadyReplicas here...if not, let's discuss.
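
A hedged sketch of what that could look like while keeping the existing NextAction signature (Status.NodePools and ReadyReplicas are assumed field names):

// Sketch: decide to scale in from the node pool status already carried on the
// cluster object, instead of threading a StatefulSet lister through NextAction.
nps, ok := c.Status.NodePools[np.Name]
if ok && nps.ReadyReplicas > np.Replicas {
	return &actions.ScaleIn{
		Cluster:  c,
		NodePool: &np,
	}, nil
}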

a, err := cassandra.NextAction(c, state.StatefulSetLister)
if err != nil {
t.Errorf("error calculating next action: %v", err)
}

And if we could keep the NextAction signature unchanged, we wouldn't need to add this extra stuff to the test.

func run(args ...string) error {
cmd := exec.Command(args[0], args[1:]...)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr

I think we should capture the stdout / stderr here and also add it to the error that we return.
Otherwise I think it'll be difficult to diagnose decommission failures.
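
A sketch of that change, using CombinedOutput so the command's output ends up in the returned error (glog, fmt, and strings imports assumed):

// Sketch: capture stdout/stderr and fold them into the error so that
// decommission failures can be diagnosed from the pilot logs.
func run(args ...string) error {
	cmd := exec.Command(args[0], args[1:]...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("%q failed: %v, output: %s", strings.Join(args, " "), err, out)
	}
	glog.V(4).Infof("%q succeeded, output: %s", strings.Join(args, " "), out)
	return nil
}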

return run("nodetool", "decommission")
}

return nil

I think we should return an error if the node state != NodeStateNormal.
Otherwise the unhealthy node will be marked as decommissioned in the pilot status, right?
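
For example, something along these lines (NodeStateNormal comes from the comment above; nodeState() is a placeholder, not the PR's exact code):

// Sketch: only run the decommission when the node is in the normal state;
// otherwise return an error instead of nil, so an unhealthy node is not
// recorded as decommissioned in the pilot status.
state, err := p.nodeState()
if err != nil {
	return err
}
if state != NodeStateNormal {
	return fmt.Errorf("refusing to decommission node in state %q", state)
}
return run("nodetool", "decommission")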

if p.decommissionInProgress || true {
glog.Info("decommission in progress, reporting success for liveness")
return nil
}

Yikes, sorry about this.
I remember you saying that jolokia nodetool requests fail if the node has been decommissioned.
I wonder if we need to rethink this check.
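
One possible shape for a rethought check, gating only on actual decommission state rather than the unconditional short-circuit above (statusDecommissioned is a placeholder for however the pilot status records this):

// Sketch: treat a node that is decommissioning, or already recorded as
// decommissioned, as "live", since jolokia/nodetool requests fail once the
// node has left the ring; fall through to the normal nodetool check otherwise.
if p.decommissionInProgress || p.statusDecommissioned() {
	glog.Info("decommission in progress, reporting success for liveness")
	return nil
}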

@jetstack-bot (Collaborator):

@kragniz: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
