This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Allow configuring number of seed nodes per nodepool #264

Open · wants to merge 12 commits into master from configurable-seed-number

Conversation

kragniz
Contributor

@kragniz kragniz commented Feb 27, 2018

This adds a new field, seeds, to Cassandra nodepools, which controls the number of seed nodes in that nodepool. It defaults to 1 and cannot be greater than the number of replicas.
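
For illustration, a nodepool using the new field might look like this (a sketch; the apiVersion and the fields around seeds are assumed from the rest of this PR, not quoted from it):

    apiVersion: navigator.jetstack.io/v1alpha1
    kind: CassandraCluster
    metadata:
      name: test
    spec:
      nodePools:
      - name: ringnodes
        replicas: 3
        seeds: 2   # defaults to 1 if omitted; must not exceed replicas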

Allow configuring number of seed nodes per nodepool

@kragniz
Contributor Author

kragniz commented Feb 27, 2018

I'll add an e2e test for this once #258 is merged

@kragniz
Contributor Author

kragniz commented Feb 28, 2018

/retest

@jetstack-ci-bot
Contributor

@kragniz PR needs rebase

hack/e2e.sh Outdated
<(envsubst \
'$NAVIGATOR_IMAGE_REPOSITORY:$NAVIGATOR_IMAGE_TAG:$NAVIGATOR_IMAGE_PULLPOLICY:$CASS_NAME:$CASS_REPLICAS:$CASS_CQL_PORT' \
< "${SCRIPT_DIR}/testdata/cass-cluster-test.template.yaml")
apply_cassandracluster
Contributor

Should we fail the test if this fails? (like we do above)

Contributor Author

👍

hack/e2e.sh Outdated
@@ -325,6 +320,13 @@ function test_cassandracluster() {
fail_test "Second cassandra node did not become ready"
fi

seed_label=$(kubectl get pods --namespace "${namespace}" \
cass-${CASS_NAME}-ringnodes-1 \
-o jsonpath='{.metadata.labels.seed}')
Contributor

(may not be something introduced by this PR, but we should always be using namespaced labels, e.g. navigator.jetstack.io/cassandra-seed=true)

Contributor Author

👍 this needs rebasing when #270 lands, which makes the label navigator.jetstack.io/cassandra-seed
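
After that rebase, the jsonpath lookup would need the dots in the label key escaped; a sketch of what the updated check might look like (same pod and variables as the hunk above):

    seed_label=$(kubectl get pods --namespace "${namespace}" \
        "cass-${CASS_NAME}-ringnodes-1" \
        -o jsonpath='{.metadata.labels.navigator\.jetstack\.io/cassandra-seed}')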

@@ -7,3 +7,13 @@ import (
func addDefaultingFuncs(scheme *runtime.Scheme) error {
    return RegisterDefaults(scheme)
}

func SetDefaults_CassandraClusterSpec(spec *CassandraClusterSpec) {
Contributor

Why default the cassandraclusterspec if we're only going to iterate over each node pool in it anyway?

Surely this would be better as a SetDefaults_CassandraClusterNodePool function?

This leads on to my next question: what if a user wants to ensure there are 0 seeds in a given node pool (because they have seeds in other node pools)? Is this something we want to disallow?

Contributor Author

Yep, it'd be cleaner with SetDefaults_CassandraClusterNodePool.

We need to ensure there's at least one seed across all nodepools. We could reduce the minimum to 0 in a particular nodepool if we find a use case for it (I can't think of a reason to do so).
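
A minimal sketch of the per-nodepool defaulting function discussed here, assuming Seeds is still a plain int64 at this point (see the later switch to a pointer):

    func SetDefaults_CassandraClusterNodePool(np *CassandraClusterNodePool) {
        // Treat an unset seeds field as a request for a single seed node.
        if np.Seeds == 0 {
            np.Seeds = 1
        }
    }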

Contributor

Cool - I've no issue requiring a minimum of one seed per node pool for now. We can revisit it in future if needed, like you say.

Contributor Author

@kragniz kragniz Mar 1, 2018

I've opened #272 so we remember


if np.Seeds < 0 {
    allErrs = append(allErrs,
        field.Invalid(fldPath.Child("seeds"), np.Seeds, "number of seeds must be 1 or greater"),
Contributor

... must be greater than or equal to 1

}

// only label if the current label is incorrect
if pod.Labels["seed"] != desiredLabel {
Contributor

yep, not something created by this PR, but seed isn't an acceptable label imo. difficult for a user to understand why it is there, and could be removed by accident.
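
For illustration, the key could live in one constant so it is set and checked consistently; a hedged sketch (the constant and helper names are hypothetical, the key matches the one #270 introduces, and pod is a *v1.Pod from k8s.io/api/core/v1):

    // SeedLabelKey is the namespaced label marking a pod as a Cassandra seed node.
    const SeedLabelKey = "navigator.jetstack.io/cassandra-seed"

    // setSeedLabel updates the pod's seed label and reports whether it changed.
    func setSeedLabel(pod *v1.Pod, desired string) bool {
        if pod.Labels == nil {
            pod.Labels = map[string]string{}
        }
        if pod.Labels[SeedLabelKey] == desired {
            return false // already correct, nothing to do
        }
        pod.Labels[SeedLabelKey] = desired
        return true
    }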

@kragniz kragniz force-pushed the configurable-seed-number branch 2 times, most recently from 2302e30 to f3210eb Compare March 5, 2018 16:12
@kragniz kragniz force-pushed the configurable-seed-number branch 3 times, most recently from d2e1867 to 10b6cc6 Compare March 5, 2018 17:42
Member

@wallrj wallrj left a comment

Thanks @kragniz

Looks good.

A few questions:

  • I wonder if administrators will know how many seeds they need.
  • Did you consider just hard coding it to 3 seeds per nodepool?
  • Or perhaps making it a seedsPerNodes attribute, which might allow the administrator to ask for e.g. 1 seed per 10 C* nodes?
  • And in that case, perhaps it could be a cluster wide setting rather than a per-nodepool setting?
  • Is there any problem with having different numbers of seeds per rack / DC?

In addition I left a few comments and questions below. Please answer or address those before merging.

@@ -64,6 +64,11 @@ type CassandraClusterNodePool struct {
    // in this nodepool. If this is not set, a default will be selected.
    // +optional
    Datacenter string `json:"datacenter"`

    // Seeds specifies the number of seed nodes to alocate in this nodepool. By
Member

typo "alocate"

Contributor Author

Done

@@ -187,6 +187,15 @@ if [[ "test_elasticsearchcluster" = "${TEST_PREFIX}"* ]]; then
kube_delete_namespace_and_wait "${ES_TEST_NS}"
fi

function apply_cassandracluster() {
Member

❔ The other bash functions have parameters (of sorts) so it might be nice to be consistent. On the other hand, it looks like we're moving away from bash-based E2E tests, so I'm happy for you to leave this for now.

Member

And it'd be nice to have a check that all the environment variables that are about to be substituted are actually set.
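
A sketch of such a check using bash's ${VAR:?} expansion (variable names taken from the envsubst call above):

    # Fail fast with a clear message if any template variable is unset.
    for var in NAVIGATOR_IMAGE_REPOSITORY NAVIGATOR_IMAGE_TAG \
               NAVIGATOR_IMAGE_PULLPOLICY CASS_NAME CASS_REPLICAS CASS_CQL_PORT; do
        : "${!var:?${var} must be set}"
    done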

Contributor Author

Let's fix these e2e things when we move to ginkgo

if ! apply_cassandracluster
then
fail_test "Failed to apply cassandracluster"
fi
Member

❔ Personally, I wouldn't bother with the fail_test part here. I don't regard "applying the desired configuration" as an E2E test, just an implementation detail of the test, so I wouldn't bother adding a specific test error message.
Also, if this command fails, then all bets are off and we should probably just exit the test so that it's easy to debug the problem.
But if you prefer to leave it as-is until we land the ginkgo test branch, I'm happy with that.

// default to 1 seed if not specified
if np.Seeds == 0 {
    np.Seeds = 1
}
Member

Doesn't the validation prevent seeds: 0?

    allErrs = append(allErrs,
        field.Invalid(fldPath.Child("seeds"), np.Seeds, "number of seeds must be greater than or equal to 1"),
    )
}
Member

The message says "must be greater than or equal to 1" but the test is np.Seeds < 0
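
For the message and the check to agree, the comparison would need to be against 1; a sketch (names as in the hunk above):

    if np.Seeds < 1 {
        allErrs = append(allErrs,
            field.Invalid(fldPath.Child("seeds"), np.Seeds,
                "number of seeds must be greater than or equal to 1"))
    }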

pod, err := c.pods.Pods(cluster.Namespace).Get(fmt.Sprintf("%s-%d", set.Name, i))
if err != nil {
    glog.Warningf("Couldn't get stateful set pod: %v", err)
    return nil
Member

❔ I'm not sure about this. In other controllers we would return this error, but I can see why we'd want to continue and attempt to label other pods if one of the Gets fails.

Could we collect the errors and return an aggregate error at the end?

See k8s.io/apimachinery/pkg/util/errors NewAggregate
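
A hedged sketch of that pattern (the loop bounds and the labelling step are illustrative, not quoted from the PR; utilerrors is k8s.io/apimachinery/pkg/util/errors):

    var errs []error
    for i := int32(0); i < seeds; i++ {
        pod, err := c.pods.Pods(cluster.Namespace).Get(fmt.Sprintf("%s-%d", set.Name, i))
        if err != nil {
            // Remember the failure but keep going for the remaining pods.
            errs = append(errs, err)
            continue
        }
        // ... label pod as before ...
    }
    // NewAggregate returns nil for an empty list, otherwise one combined error.
    return utilerrors.NewAggregate(errs)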

set, err := c.statefulSetLister.StatefulSets(cluster.Namespace).Get(setName)
if err != nil {
    glog.Warningf("Couldn't get stateful set: %v", err)
    return nil
Member

Maybe an aggregate error here too.

navObjects: []runtime.Object{cluster},
cluster:    cluster,
assertions: func(t *testing.T, state *controllers.State) {
    CheckSeedLabel(pod2.Name, "", pod2.Namespace, t, state)
Member

❓ Add a test for deleting label when the seeds value is reduced.

Contributor Author

Done
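
A sketch of what that case could look like in the same fixture style (the kubeObjects field and the clusterWithSeeds/seedPod helpers are hypothetical):

    "removes label when seeds is reduced": {
        kubeObjects: []runtime.Object{seedPod(pod2)},       // pod2 carries the seed label
        navObjects:  []runtime.Object{clusterWithSeeds(1)}, // nodepool now wants only 1 seed
        cluster:     clusterWithSeeds(1),
        assertions: func(t *testing.T, state *controllers.State) {
            // pod index 1 is beyond the new seed count, so the label should be cleared
            CheckSeedLabel(pod2.Name, "", pod2.Namespace, t, state)
        },
    },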

@kragniz
Contributor Author

kragniz commented Mar 9, 2018

I've changed Seeds to be an *int64, avoiding the awkward validation/defaulting for a value of 0
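
With a pointer, defaulting can distinguish an unset field from an explicit value; a minimal sketch of what the defaulting then looks like (building on the earlier sketch):

    func SetDefaults_CassandraClusterNodePool(np *CassandraClusterNodePool) {
        // nil means seeds was omitted entirely; any explicit value is left alone.
        if np.Seeds == nil {
            one := int64(1)
            np.Seeds = &one
        }
    }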

@wallrj
Member

wallrj commented Mar 13, 2018

That 1.7 E2E test failure is a bit weird.

W0309 11:02:52.659] ++ kubectl run in-cluster-cmd-855 --namespace=test-cassandra-1520593007-14636 --image=cassandra:latest --restart=Never --rm --stdin=true --attach=true --quiet -- /usr/bin/cqlsh cass-test-cql 9042 --debug '--execute=SELECT * FROM space1.testtable1'
W0309 11:02:59.367] Connection error: ('Unable to connect to any servers', {'10.0.0.127': error(None, "Tried connecting to [('10.0.0.127', 9042)]. Last error: timed out")})
W0309 11:02:59.907] + actual=
W0309 11:02:59.907] + grep --quiet testvalue1
W0309 11:02:59.909] + local exit_code=1
W0309 11:02:59.909] ++ date +%s
W0309 11:02:59.910] + local current_time=1520593379
W0309 11:02:59.910] + local remaining_time=205
W0309 11:02:59.911] + [[ 205 -le 0 ]]
W0309 11:02:59.911] + local sleep_time=10
W0309 11:02:59.911] + [[ 205 -lt 10 ]]
W0309 11:02:59.911] + sleep 10
W0309 11:03:09.912] + stdout_matches testvalue1 cql_connect test-cassandra-1520593007-14636 cass-test-cql 9042 --debug '--execute=SELECT * FROM space1.testtable1'
W0309 11:03:09.912] + local expected=testvalue1
W0309 11:03:09.912] + shift
W0309 11:03:09.912] + local actual
W0309 11:03:09.912] ++ cql_connect test-cassandra-1520593007-14636 cass-test-cql 9042 --debug '--execute=SELECT * FROM space1.testtable1'
W0309 11:03:09.912] ++ local namespace=test-cassandra-1520593007-14636
W0309 11:03:09.912] ++ shift
W0309 11:03:09.913] ++ in_cluster_command test-cassandra-1520593007-14636 cassandra:latest /usr/bin/cqlsh cass-test-cql 9042 --debug '--execute=SELECT * FROM space1.testtable1'
W0309 11:03:09.913] ++ local namespace=test-cassandra-1520593007-14636
W0309 11:03:09.913] ++ shift
W0309 11:03:09.913] ++ local image=cassandra:latest
W0309 11:03:09.913] ++ shift
W0309 11:03:09.913] ++ kubectl run in-cluster-cmd-18383 --namespace=test-cassandra-1520593007-14636 --image=cassandra:latest --restart=Never --rm --stdin=true --attach=true --quiet -- /usr/bin/cqlsh cass-test-cql 9042 --debug '--execute=SELECT * FROM space1.testtable1'
W0309 15:03:13.029] + actual=
W0309 15:03:13.030] + grep --quiet testvalue1
W0309 15:03:13.033] + local exit_code=1
W0309 15:03:13.034] ++ date +%s
W0309 15:03:13.035] + local current_time=1520607793
W0309 15:03:13.035] + local remaining_time=-14209
W0309 15:03:13.035] + [[ -14209 -le 0 ]]
W0309 15:03:13.035] + return 1
W0309 15:03:13.035] + fail_test 'Cassandra data was lost'
W0309 15:03:13.035] + FAILURE_COUNT=1
W0309 15:03:13.035] + echo 'TEST FAILURE: Cassandra data was lost'

The command kubectl run in-cluster-cmd-18383 --namespace=test-cassandra-1520593007-14636 --image=cassandra:latest --restart=Never --rm --stdin=true --attach=true --quiet -- /usr/bin/cqlsh cass-test-cql 9042 --debug '--execute=SELECT * FROM space1.testtable1' appears to have hung from 11:03:09 until 15:03:13

@jetstack-ci-bot
Contributor

@kragniz PR needs rebase

pkg/util/util.go Outdated
@@ -11,3 +11,7 @@ func CalculateQuorum(num int32) int32 {
    }
    return (num / 2) + 1
}

func Int64Ptr(i int64) *int64 {
Contributor

Can be removed since #296

@jetstack-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: wallrj

Assign the PR to them by writing /assign @wallrj in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot
Collaborator

@kragniz: The following tests failed, say /retest to rerun them all:

Test name              | Commit  | Details | Rerun command
navigator-quick-verify | 3541a6d | link    | /test verify
navigator-e2e-v1-8     | 3541a6d | link    | /test e2e v1.8
navigator-e2e-v1-7     | 3541a6d | link    | /test e2e v1.7
navigator-e2e-v1-9     | 3541a6d | link    | /test e2e v1.9

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@munnerz munnerz added this to the v0.1 milestone Mar 27, 2018
@munnerz
Contributor

munnerz commented Mar 27, 2018

Should we merge this now, or wait until the actions stuff has merged and redesign this for that new structure? (targeting 0.2)

@jetstack-bot
Collaborator

@kragniz: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@munnerz munnerz modified the milestones: v0.1, v0.2 Apr 3, 2018
@wallrj wallrj removed this from the v0.2 milestone May 15, 2018