
add validation code to check replicas for quorum loss #102

Open
wants to merge 11 commits into base: main
Conversation

@alejandroEsc (Contributor) commented Mar 26, 2024

Checking if quorum will be lost if replica count changes.

TODO:

  • test (kuttl test on a few well-known values for replicas)

https://github.com/orgs/redpanda-data/projects/54/views/8?pane=issue&itemId=56191515

@alejandroEsc added the bug ("Something isn't working") label on Mar 26, 2024
@alejandroEsc self-assigned this on Mar 26, 2024
@chrisseto (Contributor)

Could we instead rely on Redpanda to apply back pressure to rpk decommission? IIUC, we'd need to worry about more than just quorum, but also replication counts. If you have a cluster of 5 brokers and set your topics to be replicated 5 times, you wouldn't be able to scale in to 4, yet this check would still pass. I am assuming that the decommission command will refuse to decommission a node if it would risk losing quorum 😅

@alejandroEsc (Contributor, Author)

Could we instead rely on Redpanda to apply back pressure to rpk decommission? IIUC, we'd need to worry about more than just quorum, but also replication counts. If you have a cluster of 5 brokers and set your topics to be replicated 5 times, you wouldn't be able to scale in to 4, yet this check would still pass. I am assuming that the decommission command will refuse to decommission a node if it would risk losing quorum 😅

Oh interesting, that sounds like another problem, one associated with replication of topic partitions. My understanding of the problem was losing quorum of the admin API. But this also sounds legit, though I think Raft is also implemented on the partitions, so you could replicate to 5, scale down to 4, and still have quorum over your partitions. I suggest that if we also want to account for replication of topics, we treat that as a separate check, to avoid conflating the two issues. The latter check probably belongs in the decommission controller though, since there you have visibility over the scaling.

Note that the decommission controller is a clean-up process: it does not stop you from scaling Redpanda, but cleans up after Redpanda by decommissioning, on your behalf, brokers that have already been turned off by the scaling. Here, the best we can do for partitions is try to ensure that quorum can be recovered based on the desired partition replicas (the largest of the minimums required for quorum wins), essentially avoiding shooting yourself in the foot.
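
(For illustration only: a minimal Go sketch of the kind of majority check being discussed, assuming the validation compares the requested replica count against a Raft majority of the current count. The package and function names are assumptions, not code from this PR.)

package validation

import "fmt"

// minForQuorum returns the smallest number of voters that still forms a Raft
// majority for a group of n members.
func minForQuorum(n int32) int32 {
	return n/2 + 1
}

// checkReplicaChange rejects a requested replica count that would fall below
// a majority of the current replica count.
func checkReplicaChange(currentReplicas, requestedReplicas int32) error {
	if need := minForQuorum(currentReplicas); requestedReplicas < need {
		return fmt.Errorf("requested replicas of %d is less than %d needed to maintain quorum", requestedReplicas, need)
	}
	return nil
}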

@chrisseto (Contributor)

Gotcha! As long as the decommission controller is a clean-up process, we'll have to lean on validation. Ideally, we'd be able to run decommission on however many nodes ahead of changing the StatefulSet's number of replicas. My fear with validation is that we'll accidentally lock ourselves out of some operations in a disaster-recovery scenario. We can always add a backdoor in the future if need be, though.

@RafalKorepta (Contributor)

I agree with @chrisseto. The decommission process is the right place to get a signal from core Redpanda that scaling down is not possible. In the operator v1 logic, the decommission is performed before scaling down, not after. That way we will never break producers that need to satisfy the topic replication factor.

@alejandroEsc (Contributor, Author)

I agree with @chrisseto. The decommission process is the right place to get a signal from core Redpanda that scaling down is not possible. In the operator v1 logic, the decommission is performed before scaling down, not after. That way we will never break producers that need to satisfy the topic replication factor.

That seems outside the scope of what was said on the ticket. I can add the logic here if you want and have two places where we take care of quorum issues, no problem. The first part, already here, basically ensures that the admin API does not lose quorum; this was discussed in a private conversation with @RafalKorepta. The second part will try to consider quorum based on partitions of topics, and will be part of the decommission controller; here the best we can do is not decommission nodes if this would push any partition out of quorum, and we will re-queue instead, giving the user the opportunity to fix their issue.
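
(Illustrative only: a minimal controller-runtime sketch of the "re-queue instead of decommission" behaviour described above. The DecommissionReconciler type, the wouldLoseQuorum helper, and the hard-coded inputs are hypothetical placeholders, not code from this PR.)

package redpanda

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// DecommissionReconciler stands in for the real decommission controller.
type DecommissionReconciler struct{}

// wouldLoseQuorum reports whether scaling to desiredBrokers would leave some
// topic partition unable to form a majority of its desired replica set
// ("largest of the minimums required for quorum wins").
func wouldLoseQuorum(desiredBrokers int32, replicationFactors []int32) bool {
	for _, rf := range replicationFactors {
		if desiredBrokers < rf/2+1 {
			return true
		}
	}
	return false
}

// Reconcile re-queues instead of decommissioning when quorum would be lost.
func (r *DecommissionReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	desiredBrokers := int32(4)          // would come from the StatefulSet spec
	replicationFactors := []int32{3, 5} // would come from topic metadata
	if wouldLoseQuorum(desiredBrokers, replicationFactors) {
		// Do not decommission; retry later so the user can fix the conflict first.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}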

@RafalKorepta (Contributor)

here the best we can do is not decommission nodes if this would push any partition out of quorum, and we will re-queue instead, giving the user the opportunity to fix their issue.

I'm not sure I understand what you are trying to say. By "here" do you mean the Redpanda controller?

P.S. Could you add a test for this feature?

@alejandroEsc (Contributor, Author)

here the best we can do is not decommission nodes if this would push any partition out of quorum, and we will re-queue instead, giving the user the opportunity to fix their issue.

I'm not sure I understand what you are trying to say. By "here" do you mean the Redpanda controller?

P.S. Could you add a test for this feature?

"Here" means this PR.

For the P.S., read the TODO in the description.

@RafalKorepta (Contributor)

The tests validate that the Redpanda custom resource is working. I'm afraid that, due to the very simple values, the post-install or post-upgrade Helm jobs would not give us confidence that the Redpanda cluster itself is working correctly. The Redpanda controller does not verify the health of the cluster; it only cares about Helm chart install/upgrade success.

@RafalKorepta (Contributor)

In the Buildkite CI I saw a few errors:

internal/controller/redpanda/redpanda_controller.go:915:10: err113: do not define dynamic errors, use wrapped static errors instead: "fmt.Errorf(\"could not unmarshal values data to validate helmrelease\")" (goerr113)
		return fmt.Errorf("could not unmarshal values data to validate helmrelease")
		       ^
internal/controller/redpanda/redpanda_controller.go:933:10: err113: do not define dynamic errors, use wrapped static errors instead: "fmt.Errorf(\"requested replicas of %d is less than %d neeed to maintain quorum\", requestedReplicas, minForQuorum)" (goerr113)
		return fmt.Errorf("requested replicas of %d is less than %d neeed to maintain quorum", requestedReplicas, minForQuorum)
		       ^
internal/controller/redpanda/redpanda_controller.go:41:2: SA1019: "k8s.io/utils/pointer" is deprecated: Use functions in k8s.io/utils/ptr instead: ptr.To to obtain a pointer, ptr.Deref to dereference a pointer, ptr.Equal to compare dereferenced pointers. (staticcheck)
	"k8s.io/utils/pointer"
	^

https://buildkite.com/redpanda/redpanda-operator/builds/1069#018e8043-2481-46ff-9571-76927c63d83b/242-309

    logger.go:42: 15:08:01 | disable-helm-controllers/2-create-redpanda | test step failed 2-create-redpanda

https://buildkite.com/redpanda/redpanda-operator/builds/1069#018e8043-2484-47e9-b811-79da68391839/240-896

The K8s Operator v2 Helm test suite is failing all over the place because it cannot install the Redpanda Helm chart.
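
(A minimal sketch, for illustration, of how the goerr113 findings above are usually resolved: define static sentinel errors and wrap them with %w when dynamic detail is needed. The names are assumptions, not code from this PR.)

package redpanda

import (
	"errors"
	"fmt"
)

// Static sentinel errors satisfy err113/goerr113.
var (
	errUnmarshalValues   = errors.New("could not unmarshal values data to validate helmrelease")
	errQuorumWouldBeLost = errors.New("requested replicas would not maintain quorum")
)

func validateReplicas(requestedReplicas, minForQuorum int32) error {
	if requestedReplicas < minForQuorum {
		// Wrap the static error to attach the dynamic values.
		return fmt.Errorf("%w: requested %d, need at least %d", errQuorumWouldBeLost, requestedReplicas, minForQuorum)
	}
	return nil
}

The staticcheck finding is typically addressed by importing k8s.io/utils/ptr and using ptr.To / ptr.Deref in place of the deprecated k8s.io/utils/pointer helpers, as the deprecation notice quoted above suggests.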

@alejandroEsc changed the title from "[WIP] add validation code to check replicas for quorum loss" to "add validation code to check replicas for quorum loss" on Apr 2, 2024
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Labels: bug (Something isn't working)
Projects: None yet
4 participants