
add validation code to check replicas for quorum loss #102

Open
wants to merge 11 commits into base: main
Conversation

@alejandroEsc (Contributor) commented Mar 26, 2024

Checking if quorum will be lost if replica count changes.

TODO:

  • test (kuttl test on a few well-known values for replicas)

https://github.com/orgs/redpanda-data/projects/54/views/8?pane=issue&itemId=56191515

@alejandroEsc added the bug ("Something isn't working") label on Mar 26, 2024
@alejandroEsc self-assigned this on Mar 26, 2024
@chrisseto (Contributor)

Could we instead rely on Redpanda to apply back pressure to rpk decommission? IIUC, we'd need to worry about more than just quorum, but also replication counts. If you have a cluster of 5 brokers and set your topics to be replicated 5 times, you wouldn't be able to scale in to 4, yet this check would still pass. I am assuming that the decommission command will refuse to decommission a node if it would risk losing quorum 😅

@alejandroEsc (Contributor, Author)

Could we instead rely on Redpanda to apply back pressure to rpk decommission? IIUC, we'd need to worry about more than just quorum, but also replication counts. If you have a cluster of 5 brokers and set your topics to be replicated 5 times, you wouldn't be able to scale in to 4, yet this check would still pass. I am assuming that the decommission command will refuse to decommission a node if it would risk losing quorum 😅

Oh interesting, that sounds like another problem, one associated with replication of topic partitions. My understanding of the problem was losing quorum of the admin API. But this also sounds legit, though I think Raft is also implemented on the partitions, so you could replicate to 5, scale down to 4, and still have quorum over your partitions. I suggest that if we also want to account for replication of topics, we treat that as a separate check, to avoid conflating the two issues. The latter check probably belongs in the decommission controller though, since there you have visibility over the scaling.

Note that the decommission controller is a clean-up process: it does not stop you from scaling Redpanda, but cleans up after Redpanda by decommissioning, on your behalf, brokers that have already been turned off by the scaling. Here, the best we can do for partitions is try to ensure that quorum can be recovered based on the desired partition replicas (the largest of the minimums required for quorum wins), essentially avoiding shooting yourself in the foot.
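
(For illustration only: a minimal Go sketch of the kind of majority check being discussed, assuming the validation compares the requested replica count against a Raft majority of the current count. The package and function names are assumptions, not code from this PR.)

package validation

import "fmt"

// minForQuorum returns the smallest number of voters that still forms a Raft
// majority for a group of n members.
func minForQuorum(n int32) int32 {
	return n/2 + 1
}

// checkReplicaChange rejects a requested replica count that would fall below
// a majority of the current replica count.
func checkReplicaChange(currentReplicas, requestedReplicas int32) error {
	if need := minForQuorum(currentReplicas); requestedReplicas < need {
		return fmt.Errorf("requested replicas of %d is less than %d needed to maintain quorum", requestedReplicas, need)
	}
	return nil
}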

@chrisseto (Contributor)

Gotcha! As long as the decommission controller is a clean-up process, we'll have to lean on validation. Ideally, we'd be able to run decommission on however many nodes ahead of changing the StatefulSet's number of replicas. My fear with validation is that we'll accidentally lock ourselves out of some operations in a disaster-recovery scenario. We can always add a backdoor in the future if need be, though.

@RafalKorepta (Contributor)

I agree with @chrisseto. The decommission process is the right place to get a signal from core Redpanda that scaling down is not possible. In the operator v1 logic, the decommission is performed before scaling down, not after. That way we will never break producers that need to satisfy the topic replication factor.

@alejandroEsc (Contributor, Author)

I agree with @chrisseto. The decommission process is the right place to get a signal from core Redpanda that scaling down is not possible. In the operator v1 logic, the decommission is performed before scaling down, not after. That way we will never break producers that need to satisfy the topic replication factor.

That seems outside the scope of what was said on the ticket. I can add the logic here if you want and have two places where we take care of quorum issues, no problem. The first part, already here, basically ensures that the admin API does not lose quorum; this was discussed in a private conversation with @RafalKorepta. The second part will try to consider quorum based on partitions of topics, and will be part of the decommission controller; here the best we can do is not decommission nodes if this would push any partition out of quorum, and we will re-queue instead, giving the user the opportunity to fix their issue.
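
(Illustrative only: a minimal controller-runtime sketch of the "re-queue instead of decommission" behaviour described above. The DecommissionReconciler type, the wouldLoseQuorum helper, and the hard-coded inputs are hypothetical placeholders, not code from this PR.)

package redpanda

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// DecommissionReconciler stands in for the real decommission controller.
type DecommissionReconciler struct{}

// wouldLoseQuorum reports whether scaling to desiredBrokers would leave some
// topic partition unable to form a majority of its desired replica set
// ("largest of the minimums required for quorum wins").
func wouldLoseQuorum(desiredBrokers int32, replicationFactors []int32) bool {
	for _, rf := range replicationFactors {
		if desiredBrokers < rf/2+1 {
			return true
		}
	}
	return false
}

// Reconcile re-queues instead of decommissioning when quorum would be lost.
func (r *DecommissionReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	desiredBrokers := int32(4)          // would come from the StatefulSet spec
	replicationFactors := []int32{3, 5} // would come from topic metadata
	if wouldLoseQuorum(desiredBrokers, replicationFactors) {
		// Do not decommission; retry later so the user can fix the conflict first.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}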

@RafalKorepta (Contributor)

here the best we can do is not decommission nodes if this would push any partition out of quorum, and we will re-queue instead, giving the user the opportunity to fix their issue.

I'm not sure I understand what you are trying to say. By "here" do you mean the Redpanda controller?

P.S. Could you add a test for this feature?

@alejandroEsc (Contributor, Author)

here the best we can do is not decommission nodes if this would push any partition out of quorum, and we will re-queue instead, giving the user the opportunity to fix their issue.

I'm not sure I understand what you are trying to say. By "here" do you mean the Redpanda controller?

P.S. Could you add a test for this feature?

"Here" means this PR.

For the P.S., read the TODO in the description.

@RafalKorepta (Contributor)

The tests validate that the Redpanda custom resource is working. I'm afraid that, due to the very simple values, the post-install or post-upgrade Helm jobs would not give us confidence that the Redpanda cluster itself is working correctly. The Redpanda controller does not verify the health of the cluster; it only cares about Helm chart install/upgrade success.

@RafalKorepta (Contributor)

In the Buildkite CI I saw a few errors:

internal/controller/redpanda/redpanda_controller.go:915:10: err113: do not define dynamic errors, use wrapped static errors instead: "fmt.Errorf(\"could not unmarshal values data to validate helmrelease\")" (goerr113)
		return fmt.Errorf("could not unmarshal values data to validate helmrelease")
		       ^
internal/controller/redpanda/redpanda_controller.go:933:10: err113: do not define dynamic errors, use wrapped static errors instead: "fmt.Errorf(\"requested replicas of %d is less than %d neeed to maintain quorum\", requestedReplicas, minForQuorum)" (goerr113)
		return fmt.Errorf("requested replicas of %d is less than %d neeed to maintain quorum", requestedReplicas, minForQuorum)
		       ^
internal/controller/redpanda/redpanda_controller.go:41:2: SA1019: "k8s.io/utils/pointer" is deprecated: Use functions in k8s.io/utils/ptr instead: ptr.To to obtain a pointer, ptr.Deref to dereference a pointer, ptr.Equal to compare dereferenced pointers. (staticcheck)
	"k8s.io/utils/pointer"
	^

https://buildkite.com/redpanda/redpanda-operator/builds/1069#018e8043-2481-46ff-9571-76927c63d83b/242-309

    logger.go:42: 15:08:01 | disable-helm-controllers/2-create-redpanda | test step failed 2-create-redpanda

https://buildkite.com/redpanda/redpanda-operator/builds/1069#018e8043-2484-47e9-b811-79da68391839/240-896

The K8s Operator v2 Helm test suite is failing all over the place because it cannot install the Redpanda Helm chart.
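
(A minimal sketch, for illustration, of how the goerr113 findings above are usually resolved: define static sentinel errors and wrap them with %w when dynamic detail is needed. The names are assumptions, not code from this PR.)

package redpanda

import (
	"errors"
	"fmt"
)

// Static sentinel errors satisfy err113/goerr113.
var (
	errUnmarshalValues   = errors.New("could not unmarshal values data to validate helmrelease")
	errQuorumWouldBeLost = errors.New("requested replicas would not maintain quorum")
)

func validateReplicas(requestedReplicas, minForQuorum int32) error {
	if requestedReplicas < minForQuorum {
		// Wrap the static error to attach the dynamic values.
		return fmt.Errorf("%w: requested %d, need at least %d", errQuorumWouldBeLost, requestedReplicas, minForQuorum)
	}
	return nil
}

The staticcheck finding is typically addressed by importing k8s.io/utils/ptr and using ptr.To / ptr.Deref in place of the deprecated k8s.io/utils/pointer helpers, as the deprecation notice quoted above suggests.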

@alejandroEsc changed the title from "[WIP] add validation code to check replicas for quorum loss" to "add validation code to check replicas for quorum loss" on Apr 2, 2024
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Labels: bug (Something isn't working)
Projects: None yet
4 participants