Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V1: Prevent configs from becoming outdated when scaling down #139

Merged
merged 3 commits into from
May 15, 2024

Conversation

chrisseto
Copy link
Contributor

@chrisseto chrisseto commented May 13, 2024

0dd40b5 move setup-envtest to nix

This commit moves the setup-envtest shell scripts downloaded by
taskfile to the go binary equivalent and instead utilizes nix to install
it.

c84a283 V1: Prevent configs from becoming outdated when scaling down

Previously a collection of configurations (Seed Brokers, PandaProxy
Client, and SchemaRegistry Client) could become out of date when scaling
a cluster down. For example 5 replicas to 3. A special case of the
config hasher would ignore these fields to prevent needless restarts in
the case of scaling up which left stale values when scaling down. These
stale configurations would prevent various components of redpanda from
working correctly.

To mitigate these issues, this commit:

  • Changes the PandaProxy and SchemaRegistry clients configurations to
    instead always point at redpanda's headless service.
  • Caps the list of seed brokers to a maximum of 3.

While these changes will prevent needless restarts in the future, the
upgrade to this new configuration WILL result in a single rolling
restart.

Fixes #125

@chrisseto chrisseto self-assigned this May 13, 2024
@chrisseto chrisseto changed the title move setup-envtest to nix V1: Prevent configs from becoming outdated when scaling down May 13, 2024

const maxSeedServers = 3

for i := int32(0); i < replicas && i < maxSeedServers; i++ {
Copy link
Member

@BenPope BenPope May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we get the set of replicas, are they in any particular order?

  1. When scaling down, will the first replicas in this list be the ones that remain?
  2. When doing a rolling upgrade of infrastructure (e.g., new K8s version), will this list be sufficient if the nodes are all going away?
  3. Is cr.EmptySeedStartsCluster = False?
  4. How does this work when there is multi-AZ?
  5. I don't remember what happens when I node in the seed_server list can't be found, is that why the list has been truncated to 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When scaling down, will the first replicas in this list be the ones that remain?

Yes, this list will always be replicas 0, 1, and 2. Thanks to the STS's ordering guarantees we can rely on them always being around.

When doing a rolling upgrade of infrastructure (e.g., new K8s version), will this list be sufficient if the nodes are all going away?

The listed nodes will always exist (again STS ordinal system). It should be fine as long as nodes are migrated one at a time and this wouldn't really function any differently than what was previously present. Especially in the case of 3 node clusters.

Is cr.EmptySeedStartsCluster = False?

That's handled on the line above and it looks like it's always set to false if it's a supported redpanda version.

How does this work when there is multi-AZ?

I don't really see how that's relevant but it would function the same way it previously worked. This logic only capped the length of the list. The contents is otherwise computed the exact same way.

@chrisseto chrisseto force-pushed the chris/fix-scale-in branch 2 times, most recently from 17047b4 to 805a5c2 Compare May 13, 2024 21:20
@chrisseto chrisseto force-pushed the chris/fix-scale-in branch 2 times, most recently from 62b4fff to 1167565 Compare May 14, 2024 19:26
@chrisseto
Copy link
Contributor Author

The failure is from decommission-on-delete :/ I'll try running it a few more times but It might be time to skip that test and toss an issue in the backlog.

This commit moves the `setup-envtest` shell scripts downloaded by
taskfile to the go binary equivalent and instead utilizes nix to install
it.
Previously a collection of configurations (Seed Brokers, PandaProxy
Client, and SchemaRegistry Client) could become out of date when scaling
a cluster down. For example 5 replicas to 3. A special case of the
config hasher would ignore these fields to prevent needless restarts in
the case of scaling up which left stale values when scaling down. These
stale configurations would prevent various components of redpanda from
working correctly.

To mitigate these issues, this commit:
- Changes the PandaProxy and SchemaRegistry clients configurations to
instead always point at redpanda's headless service.
- Caps the list of seed brokers to a maximum of 3.

While these changes will prevent needless restarts in the future, the
upgrade to this new configuration WILL result in a single rolling
restart.

Fixes #125
Current tests are asserting old behavior of brokers list defined in pandaproxy
client. With this commit the headless service address will be compared with
Redpanda configuration.
@RafalKorepta
Copy link
Contributor

Currently kuttl/harness/additional-configuration end to end test is failing.

https://buildkite.com/redpanda/redpanda-operator/builds/1440#018f7b4f-7f3a-45da-80a2-cf015134e2a2/1262-4470

logger.go:42: 09:41:07 \| additional-configuration/3-verify-config-changed \| Difference:
--
  | logger.go:42: 09:41:07 \| additional-configuration/3-verify-config-changed \| 13c13
  | logger.go:42: 09:41:07 \| additional-configuration/3-verify-config-changed \| <   - address: additional-configuration.redpanda-system.svc.cluster.local.
  | logger.go:42: 09:41:07 \| additional-configuration/3-verify-config-changed \| ---
  | logger.go:42: 09:41:07 \| additional-configuration/3-verify-config-changed \| >   - address: additional-configuration-0.additional-configuration.redpanda-system.svc.cluster.local.

It should be fixed with 1844219

@chrisseto
Copy link
Contributor Author

Dang! I missed that one. Thanks for fixing!

@chrisseto chrisseto merged commit ad3ea63 into main May 15, 2024
5 checks passed
@chrisseto chrisseto deleted the chris/fix-scale-in branch May 15, 2024 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scaling down cluster does not update the node config
3 participants