Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change information stored in _topic_node_index to avoid oversized alloc #17350

Merged
merged 2 commits into from
May 9, 2024

Conversation

ballard26
Copy link
Contributor

_topic_node_index originally stored a bunch of metadata about every group a node led. However, this information was only ever used in even_topic_distributon_constraint::recommended_reassignment. A method which isn't used anywhere outside of unit tests.

This PR changes the information stored in _topic_node_index to be just the count of leaders on a given node. And removes the implementation of even_topic_distributon_constraint::recommended_reassignment.

Fixes #17349

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Mar 23, 2024

}

return std::nullopt;
vassert(false, "not implemented");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you include a clear comment here about why this won't be called? Is it called on other types implementing this interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@BenPope BenPope Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERROR 2024-07-19 07:45:03,854 [shard 0:main] assert - Assert failure: (/var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-0269b6a11fdecc85f-1/redpanda/redpanda/src/v/cluster/scheduling/leader_balancer_constraints.cc:59) 'false' not implemented
 
thanks @BenPope looks like something changed, or the assumptions weren't correct initially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched a little but didn't see a ticket for this. is there one @BenPope?

Copy link
Member

@travisdowns travisdowns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like all the code removal, but can you elaborate on why you decided to take this route over just changing the std::vector to chunked_vector?

@ballard26 ballard26 requested a review from ztlpn March 25, 2024 15:36
@ballard26
Copy link
Contributor Author

ballard26 commented Mar 25, 2024

I like all the code removal, but can you elaborate on why you decided to take this route over just changing the std::vector to chunked_vector?

Sure, initially I did just blindly switch it over. Then I realized chunked_vector doesn't have a delete(iterator ..) implementation. So the option was to write an implementation of that or do something else. And even if I had switched over the balancer would still be wasting a lot of time and memory book keeping information it never uses.

I implemented recommend_reassignment for the topic aware constraint with the idea that it could be used as a speed improvement over just randomly selecting reassignments and seeing if they decrease error. In testing it did prove to more quickly decrease error. However, I wasn't able to come up with a good way to combine the recommend_reassignment methods from all constraints in a way that reduced error as far as just randomly selecting and checking reassignments. A big issue with using multiple recommend_reassignment functions is they can loop. I.e, one makes a change then the other reverts it. There are ways to break the loops, but as a result error is reduced far more slowly when compared to just using one recommend_reassignment function on its own. So it never actually was put into use. I still have the data and graphs I generated during the investigation back then if you're interested.

So with that in mind recommend_reassignment is basically dead code as is. And the only thing its doing is creating more book keeping work for the balancer.

Copy link
Contributor

@ztlpn ztlpn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, although I wonder what the plan for removing legacy code is. Are we waiting for a specific release when we'll be able to deprecate the greedy strategy?

_topic_node_index[topic_id][r.to.node_id].emplace_back(
moved_group_info.group_id, r.to, std::move(moved_group_info.replicas));
_topic_node_index.at(topic_id).at(r.from.node_id) -= 1;
_topic_node_index.at(topic_id).at(r.to.node_id) += 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this second at is supposed to be always successful. Isn't it possible that some nodes didn't have a leader of any partition of that topic previously?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should always be. even_topic_distributon_constraint::update_index is only used internally and the reassignment is validated by the callers before its called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, we initialize it on (new) line 79. But maybe it's easier for the reader to just use operator[]

_topic_node_index[topic_id][r.to.node_id].emplace_back(
moved_group_info.group_id, r.to, std::move(moved_group_info.replicas));
_topic_node_index.at(topic_id).at(r.from.node_id) -= 1;
_topic_node_index.at(topic_id).at(r.to.node_id) += 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, we initialize it on (new) line 79. But maybe it's easier for the reader to just use operator[]

@piyushredpanda
Copy link
Contributor

/ci-repeat

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Apr 5, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/47424#018eac40-c414-4630-a9ad-be99d0604a0e:

"rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes"

new failures in https://buildkite.com/redpanda/redpanda/builds/48832#018f5921-de84-48a7-a7d9-1f0fb99dc2f2:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48832#018f5912-cc81-4f75-9f78-52a8fcdd2b12:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48832#018f5921-de86-4508-bf5b-6d70c6a25e71:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48832#018f5912-cc7f-4db1-b622-860beaa08233:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

@ballard26
Copy link
Contributor Author

/ci-repeat

@piyushredpanda
Copy link
Contributor

Failures should be fixed by #18302

@piyushredpanda piyushredpanda merged commit 8c6bdf3 into redpanda-data:dev May 9, 2024
13 of 17 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@piyushredpanda
Copy link
Contributor

/backport v24.1.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Oversized allocation: 655360 bytes in even_topic_distributon_constraint::rebuild_indexes
7 participants