Require conn_quota::units stay on a single shard #10681
Please correct me if I'm wrong, but do_put() explicitly supports returning units cross-shard. If there is a suspicion that this is somehow correlated with the segfaults, maybe it's safer to add warnings/errors to the log about that.
While conn_quota supports cross-shard operation, it does not support being used on a shard other than its home shard. The stack trace in the linked bug makes it look like total_home is a bad pointer or is pointing to junk. conn_quota::units captures a reference to a single shard's conn_quota, which is one way the stack trace could happen. total_home is only defined on shard0, so if another shard's service is called on shard0, it will assume that total_home is a valid pointer and then dereference nullptr. The log statement I added should help determine this too.
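To make the failure mode described above concrete, here is a minimal sketch, not Redpanda's actual code. The names quota_service, home_state, and the explicit `current` shard argument are hypothetical stand-ins; the only detail taken from the discussion is that the home-only pointer is populated on shard 0 and is null everywhere else.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for a per-shard service; not Redpanda's conn_quota.
using shard_id = unsigned;

struct home_state {
    std::int64_t total_allowance = 0;
};

class quota_service {
public:
    quota_service(shard_id owner, std::int64_t allowance)
      : _owner(owner) {
        // Assumption from the discussion: only shard 0's instance owns
        // the home state; every other instance keeps a null pointer.
        if (owner == 0) {
            _total_home = std::make_unique<home_state>();
            _total_home->total_allowance = allowance;
        }
    }

    // `current` is the shard the caller is running on. Without these
    // checks, calling shard 1's instance while running on shard 0 would
    // pass an "am I on shard 0?" test and then dereference a null pointer.
    std::int64_t total_allowance(shard_id current) const {
        if (current != _owner) {
            throw std::logic_error("service used off its home shard");
        }
        if (!_total_home) {
            throw std::logic_error("no home state on this shard");
        }
        return _total_home->total_allowance;
    }

private:
    shard_id _owner;
    std::unique_ptr<home_state> _total_home;
};
```

In this sketch the shard mismatch is reported as an exception instead of a crash, which is the kind of explicit contract check the comment argues for.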
I have a theory that total_home (only set on shard0's service) could be accessed through a service from another shard, but on shard0.
Do you have a candidate execution schedule that would lead to the violation? I recall thinking the same thing a long time ago, but I was never able to find a counterexample. I still remain a bit suspicious.
The stack trace in #10544 (comment) hints, judging from the logs, that the total_home pointer on shard0 is bad. We also grab a reference in units to the local shard's conn_quota, so it seems like this could be a smoking gun. From my reading of the code, however, it doesn't look like we switch shards.
This is wild. Do you think maybe we aren't switching shards, but a pointer is leaking across shards somehow?
Hrm, that seems unlikely? I'm not thinking that directly, I suppose. This is more to prevent this from accidentally happening in the future, and to guard against any weird bugs here.
Upgrades oncore and vlog to use std::source_location, removing __FILE__ and __LINE__ macro usage. This means it's easier to use oncore without a macro. Signed-off-by: Tyler Rockwood <rockwood@redpanda.com>
This makes the API contract more explicit, since we now require that connection quota units don't change shards. Related: redpanda-data#10544 Signed-off-by: Tyler Rockwood <rockwood@redpanda.com>
I have a theory that total_home (only set on shard0's service) could be accessed through a service from another shard, but on shard0. We should add logs for what total_allowance is, because that's what the stack trace reported as crashing. Signed-off-by: Tyler Rockwood <rockwood@redpanda.com>
lgtm!
This makes the API contract more explicit, since we now require that
connection quota units don't change shards.
Related: #10544
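
One way to picture the "units stay on a single shard" contract is a move-only handle that remembers the shard it was created on and refuses to be released elsewhere. The sketch below is an assumption-laden illustration, not the change in this PR: pinned_units and current_shard are hypothetical, and current_shard merely stands in for whatever the runtime uses to identify the executing shard.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the runtime's "which shard am I on" query.
static thread_local unsigned current_shard = 0;

class pinned_units {
public:
    // Records the creating shard as the home shard.
    explicit pinned_units(int count)
      : _count(count)
      , _home(current_shard) {}

    // Moving keeps the home shard; the moved-from handle holds nothing.
    pinned_units(pinned_units&& o) noexcept
      : _count(std::exchange(o._count, 0))
      , _home(o._home) {}

    pinned_units(const pinned_units&) = delete;
    pinned_units& operator=(const pinned_units&) = delete;
    pinned_units& operator=(pinned_units&&) = delete;

    // True when the caller runs on the shard that created the units.
    bool on_home_shard() const { return _home == current_shard; }

    // Hands the units back; throws instead of touching another
    // shard's state when called off the home shard.
    int release() {
        if (!on_home_shard()) {
            throw std::logic_error("units released off their home shard");
        }
        return std::exchange(_count, 0);
    }

private:
    int _count;
    unsigned _home;
};
```

A failed release leaves the count untouched, so the units can still be returned once execution is back on the home shard.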