-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[client]: Fix for incorrectly dropped consensus messages (#11082) #11086
[client]: Fix for incorrectly dropped consensus messages (#11082) #11086
Conversation
It looks like @dforsten signed our Contributor License Agreement. 👍 Many thanks, Parity Technologies CLA Bot |
} | ||
|
||
impl IoChannelQueue { | ||
pub fn new(limit: usize) -> Self { | ||
let limit = i64::try_from(limit).unwrap_or(i64::max_value()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider changing the input arg to isize
rather than casting it? Seems like there are 3 call sites in total (in Client::new()
) so perhaps changing the type is not too bad?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the input arg type does spread into a couple of corners in the code, especially into client settings.
It is an option of course - on the other hand it should only be configurable as a positive number, making the setting itself a signed integer might be confusing to the user.
Using a signed integer in IoChannelQueue is a local implementation strategy to ensure the queue size counter checks work correctly in a multi-threaded environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the other hand it should only be configurable as a positive number, making the setting itself a signed integer might be confusing to the user.
That could be handled when reading the user provided config, i.e. continue using an unsigned integer in configs and convert to signed there rather than for each message.
currently_queued: Arc<AtomicUsize>, | ||
limit: usize, | ||
currently_queued: Arc<AtomicI64>, | ||
limit: i64, | ||
} | ||
|
||
impl IoChannelQueue { | ||
pub fn new(limit: usize) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not your code, but maybe this can be a private method?
Could you elaborate how's that possible? I thought if Using a signed integer is a clever trick, but seems more like a hack than a proper solution to me. |
Have you been able to reproduce the issue with the initial solution? |
Reading the implementation of send(msg) I am not at all certain that if it returns an error the message will not be executed. If there is an absolute guarantee that this is the case then we can implement the solution I originally suggested. |
Looking at the send() impl it seems like one of the two code paths – In other words: I think your original suggestion was the right one. |
The code section I am most unsure if it will return an error even if the message will eventually be sent is this one in channel.rs of mio:
|
Help me check if I understand this right: Afaict if When I think I understand what you mean by the original solution not being the right one: we'd increment before sending, send the message and still get an error back. Now we'd do two decrements (one from the message closure and the other in the Sounds like your second solution with the signed int is the least worst. :/ |
With this PR we could end up with negative |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long-term we want to rewrite the whole mio-thing, and since we're not confident try_send
won't execute the message on error, it seems like the simplest workaround atm.
@@ -2739,12 +2740,13 @@ fn transaction_receipt( | |||
|
|||
/// Queue some items to be processed by IO client. | |||
struct IoChannelQueue { | |||
currently_queued: Arc<AtomicUsize>, | |||
limit: usize, | |||
currently_queued: Arc<AtomicI64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add docs why we're using signed integers here, so it won't get accidentally reverted in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added documentation to the currently_queued
field to clarify why signed integers are being used here.
Indeed. And on that note it's a bit weird to send in an "item count" from the outside to the queue. The queue itself should keep count of how many messages it has pending/sent/failed etc. The only callsite where |
Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.
The only effect a negative number would have is that the queue size limit would effectively become larger. Before this fix it would straight away fail in that case, and not forward consensus messages at all. |
Yes, I am clear on that. It'd still be good to avoid negative numbers I think. If you disagree, do you mind explaining why? |
Negative numbers could appear only temporarily due to race condition, and if we reset them to zero, |
But with this in mind, both the old and the signed implementation are still incorrect and could underflow. |
That is correct, if send() returns an Error without dropping the message the counter will decrement by one each time. With unsigned numbers the effect is an immediate underflow, with 64bit signed numbers it will take a very long time until it underflows. If that case happens a lot or constantly the validator node should be considered faulty though since consensus messages will likely take a long/indefinite time to be processed. |
Is there still anything I should change on my end to get this PR merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of a hack, but it is (much) better than what we have.
Thank you for taking the time, much appreciated.
* ropsten #6631425 foundation #8798209 (#11201) * [stable] builtin, istanbul and mordor testnet backports (#11234) * ethcore-builtin (#10850) * [builtin]: support `multiple prices and activations` in chain spec (#11039) * [chain specs]: activate `Istanbul` on mainnet (#11228) * ethcore/res: add mordor testnet configuration (#11200) * Update list of bootnodes for xDai chain (#11236) * ethcore: remove `test-helper feat` from build (#11047) * Secret store: fix Instant::now() related race in net_keep_alive (#11155) (#11159) * [stable]: backport #10691 and #10683 (#11143) * Fix compiler warning (that will become an error) (#10683) * Refactor Clique stepping (#10691) * Add Constantinople eips to the dev (instant_seal) config (#10809) * Add cargo-remote dir to .gitignore (?) * Insert explicit warning into the panic hook (#11225) * Fix docker centos build (#11226) * Update MIX bootnodes. (#11203) * Use provided usd-per-eth value if an endpoint is specified (#11209) * Add new line after writing block to hex file. (#10984) * Type annotation for next_key() matching of json filter options (#11192) (but no `FilterOption` in 2.5 so…) * Upgrade jsonrpc to latest (#11206) * [CI] check evmbin build (#11096) * Correct EIP-712 encoding (#11092) * [client]: Fix for incorrectly dropped consensus messages (#11086) * Fix block detail updating (#11015) * Switching sccache from local to Redis (#10971) * Made ecrecover implementation trait public (#11188) * [dependencies]: jsonrpc `14.0.1` (#11183) * [receipt]: add `sender` & `receiver` to `RichReceipts` (#11179) * [ethcore/builtin]: do not panic in blake2pricer on short input (#11180) * util Host: fix a double Read Lock bug in fn Host::session_readable() (#11175) * ethcore client: fix a double Read Lock bug in fn Client::logs() (#11172) * Change how RPCs eth_call and eth_estimateGas handle "Pending" (#11127) * Cleanup stratum a bit (#11161) * Upgrade to jsonrpc v14 (#11151) * SecretStore: expose restore_key_public in HTTP API (#10241)
* [CI] check evmbin build (#11096) * Correct EIP-712 encoding (#11092) * [client]: Fix for incorrectly dropped consensus messages (#11082) (#11086) * Update hardcoded headers (foundation, classic, kovan, xdai, ewc, ...) (#11053) * Add cargo-remote dir to .gitignore (?) * Update light client headers: ropsten 6631425 foundation 8798209 (#11201) * Update list of bootnodes for xDai chain (#11236) * ethcore/res: add mordor testnet configuration (#11200) * [chain specs]: activate Istanbul on mainnet (#11228) * [builtin]: support multiple prices and activations in chain spec (#11039) * [receipt]: add sender & receiver to RichReceipts (#11179) * [ethcore/builtin]: do not panic in blake2pricer on short input (#11180) * Made ecrecover implementation trait public (#11188) * Fix docker centos build (#11226) * Update MIX bootnodes. (#11203) * Insert explicit warning into the panic hook (#11225) * Use provided usd-per-eth value if an endpoint is specified (#11209) * Cleanup stratum a bit (#11161) * Add Constantinople EIPs to the dev (instant_seal) config (#10809) (already backported) * util Host: fix a double Read Lock bug in fn Host::session_readable() (#11175) * ethcore client: fix a double Read Lock bug in fn Client::logs() (#11172) * Type annotation for next_key() matching of json filter options (#11192) * Upgrade jsonrpc to latest (#11206) * [dependencies]: jsonrpc 14.0.1 (#11183) * Upgrade to jsonrpc v14 (#11151) * Switching sccache from local to Redis (#10971) * Snapshot restoration overhaul (#11219) * Add new line after writing block to hex file. (#10984) * Pause pruning while snapshotting (#11178) * Change how RPCs eth_call and eth_estimateGas handle "Pending" (#11127) * Fix block detail updating (#11015) * Make InstantSeal Instant again #11186 * Filter out some bad ropsten warp snapshots (#11247)
Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.
The initial solution suggested in #11082 was dropped in favor of converting the atomic counter to a signed integer for robustness. There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.
The fix requires a slight limitation of the maximum size of the queue, a 63bit positive integer range should however be well large enough for its purposes.