-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retransmission with state counters #181
Retransmission with state counters #181
Conversation
7657d11
to
96e6c1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On an initial pass, I think this works. It's also much simpler than any of the prior approaches, and I generally think that when designing protocols it's better to be explicit when possible.
I may be able to take a stab at implementing this sometime near the tail end of the next week. When I get to that point, I'll report back with any issues or edge cases I ran into during the implementation process.
Thanks for speccing this out!
* `update_` messages: acknowledged by `revoke_and_ack`. | ||
* `commitment_signed`: acknowledged by `revoke_and_ack`. | ||
* `revoke_and_ack`: acknowledged by `commitment_signed` or `closing_signed` | ||
* `shutdown`: acknowledged by `closing_signed`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we no longer care about retransmitting shutdown messages? I guess it wasn't really necessary as upon reconnection, if one still wishes to close down the channel, the should just send another shutdown
message and re-start the fee negotiation workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, this should be explicit. We could add it in to the commit counter, but I prefer the simplicity of "always retransmit" and associated "ignore duplicates".
@@ -919,18 +919,24 @@ independent of particular channels; their transmission requirements | |||
are covered there, and other than being transmitted after `init` (like | |||
any message), they are independent of requirements here. | |||
|
|||
1. type: 136 (`channel_reestablish`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still reading this diff, but so far: I dig this! Such an explicit message helps to sidestep the "The Uncertain Signature Problem" all together.
02-peer-protocol.md
Outdated
`commitment_signed` messages the receiving node has sent, it MUST send | ||
a set of `update_` messages against the prior commitment transaction | ||
followed by `commitment_signed`, otherwise if `commitments_received` | ||
is not equal to the number of `commitment_signed` messages the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this indicates that the channel states have desynchronized, or the transmitting party had persistence corruption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, something's badly wrong. It's also a big hint that you only have to handle the two cases, and only then until you've received the revoke_and_ack.
If we add multiple commits-in-flight, this "one less" becomes "N less".
02-peer-protocol.md
Outdated
|
||
If `commitments_received` is one less than the number of | ||
`commitment_signed` messages the receiving node has sent, it MUST send | ||
a set of `update_` messages against the prior commitment transaction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean only the outbound send of update_*
messages, or both inbound (updates I've made) and outbound (updates you've made)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outbound. If you've missed inbound, the other node will re-xmit because the commitments_received
you send is one too low.
02-peer-protocol.md
Outdated
`commitments_received` is not equal to the commitment number of the | ||
last `commitment_signed` message the receiving node has sent, it | ||
SHOULD fail the channel. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the receiving node does not remember the updates it has sent ? sending a commitment_signed
would be illegal I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always resend updates then commitment signed. And counters in reestablish msg make it clear whether it received previous or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that:
"it MUST send a set of update_
messages against the prior commitment transaction followed by commitment_signed
"
is actually:
"it MUST re-send the same update_
messages that lead to the prior commitment, followed by the same commitment_signed
?
This is particularly important if a node does not simply retransmit the exact same
update_
messages as previously sent.
So no you don't mean that!
If we are permissive on what the replacing commitment might be, why don't we just say:
"If commitments_received is one less than the commitment number of the last commitment_signed message the receiving node has sent, it MUST reuse the same commitment number for its next commitment_signed, otherwise if commitments_received is not equal to the commitment number of the last commitment_signed message the receiving node has sent, it SHOULD fail the channel."
I don't get why we would require the receiving node to send a new commitment immediately, when the node might not have anything to send (maybe all htlcs are expired), and it can't be enforced anyway.
@Roasbeef pointed out that in an exact-replay implementation, in case of expired htlcs the counterparty might want to accept them and fail them right away, which is not clear if the spec allows, or if it is considered a protocol violation (BOLT 4 has specific errors for expiry_too_soon
but how about an expiry in the past)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I'll fix this using your wording.
Now, in practice, you have to remember the state when you send commitment_signed, so you probably just retransmit (that's what I do).
Re: expired HTLCs. The receiver will fail with expiry_too_soon, yes. But note that there's only a narrow window where sender would send them: they had to set their deadline when sending the first commitment_signed message, and that would have expired and forced them to go on-chain. Otherwise, they might have received the commitment_signed, and published the commitment transaction containing the expired HTLC then fulfilled.
How are forwarded HTLC resent when a node restarts ? Suppose we're Bob, with the following setup: Alice and Bob have both signed HTLC Bob can probably tell than its #ab channel has HTLCs going to #bc that have not been processed by its #bc channel yet, but this is probably more complex to implement than the 'retransmit'' model (where Bob's #bc channel would just resend unacked messages). |
Fabrice Drouin <notifications@github.com> writes:
How are forwarded HTLC resent when a node restarts ?
Suppose we're Bob, with the following setup:
Alice <-- channel #ab --> Bob <-- channel #bc --> Carol
Alice and Bob have both signed HTLC #1 and #2.
Bob has forwarded HTLC #1 and #2 but has not signed anything yet.
Bob's sub-system that handles channel #bc restarts.
iiuc Bob and Carol forgets both HTLCs in channel #bc. Upon reconnection, Bob's counters will match Carol's.
-> How are HTLCs #1 and #2 sent to Carol ?
Bob will probably retransmit, but certainly has the option of failing
the incoming HTLCs instead.
Bob can probably tell than its #ab channel has HTLCs going to #bc that have not been processed by its #bc channel yet, but this is probably more complex to implement than the 'retransmit'' model (where Bob's #bc channel would just resend unacked messages).
That's certainly a valid implementtion, though
scan-and-retransmit-as-required is how I implemented reconnection in our
old daemon. Otherwise you need to persist that information, which means
hitting the database on ever send...
Cheers,
Rusty.
|
Yes, the "rollback" option mimics the batch protocol: one batch "write" call, vs one write call per send with the "retransmit" option. Same amount of data to write in both cases, but with less calls (though it's probably much less of a win in the case of database calls). |
02-peer-protocol.md
Outdated
a redundant `funding_locked` if it receives one. | ||
|
||
On reconnection, a node MUST transmit `channel_reestablish` | ||
for each channel, and MUST wait for to receive the other node's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and MUST wait
forto receive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On reconnection, a node MUST transmit
channel_reestablish
for each channel
Is it true even for not-yet-confirmed channels? Must a node send a channel_reestablish
with both counters at zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Although I had to fix the definition of the revocations_received; it was defined to be the commit index of last-received revocation, which is undefined. So I added one, and added a revocations_received explanation paragraph.
02-peer-protocol.md
Outdated
@@ -958,8 +958,8 @@ for each channel, and MUST wait for to receive the other node's | |||
`channel_reestablish` message before sending any other messages for | |||
that channel. The sending node MUST set `commitments_received` to the | |||
commitment number of the last `commitment_signed` it received, and | |||
MUST set `revocations_received` to the commitment number of the last | |||
`revoke_and_ack` message received. | |||
MUST set `revocations_received` to one greater than the commitment number of the last |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about calling the fields next_commitment_number
and next_revocation_number
? It would solve the initialization problem, and would also fit nicely with the new phrasing in 3f9748b.
Edit: nevermind, the sentence confused me a little, but it seems revocations_received
will actually always be equal to the number of distinct revoke_and_ack
received, so it is fine as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually changed it to explicitly be the next indices, then read your updated comment before pushing. It makes some sense, and it's consistent between revocations and commitments: this means the commitment value is 1 greater than in the previous definition though (effectively still the total number received, but counting the signature from open/accept too).
Pierre-Marie Padiou <notifications@github.com> writes:
pm47 commented on this pull request.
> @@ -958,8 +958,8 @@ for each channel, and MUST wait for to receive the other node's
`channel_reestablish` message before sending any other messages for
that channel. The sending node MUST set `commitments_received` to the
commitment number of the last `commitment_signed` it received, and
-MUST set `revocations_received` to the commitment number of the last
-`revoke_and_ack` message received.
+MUST set `revocations_received` to one greater than the commitment number of the last
How about calling the fields `next_commitment_number` and `next_revocation_number`? It would solve the initialization problem, and would also fit nicely with the new phrasing in 3f9748b.
Yes, agreed.
Here's the new text:
On reconnection, a node MUST transmit `channel_reestablish`
for each channel, and MUST wait for to receive the other node's
`channel_reestablish` message before sending any other messages for
that channel. The sending node MUST set `next_local_commitment_number` to the
commitment number of the next `commitment_signed` it expects to receive, and
MUST set `next_remote_revocation_number` to the commitment number of the
next `revoke_and_ack` message it expects to receive.
If `next_local_commitment_number` is equal to the commitment number of
the last `commitment_signed` message the receiving node has sent, it
MUST reuse the same commitment number for its next `commitment_signed`,
otherwise if `next_local_commitment_number` is not one greater than the commitment number of the
last `commitment_signed` message the receiving node has sent, it
SHOULD fail the channel.
If `next_remote_revocation_number` is equal to the commitment number of
the last `revoke_and_ack` the receiving node has sent, it MUST re-send
the `revoke_and_ack`, otherwise if `next_remote_revocation_number` is not
equal to one greater than the commitment number of the last `revoke_and_ack` the
receiving node has sent (or equal to zero if none have been sent), it SHOULD fail the channel.
...
Note that the `next_local_commitment_number` starts at 1 since
commitment number 0 is created during opening.
`next_remote_revocation_number` will be 0 until the
`commitment_signed` for commitment number 1 is received, at which
point the revocation for commitment number 0 is sent.
…
--
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub:
#181 (review)
|
02-peer-protocol.md
Outdated
fail the channel if they occur. | ||
channel, and the following requirements do not apply. | ||
|
||
On reconnection, a node MUST retransmit `funding_locked` unless it has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving this paragraph below the next one, and rephrase it to:
"If next_local_commitment_number
is zero in both the channel_reestablish
it sent and received, then the node MUST retransmit funding_locked
, otherwise it MUST NOT. On reconnection, a node MUST ignore a redundant funding_locked
if it receives one."
Would that work? I find it more straightforward to understand and implement, and I think there is an ambiguity in case of discarded update_*
in the current version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brilliant! Patched exactly like that. 1, not zero (commit 0 is created during opening).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, right
5f419c2
to
f4d870f
Compare
I squashed fixme commits, for final review phase. |
f4d870f
to
1197db0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I was a bit reluctant with the "rollback" mechanism (A) described in this PR, because it seemed too complex for small gains. I was more in favor of a more basic "simple replay" mechanism (B), which I found more simple to grasp.
Having implemented the two variants:
- (B) is indeed conceptually simpler, and easy to debug, but it is actually not easier to implement, because of the need to keep track of all messages. The acknowledgment logic can become quite convoluted (it was in my implementation at least), also filtering out old messages was a real pain and I suspect it opened up a few vulnerabilities
- (B) is less performant because we need to write all updates to disk
- the rollback logic of (A) is simpler than I thought initially, and with the latest iterations there is no corner case at initialization which is nice
- with (A) I still have trouble wrapping my head around the fact that the order of messages is not preserved. For example
revoke_and_ack
,commitment_signed
andshutdown
messages can be re-sent in a different order. Even if it appears to work fine, there is still something magical about it. I guess I'll be more confortable with it once I add more test on edge cases.
All things considered I changed my mind and now prefer (A). Thanks for your work!
We can't do that, so allow "write, then send". That fails on the side of timing out, rather than having a channel which can't be used. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This adds a message for each channel reconnect (after we've sent/received `funding_signed`, ie. when we rememeber the channel), which says exactly how many `commitment_signed` and `revoke_and_ack` we've received. Really, we could use one bit for each (they could only be missing the last one), but better to be clear. This leaves the "rollback if didn't get commitment_signed" requirement, but avoids any need to handle update duplicates or wonder what update number a `commitment_signed` applies to after reconnect. Many thanks to pm47 and roasbeef especially for constructive feedback which made this far better than I originally had. Closes: lightning#172 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1197db0
to
7333105
Compare
OK, so if we keep rollback and add per-channel reconnect counters, this is what it looks like. I know @adiabat wanted simple retransmit, but I think this is actually simpler.
Caveat: I haven't actually implemented this yet, but I'm about to try to do so...