Retransmission with state counters #181

rustyrussell · 2017-05-24T05:47:42Z

OK, so if we keep rollback and add per-channel reconnect counters, this is what it looks like. I know @adiabat wanted simple retransmit, but I think this is actually simpler.

Caveat: I haven't actually implemented this yet, but I'm about to try to do so...

Roasbeef

On an initial pass, I think this works. It's also much simpler than any of the prior approaches, and I generally think that when designing protocols it's better to be explicit when possible.

I may be able to take a stab at implementing this sometime near the tail end of the next week. When I get to that point, I'll report back with any issues or edge cases I ran into during the implementation process.

Thanks for speccing this out!

Roasbeef · 2017-05-26T21:59:53Z

02-peer-protocol.md

-* `update_` messages: acknowledged by `revoke_and_ack`.
-* `commitment_signed`: acknowledged by `revoke_and_ack`.
-* `revoke_and_ack`: acknowledged by `commitment_signed` or `closing_signed`
-* `shutdown`: acknowledged by `closing_signed`.


Do we no longer care about retransmitting shutdown messages? I guess it wasn't really necessary as upon reconnection, if one still wishes to close down the channel, the should just send another shutdown message and re-start the fee negotiation workflow.

Good point, this should be explicit. We could add it in to the commit counter, but I prefer the simplicity of "always retransmit" and associated "ignore duplicates".

Roasbeef · 2017-05-26T22:02:16Z

02-peer-protocol.md

@@ -919,18 +919,24 @@ independent of particular channels; their transmission requirements
 are covered there, and other than being transmitted after `init` (like
 any message), they are independent of requirements here.

+1. type: 136 (`channel_reestablish`)


Still reading this diff, but so far: I dig this! Such an explicit message helps to sidestep the "The Uncertain Signature Problem" all together.

Roasbeef · 2017-05-26T22:22:03Z

02-peer-protocol.md

+`commitment_signed` messages the receiving node has sent, it MUST send
+a set of `update_` messages against the prior commitment transaction
+followed by `commitment_signed`, otherwise if `commitments_received`
+is not equal to the number of `commitment_signed` messages the


As this indicates that the channel states have desynchronized, or the transmitting party had persistence corruption?

Yeah, something's badly wrong. It's also a big hint that you only have to handle the two cases, and only then until you've received the revoke_and_ack.

If we add multiple commits-in-flight, this "one less" becomes "N less".

Roasbeef · 2017-05-26T23:00:52Z

02-peer-protocol.md

+
+If `commitments_received` is one less than the number of
+`commitment_signed` messages the receiving node has sent, it MUST send
+a set of `update_` messages against the prior commitment transaction


Does this mean only the outbound send of update_* messages, or both inbound (updates I've made) and outbound (updates you've made)?

Outbound. If you've missed inbound, the other node will re-xmit because the commitments_received you send is one too low.

sstone · 2017-05-27T14:19:05Z

02-peer-protocol.md

+`commitments_received` is not equal to the commitment number of the
+last `commitment_signed` message the receiving node has sent, it
+SHOULD fail the channel.
+


What happens if the receiving node does not remember the updates it has sent ? sending a commitment_signed would be illegal I think

Always resend updates then commitment signed. And counters in reestablish msg make it clear whether it received previous or not.

Do you mean that:
"it MUST send a set of update_ messages against the prior commitment transaction followed by commitment_signed"
is actually:
"it MUST re-send the same update_ messages that lead to the prior commitment, followed by the same commitment_signed?

This is particularly important if a node does not simply retransmit the exact same update_ messages as previously sent.

So no you don't mean that!

If we are permissive on what the replacing commitment might be, why don't we just say:

"If commitments_received is one less than the commitment number of the last commitment_signed message the receiving node has sent, it MUST reuse the same commitment number for its next commitment_signed, otherwise if commitments_received is not equal to the commitment number of the last commitment_signed message the receiving node has sent, it SHOULD fail the channel."

I don't get why we would require the receiving node to send a new commitment immediately, when the node might not have anything to send (maybe all htlcs are expired), and it can't be enforced anyway.

@Roasbeef pointed out that in an exact-replay implementation, in case of expired htlcs the counterparty might want to accept them and fail them right away, which is not clear if the spec allows, or if it is considered a protocol violation (BOLT 4 has specific errors for expiry_too_soon but how about an expiry in the past)?

You're right. I'll fix this using your wording.

Now, in practice, you have to remember the state when you send commitment_signed, so you probably just retransmit (that's what I do).

Re: expired HTLCs. The receiver will fail with expiry_too_soon, yes. But note that there's only a narrow window where sender would send them: they had to set their deadline when sending the first commitment_signed message, and that would have expired and forced them to go on-chain. Otherwise, they might have received the commitment_signed, and published the commitment transaction containing the expired HTLC then fulfilled.

sstone · 2017-05-31T16:22:17Z

How are forwarded HTLC resent when a node restarts ?

Suppose we're Bob, with the following setup:
Alice <-- channel #ab --> Bob <-- channel #bc --> Carol

Alice and Bob have both signed HTLC #1 and #2.
Bob has forwarded HTLC #1 and #2 but has not signed anything yet.
Bob's sub-system that handles channel #bc restarts.
iiuc Bob and Carol forgets both HTLCs in channel #bc. Upon reconnection, Bob's counters will match Carol's.
-> How are HTLCs #1 and #2 sent to Carol ?

Bob can probably tell than its #ab channel has HTLCs going to #bc that have not been processed by its #bc channel yet, but this is probably more complex to implement than the 'retransmit'' model (where Bob's #bc channel would just resend unacked messages).

rustyrussell · 2017-06-01T03:48:09Z

Fabrice Drouin <notifications@github.com> writes:

How are forwarded HTLC resent when a node restarts ? Suppose we're Bob, with the following setup: Alice <-- channel #ab --> Bob <-- channel #bc --> Carol Alice and Bob have both signed HTLC #1 and #2. Bob has forwarded HTLC #1 and #2 but has not signed anything yet. Bob's sub-system that handles channel #bc restarts. iiuc Bob and Carol forgets both HTLCs in channel #bc. Upon reconnection, Bob's counters will match Carol's. -> How are HTLCs #1 and #2 sent to Carol ?

Bob will probably retransmit, but certainly has the option of failing the incoming HTLCs instead.

Bob can probably tell than its #ab channel has HTLCs going to #bc that have not been processed by its #bc channel yet, but this is probably more complex to implement than the 'retransmit'' model (where Bob's #bc channel would just resend unacked messages).

That's certainly a valid implementtion, though scan-and-retransmit-as-required is how I implemented reconnection in our old daemon. Otherwise you need to persist that information, which means hitting the database on ever send... Cheers, Rusty.

sstone · 2017-06-01T08:54:42Z

Yes, the "rollback" option mimics the batch protocol: one batch "write" call, vs one write call per send with the "retransmit" option. Same amount of data to write in both cases, but with less calls (though it's probably much less of a win in the case of database calls).

pm47 · 2017-06-15T09:34:37Z

02-peer-protocol.md

+a redundant `funding_locked` if it receives one.
+
+On reconnection, a node MUST transmit `channel_reestablish`
+for each channel, and MUST wait for to receive the other node's


and MUST wait ~~for~~ to receive

On reconnection, a node MUST transmit channel_reestablish for each channel

Is it true even for not-yet-confirmed channels? Must a node send a channel_reestablish with both counters at zero?

Yes. Although I had to fix the definition of the revocations_received; it was defined to be the commit index of last-received revocation, which is undefined. So I added one, and added a revocations_received explanation paragraph.

pm47 · 2017-06-19T16:30:09Z

02-peer-protocol.md

@@ -958,8 +958,8 @@ for each channel, and MUST wait for to receive the other node's
 `channel_reestablish` message before sending any other messages for
 that channel.  The sending node MUST set `commitments_received` to the
 commitment number of the last `commitment_signed` it received, and
-MUST set `revocations_received` to the commitment number of the last
-`revoke_and_ack` message received.
+MUST set `revocations_received` to one greater than the commitment number of the last


~~How about calling the fields next_commitment_number and next_revocation_number? It would solve the initialization problem, and would also fit nicely with the new phrasing in 3f9748b.~~

Edit: nevermind, the sentence confused me a little, but it seems revocations_received will actually always be equal to the number of distinct revoke_and_ack received, so it is fine as is.

I actually changed it to explicitly be the next indices, then read your updated comment before pushing. It makes some sense, and it's consistent between revocations and commitments: this means the commitment value is 1 greater than in the previous definition though (effectively still the total number received, but counting the signature from open/accept too).

rustyrussell · 2017-06-22T00:49:05Z

Pierre-Marie Padiou <notifications@github.com> writes:

pm47 commented on this pull request. > @@ -958,8 +958,8 @@ for each channel, and MUST wait for to receive the other node's `channel_reestablish` message before sending any other messages for that channel. The sending node MUST set `commitments_received` to the commitment number of the last `commitment_signed` it received, and -MUST set `revocations_received` to the commitment number of the last -`revoke_and_ack` message received. +MUST set `revocations_received` to one greater than the commitment number of the last How about calling the fields `next_commitment_number` and `next_revocation_number`? It would solve the initialization problem, and would also fit nicely with the new phrasing in 3f9748b.

Yes, agreed. Here's the new text: On reconnection, a node MUST transmit `channel_reestablish` for each channel, and MUST wait for to receive the other node's `channel_reestablish` message before sending any other messages for that channel. The sending node MUST set `next_local_commitment_number` to the commitment number of the next `commitment_signed` it expects to receive, and MUST set `next_remote_revocation_number` to the commitment number of the next `revoke_and_ack` message it expects to receive. If `next_local_commitment_number` is equal to the commitment number of the last `commitment_signed` message the receiving node has sent, it MUST reuse the same commitment number for its next `commitment_signed`, otherwise if `next_local_commitment_number` is not one greater than the commitment number of the last `commitment_signed` message the receiving node has sent, it SHOULD fail the channel. If `next_remote_revocation_number` is equal to the commitment number of the last `revoke_and_ack` the receiving node has sent, it MUST re-send the `revoke_and_ack`, otherwise if `next_remote_revocation_number` is not equal to one greater than the commitment number of the last `revoke_and_ack` the receiving node has sent (or equal to zero if none have been sent), it SHOULD fail the channel. ... Note that the `next_local_commitment_number` starts at 1 since commitment number 0 is created during opening. `next_remote_revocation_number` will be 0 until the `commitment_signed` for commitment number 1 is received, at which point the revocation for commitment number 0 is sent.

…

-- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: #181 (review)

pm47 · 2017-06-26T14:26:54Z

02-peer-protocol.md

-fail the channel if they occur.
+channel, and the following requirements do not apply.
+
+On reconnection, a node MUST retransmit `funding_locked` unless it has


How about moving this paragraph below the next one, and rephrase it to:
"If next_local_commitment_number is zero in both the channel_reestablish it sent and received, then the node MUST retransmit funding_locked, otherwise it MUST NOT. On reconnection, a node MUST ignore a redundant funding_locked if it receives one."

Would that work? I find it more straightforward to understand and implement, and I think there is an ambiguity in case of discarded update_* in the current version.

Brilliant! ~~Patched exactly like that.~~ 1, not zero (commit 0 is created during opening).

Oops, right

rustyrussell · 2017-06-27T01:25:29Z

I squashed fixme commits, for final review phase.

pm47

Initially I was a bit reluctant with the "rollback" mechanism (A) described in this PR, because it seemed too complex for small gains. I was more in favor of a more basic "simple replay" mechanism (B), which I found more simple to grasp.

Having implemented the two variants:

(B) is indeed conceptually simpler, and easy to debug, but it is actually not easier to implement, because of the need to keep track of all messages. The acknowledgment logic can become quite convoluted (it was in my implementation at least), also filtering out old messages was a real pain and I suspect it opened up a few vulnerabilities
(B) is less performant because we need to write all updates to disk
the rollback logic of (A) is simpler than I thought initially, and with the latest iterations there is no corner case at initialization which is nice
with (A) I still have trouble wrapping my head around the fact that the order of messages is not preserved. For example revoke_and_ack, commitment_signed and shutdown messages can be re-sent in a different order. Even if it appears to work fine, there is still something magical about it. I guess I'll be more confortable with it once I add more test on edge cases.

All things considered I changed my mind and now prefer (A). Thanks for your work!

We can't do that, so allow "write, then send". That fails on the side of timing out, rather than having a channel which can't be used. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

This adds a message for each channel reconnect (after we've sent/received `funding_signed`, ie. when we rememeber the channel), which says exactly how many `commitment_signed` and `revoke_and_ack` we've received. Really, we could use one bit for each (they could only be missing the last one), but better to be clear. This leaves the "rollback if didn't get commitment_signed" requirement, but avoids any need to handle update duplicates or wonder what update number a `commitment_signed` applies to after reconnect. Many thanks to pm47 and roasbeef especially for constructive feedback which made this far better than I originally had. Closes: lightning#172 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell requested review from Roasbeef and sstone May 24, 2017 05:47

rustyrussell force-pushed the retransmission-with-state-counters branch 3 times, most recently from 7657d11 to 96e6c1b Compare May 25, 2017 04:47

Roasbeef reviewed May 26, 2017

View reviewed changes

sstone reviewed May 27, 2017

View reviewed changes

cdecker added this to the v1.0 milestone Jun 14, 2017

pm47 reviewed Jun 15, 2017

View reviewed changes

pm47 reviewed Jun 19, 2017

View reviewed changes

pm47 reviewed Jun 26, 2017

View reviewed changes

rustyrussell force-pushed the retransmission-with-state-counters branch from 5f419c2 to f4d870f Compare June 27, 2017 01:25

rustyrussell force-pushed the retransmission-with-state-counters branch from f4d870f to 1197db0 Compare June 27, 2017 01:56

pm47 approved these changes Jun 27, 2017

View reviewed changes

rustyrussell added 2 commits June 28, 2017 06:37

BOLT 2: don't insist we atomically send and commit to disk.

6a867ae

We can't do that, so allow "write, then send". That fails on the side of timing out, rather than having a channel which can't be used. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell force-pushed the retransmission-with-state-counters branch from 1197db0 to 7333105 Compare June 27, 2017 21:09

rustyrussell merged commit ac8b830 into lightning:master Jun 27, 2017

pm47 mentioned this pull request Jul 14, 2017

Implement state-counter retransmission ACINQ/eclair#105

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retransmission with state counters #181

Retransmission with state counters #181

rustyrussell commented May 24, 2017

Roasbeef left a comment

Roasbeef May 26, 2017

rustyrussell May 27, 2017

Roasbeef May 26, 2017

Roasbeef May 26, 2017

rustyrussell May 27, 2017

Roasbeef May 26, 2017

rustyrussell May 27, 2017

sstone May 27, 2017

rustyrussell May 29, 2017

pm47 Jun 14, 2017 •

edited

Loading

rustyrussell Jun 17, 2017

sstone commented May 31, 2017 •

edited

Loading

rustyrussell commented Jun 1, 2017 via email

sstone commented Jun 1, 2017

pm47 Jun 15, 2017

pm47 Jun 15, 2017

rustyrussell Jun 17, 2017

pm47 Jun 19, 2017 •

edited

Loading

rustyrussell Jun 22, 2017

rustyrussell commented Jun 22, 2017 via email

pm47 Jun 26, 2017

rustyrussell Jun 27, 2017 •

edited

Loading

pm47 Jun 27, 2017

rustyrussell commented Jun 27, 2017

pm47 left a comment

Retransmission with state counters #181

Retransmission with state counters #181

Conversation

rustyrussell commented May 24, 2017

Roasbeef left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pm47 Jun 14, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sstone commented May 31, 2017 • edited Loading

rustyrussell commented Jun 1, 2017 via email

sstone commented Jun 1, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pm47 Jun 19, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rustyrussell commented Jun 22, 2017 via email

Choose a reason for hiding this comment

rustyrussell Jun 27, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rustyrussell commented Jun 27, 2017

pm47 left a comment

Choose a reason for hiding this comment

pm47 Jun 14, 2017 •

edited

Loading

sstone commented May 31, 2017 •

edited

Loading

pm47 Jun 19, 2017 •

edited

Loading

rustyrussell Jun 27, 2017 •

edited

Loading