-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BOLT 1: introduce warning messages, reduce requirements to send (hard) errors #834
Conversation
I've started refactoring clightning internally so all our error routines are flagged soft or hard. I've got a hack where soft errors send out msg number 1 (undefined, currently ignored) instead of 17, which makes it more effective with older clightning which treats all error (17) messages as "hard". Not sure if we should endorse that as a separate "warning" message? The other approach would be to redefine 17 (error) as a soft error and create a new message (2?) as "hard error"? |
Interesting, eclair also defaults to fatal errors and closes channels when they happen, so old eclair will keep treating all errors as fatal regardless of the tlv flag. When the I think a separate (odd) message would be more appropriate, as it would work out of the box for non-upgraded nodes (they would simply ignore it so won't close a channel). For backwards-compatibility we need to keep the existing message for fatal errors, the new ones should be the non-fatal errors. |
Yes, though we've already had to except various errors from failing the channel, since LND sends them (e.g. if you are too slow in reestablishing a channel). Probably because LND didn't close on errors, so it didn't treat them as fatal itself. Let me try a "warning" message instead then... |
e6b540b
to
528805b
Compare
OK, now it's a separate warning message, and I went through and revised and made consistent all the errors/closing/failing language. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good idea, it will be much simpler to notice issues and have users take unilateral actions without needing to contact their peer to understand what went wrong.
Hi all, I started thinking about possible extensions to how the spec deals with errors towards the end of last year, and wanted to drop some thoughts in here. A string based error is nice for humans to read, but doesn’t give code much of a chance of elegantly reacting. If we’re going to make a change now, I think it would be worth considering the following so that we don’t have to come back again to make further updates:
I don’t want to derail conversation here , so will make my own ML post with details if this sounds reasonable. |
I think it could be reasonable for this new warning message (I'm not sure it makes sense for the Do you have some concrete examples of what warnings could trigger an automatic action? TBH it feels like the protocols in the spec should have all cases covered to avoid needing it, but there's always a difference between theory and practice. |
In general, these errors should never occur. It means one side has a bug. Unfortunately, that happens in practice, leading to workarounds where errors are ignored. Hence distinguishing between "Warning: you sent crap" and "Fatal: no point talking any more, I've killed the channel". As we cannot define all possible bugs, the best you can do is log something and retry, and eventually give up (just as you would if there were no progress without any error). There is a strong case of other kinds of errors: in the offers spec I included such things, as they can genuinely happen, and we already have onion errors. We've also talked about soft-errors in the onion context (e.g. "I don't want to worry you, but there's a delay...") |
Right, I think I misunderstood the original reasoning for this warning message - thanks for clarifying! Error codes make less in the infinite range of bugs that one could run into :')
@t-bast I'll make a ML post with my case for these error codes, rather than clogging up this issue more since they are related, but not directly relevant to a soft warning error. |
This takes from the draft spec at lightning/bolts#834 Note that if this draft does not get included, the peer will simply ignore the warning message (we always close the connection afterwards anyway). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Added: Protocol: we now report the new (draft) warning message.
This takes from the draft spec at lightning/bolts#834 Note that if this draft does not get included, the peer will simply ignore the warning message (we always close the connection afterwards anyway). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Added: Protocol: we now report the new (draft) warning message.
This takes from the draft spec at lightning/bolts#834 Note that if this draft does not get included, the peer will simply ignore the warning message (we always close the connection afterwards anyway). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Added: Protocol: we now report the new (draft) warning message.
This takes from the draft spec at lightning/bolts#834 Note that if this draft does not get included, the peer will simply ignore the warning message (we always close the connection afterwards anyway). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Added: Protocol: we now report the new (draft) warning message.
I'm working on the error pathways for the dual-funding RBF right now. This disambiguation between "warning" and "error" is very useful in that context. Particularly how 'warning/close connection' allows a retry from initial state conditions, in other words signals to abandon the failed RBF attempt (with an explanation as to why.) Error codes are indeed useful for specific, narrowly defined interactions and recovery pathways, like payment attempts. In c-lightning we also encode them for our RPC errors, to help identify common cases. I can see a use case where we'd want error codes specifically for RBF failure cases, however these can be added ad-hoc for specific and well-understood cases in the future. In the general case, returning a warning to the caller with an explanation and re-starting the attempt is a robust solution; error coded response handling logic has a tendency to degrade robustness over time ime. |
This takes from the draft spec at lightning/bolts#834 Note that if this draft does not get included, the peer will simply ignore the warning message (we always close the connection afterwards anyway). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Added: Protocol: we now report the new (draft) warning message.
Add support for logging warning messages as introduced in lightning/bolts#834 Support for sending warning messages instead of current errors will be added in a later PR.
Add support for logging warning messages as introduced in lightning/bolts#834 Support for sending warning messages instead of current errors will be added in a later PR.
lightning/bolts#834 recommends sending warning messages instead of connection-level errors in some cases, which avoids unnecessary channel closure.
lightning/bolts#834 recommends sending warning messages instead of connection-level errors in some cases, which avoids unnecessary channel closure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I finally implemented this in eclair and tested with c-lightning, and we exchange warning messages correctly.
Just needs a rebase and it's an ACK for me!
lightning/bolts#834 recommends sending warning messages instead of connection-level errors in some cases, which avoids unnecessary channel closure.
Needs rebase, otherwise ACK 4f2ec06 |
If the warning message has no context at all, then how is the caller meant to parse the message to restart the attempt? Totally disagree that error code response handling logic degrades over time, when was the last time you needed to redefine how you handle the HTTP 302 error code? |
I really don't understand what you mean. We've already found it very useful many times in the wild compared to "the remote peer just disconnected on me, so something must be wrong with what I sent them but I have no idea what". |
Wen rebase, lets land this :) |
4f2ec06
to
5c745fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 5c745fd, let's land this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically there, want to come to come conclusion on all-0 errors.
01-messaging.md
Outdated
|
||
#### Requirements | ||
|
||
The channel is referred to by `channel_id`, unless `channel_id` is 0 (i.e. all bytes are 0), in which case it refers to all channels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the last thing we need to resolve, I'm still highly unconvinced here - yes, some of the historical places all-0 channel ids have been used was nonsense, but the idea that a peer will have a channel with me and I don't want to talk to them anymore and I'm not able to tell them that seems really wrong? Implementing all the logic to stay connected to them, parse their messages for channel IDs, and then fail those channels seems substantially complicated, and probably not relevant in cases where you just want a peer to "go away". We can change how we communicate this if backwards compatibility is the concern, but it just seems very strange to me to not be able to communicate this at all (when we currently can).
02-peer-protocol.md
Outdated
- if it hasn't sent a `funding_locked` yet: | ||
- MAY reply to a `shutdown` message with a `shutdown` | ||
- once there are no outstanding updates on the peer, UNLESS it has already sent a `shutdown`: | ||
- MUST reply to a `shutdown` message with a `shutdown` | ||
- if both nodes advertised the `option_upfront_shutdown_script` feature, and the receiving node received a non-zero-length `shutdown_scriptpubkey` in `open_channel` or `accept_channel`, and that `shutdown_scriptpubkey` is not equal to `scriptpubkey`: | ||
- MAY send a `warning`. | ||
- MUST fail the connection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we getting rid of the "fail the connection" terminology in this pr :)
02-peer-protocol.md
Outdated
- if the `scriptpubkey` is not in one of the above forms: | ||
- SHOULD fail the connection. | ||
- SHOULD send a `warning`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general (including here and the below warning
conversion, but basically everywhere), we should probablyld probably have something that says "and ignore the message" or so. That would imply that the sender can re-send the shutdown
with a different script (or whatever message in most contexts), which seems to be your goal here. Maybe it could be a general comment at the top of bolt 0 or 1 whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with extending the definition
Successfully used this to debug an issue in a C-Lightning unmerged PR from LDK, so I'd call that cross-implementation-more-than-tested :) |
51b8552
to
570d356
Compare
Restored all-zero errors, and trivial rebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 570d356, thanks! Let's finally land this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK.
I pointed out a few stylistic things and asked a few clarification questions.
One thing that I realized towards the end there is often the phrase:
MUST send a
warning
and close the connection, or send anerror
and fail the channel.
I believe this is defined as things a node MUST do after sending warning
or error
so it seems very redundant to repeat this throughout the document.
00-introduction.md
Outdated
@@ -132,6 +132,20 @@ See [BOLT #11: Invoice Protocol for Lightning Payments](11-payment-encoding.md) | |||
* _See related: [closing transaction](#closing-transaction), [funding transaction](#funding-transaction), [penalty transaction](#penalty-transaction)_ | |||
* _See types: [revoked commitment transaction](#revoked-commitment-transaction)_ | |||
|
|||
* #### *Fail the channel*: | |||
* This is a forced close of the channel. Very early on (before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be a contradiction to #942 ?
01-messaging.md
Outdated
The 2-byte `len` field indicates the number of bytes in the immediately following field. | ||
1. type: 1 (`warning`) | ||
2. data: | ||
* [`channel_id`:`channel_id`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarification: This assumes / includes temporary_channel_id
during funding of the channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
01-messaging.md
Outdated
- SHOULD send `error` for protocol violations or internal errors that make channels unusable or that make further communication unusable. | ||
- SHOULD send `error` with the unknown `channel_id` in reply to messages of type `32`-`255` related to unknown channels. | ||
- when sending `error`: | ||
- MUST fail the channel(s) referred to by the error message. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: referred to by the error
message (pending on: #941 )
the message has only one channel_id
field. BOLT 1 says if the id
is zero it refers to all channels between the peers. We ignore the situation where we have n
channels and want to refer to m
channels with 1 error
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd likely require multiple warning messages, one for each channel. Though it's hard to imagine a scenario where m < n and m > 1 channels need to close due to the same issue. The all-channel error was mostly intended to "there's something wrong but I can't identify the channel it's affecting" afaik, so "there's something wrong with one of these m channels" isn't much more useful IMHO.
01-messaging.md
Outdated
- when sending `error`: | ||
- MUST fail the channel(s) referred to by the error message. | ||
- when sending `warning`: | ||
- MAY set `channel_id` to all zero if the warning is not related to a specific channel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you mean the channel_id
in warning
and not the channel_id
of the channels between the peers?
01-messaging.md
Outdated
- MUST fail the channel(s) referred to by the error message. | ||
- when sending `warning`: | ||
- MAY set `channel_id` to all zero if the warning is not related to a specific channel. | ||
- MAY close the connection after sending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would that not hold for error
? I understood error
to be the case that that a force close should trigger. I see good reasons that after publishing the commitment transaction a peer might want to close the connection.
01-messaging.md
Outdated
- if no existing channel is referred to by the message: | ||
- MUST fail the channel referred to by `channel_id`, if that channel is with the sending node. | ||
- upon receiving `warning`: | ||
- SHOULD log the message for later diagnosis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "log" mean? Keep in memory? Write to a log file? (IIRC we don't sepc logging in the bolts)
02-peer-protocol.md
Outdated
@@ -484,7 +486,7 @@ A non-funding node (fundee): | |||
transaction after a timeout of 2016 blocks. | |||
|
|||
From the point of waiting for `funding_locked` onward, either node MAY | |||
fail the channel if it does not receive a required response from the | |||
send an `error` and fail the channel if it does not receive a required response from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like a contradiction here to add that it MAY fail the channel as for what a node MUST do after sending error
is to fail the channel (BOLT 01 line 341)
02-peer-protocol.md
Outdated
- if the `scriptpubkey` is not in one of the above forms: | ||
- SHOULD fail the connection. | ||
- SHOULD send a `warning`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with extending the definition
02-peer-protocol.md
Outdated
@@ -647,7 +650,8 @@ The sending node: | |||
The receiving node: | |||
- if the `signature` is not valid for either variant of closing transaction | |||
specified in [BOLT #3](03-transactions.md#closing-transaction) OR non-compliant with LOW-S-standard rule<sup>[LOWS](https://github.com/bitcoin/bitcoin/pull/6769)</sup>: | |||
- MUST fail the connection. | |||
- MUST send a `warning` and close the connection, or send an | |||
`error` and fail the channel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant "fail the channel"
02-peer-protocol.md
Outdated
@@ -673,7 +677,8 @@ The receiving node: | |||
- MUST propose a `fee_satoshis` in the overlap between received and (about-to-be) sent `fee_range`. | |||
- otherwise, if `fee_satoshis` is not strictly between its last-sent `fee_satoshis` | |||
and its previously-received `fee_satoshis`, UNLESS it has since reconnected: | |||
- SHOULD fail the connection. | |||
- SHOULD send a `warning` and close the connection, or send an | |||
`error` and fail the channel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant "fail the channel"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that wording comes several times I stop pointing this out
01-messaging.md
Outdated
- SHOULD send `error` for protocol violations or internal errors that make channels unusable or that make further communication unusable. | ||
- SHOULD send `error` with the unknown `channel_id` in reply to messages of type `32`-`255` related to unknown channels. | ||
- when sending `error`: | ||
- MUST fail the channel(s) referred to by the error message. | ||
- when sending `warning`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were restoring all-0s? Looks like its still removed here, and further down.
570d356
to
58287e2
Compare
If you really want to have a hard error, set the new fatal field. We can still get upset if this happens during testing, and we can still log when they happen, but not break everything quite so easily!This follows the suggestion by @t-bast to create a new message (1) "warning". I've implemented this for c-lightning already as a PR.