Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BOLT 1: introduce warning messages, reduce requirements to send (hard) errors #834

Closed
wants to merge 0 commits into from

Conversation

rustyrussell
Copy link
Collaborator

@rustyrussell rustyrussell commented Jan 20, 2021

If you really want to have a hard error, set the new fatal field. We can still get upset if this happens during testing, and we can still log when they happen, but not break everything quite so easily!

This follows the suggestion by @t-bast to create a new message (1) "warning". I've implemented this for c-lightning already as a PR.

@rustyrussell
Copy link
Collaborator Author

I've started refactoring clightning internally so all our error routines are flagged soft or hard. I've got a hack where soft errors send out msg number 1 (undefined, currently ignored) instead of 17, which makes it more effective with older clightning which treats all error (17) messages as "hard".

Not sure if we should endorse that as a separate "warning" message?

The other approach would be to redefine 17 (error) as a soft error and create a new message (2?) as "hard error"?

@t-bast
Copy link
Collaborator

t-bast commented Jan 20, 2021

Interesting, eclair also defaults to fatal errors and closes channels when they happen, so old eclair will keep treating all errors as fatal regardless of the tlv flag. When the fatal TLV is missing, updated nodes will still want to consider these errors fatal and close the channel.

I think a separate (odd) message would be more appropriate, as it would work out of the box for non-upgraded nodes (they would simply ignore it so won't close a channel). For backwards-compatibility we need to keep the existing message for fatal errors, the new ones should be the non-fatal errors.

@rustyrussell
Copy link
Collaborator Author

Yes, though we've already had to except various errors from failing the channel, since LND sends them (e.g. if you are too slow in reestablishing a channel). Probably because LND didn't close on errors, so it didn't treat them as fatal itself.

Let me try a "warning" message instead then...

@rustyrussell
Copy link
Collaborator Author

OK, now it's a separate warning message, and I went through and revised and made consistent all the errors/closing/failing language.

Copy link
Collaborator

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea, it will be much simpler to notice issues and have users take unilateral actions without needing to contact their peer to understand what went wrong.

01-messaging.md Outdated Show resolved Hide resolved
03-transactions.md Outdated Show resolved Hide resolved
@carlaKC
Copy link
Contributor

carlaKC commented Jan 21, 2021

Hi all, I started thinking about possible extensions to how the spec deals with errors towards the end of last year, and wanted to drop some thoughts in here.

A string based error is nice for humans to read, but doesn’t give code much of a chance of elegantly reacting. If we’re going to make a change now, I think it would be worth considering the following so that we don’t have to come back again to make further updates:

  • Error codes: an enum which describes why the error is being sent
  • Error-specific fields: based on the error code set, could be TLV or a fixed set of fields
  • Fatal bool: similar to the idea introduced here, indicating whether we need to close the channel.

I don’t want to derail conversation here , so will make my own ML post with details if this sounds reasonable.

@t-bast
Copy link
Collaborator

t-bast commented Jan 21, 2021

I think it could be reasonable for this new warning message (I'm not sure it makes sense for the error message though).

Do you have some concrete examples of what warnings could trigger an automatic action? TBH it feels like the protocols in the spec should have all cases covered to avoid needing it, but there's always a difference between theory and practice.

@rustyrussell
Copy link
Collaborator Author

Hi all, I started thinking about possible extensions to how the spec deals with errors towards the end of last year, and wanted to drop some thoughts in here.

A string based error is nice for humans to read, but doesn’t give code much of a chance of elegantly reacting. If we’re going to make a change now, I think it would be worth considering the following so that we don’t have to come back again to make further updates:

* Error codes: an enum which describes why the error is being sent

* Error-specific fields: based on the error code set, could be TLV or a fixed set of fields

* Fatal bool: similar to the idea introduced here, indicating whether we need to close the channel.

In general, these errors should never occur. It means one side has a bug. Unfortunately, that happens in practice, leading to workarounds where errors are ignored. Hence distinguishing between "Warning: you sent crap" and "Fatal: no point talking any more, I've killed the channel".

As we cannot define all possible bugs, the best you can do is log something and retry, and eventually give up (just as you would if there were no progress without any error).

There is a strong case of other kinds of errors: in the offers spec I included such things, as they can genuinely happen, and we already have onion errors. We've also talked about soft-errors in the onion context (e.g. "I don't want to worry you, but there's a delay...")

@carlaKC
Copy link
Contributor

carlaKC commented Jan 26, 2021

In general, these errors should never occur. It means one side has a bug... As we cannot define all possible bugs, the best you can do is log something and retry

Right, I think I misunderstood the original reasoning for this warning message - thanks for clarifying! Error codes make less in the infinite range of bugs that one could run into :')

Do you have some concrete examples of what warnings could trigger an automatic action?

@t-bast I'll make a ML post with my case for these error codes, rather than clogging up this issue more since they are related, but not directly relevant to a soft warning error.

rustyrussell added a commit to rustyrussell/lightning that referenced this pull request Jan 29, 2021
This takes from the draft spec at lightning/bolts#834

Note that if this draft does not get included, the peer will simply
ignore the warning message (we always close the connection afterwards
anyway).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we now report the new (draft) warning message.
@rustyrussell rustyrussell changed the title BOLT 1: make errors default to "soft" errors. BOLT 1: introduce warning messages, reduce requirements to send (hard) errors, remove "all-channel" errors. Jan 29, 2021
rustyrussell added a commit to rustyrussell/lightning that referenced this pull request Feb 1, 2021
This takes from the draft spec at lightning/bolts#834

Note that if this draft does not get included, the peer will simply
ignore the warning message (we always close the connection afterwards
anyway).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we now report the new (draft) warning message.
rustyrussell added a commit to rustyrussell/lightning that referenced this pull request Feb 3, 2021
This takes from the draft spec at lightning/bolts#834

Note that if this draft does not get included, the peer will simply
ignore the warning message (we always close the connection afterwards
anyway).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we now report the new (draft) warning message.
rustyrussell added a commit to ElementsProject/lightning that referenced this pull request Feb 4, 2021
This takes from the draft spec at lightning/bolts#834

Note that if this draft does not get included, the peer will simply
ignore the warning message (we always close the connection afterwards
anyway).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we now report the new (draft) warning message.
@niftynei
Copy link
Collaborator

niftynei commented Feb 8, 2021

As we cannot define all possible bugs, the best you can do is log something and retry, and eventually give up (just as you would if there were no progress without any error).

I'm working on the error pathways for the dual-funding RBF right now. This disambiguation between "warning" and "error" is very useful in that context. Particularly how 'warning/close connection' allows a retry from initial state conditions, in other words signals to abandon the failed RBF attempt (with an explanation as to why.)

Error codes are indeed useful for specific, narrowly defined interactions and recovery pathways, like payment attempts. In c-lightning we also encode them for our RPC errors, to help identify common cases. I can see a use case where we'd want error codes specifically for RBF failure cases, however these can be added ad-hoc for specific and well-understood cases in the future.

In the general case, returning a warning to the caller with an explanation and re-starting the attempt is a robust solution; error coded response handling logic has a tendency to degrade robustness over time ime.

vibhaa pushed a commit to spider-pcn/lightning that referenced this pull request Mar 24, 2021
This takes from the draft spec at lightning/bolts#834

Note that if this draft does not get included, the peer will simply
ignore the warning message (we always close the connection afterwards
anyway).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we now report the new (draft) warning message.
t-bast added a commit to ACINQ/eclair that referenced this pull request Jun 30, 2021
Add support for logging warning messages as introduced in
lightning/bolts#834

Support for sending warning messages instead of current errors will be
added in a later PR.
t-bast added a commit to ACINQ/eclair that referenced this pull request Jul 1, 2021
Add support for logging warning messages as introduced in
lightning/bolts#834

Support for sending warning messages instead of current errors will be
added in a later PR.
t-bast added a commit to ACINQ/eclair that referenced this pull request Jul 6, 2021
lightning/bolts#834 recommends sending
warning messages instead of connection-level errors in some cases, which
avoids unnecessary channel closure.
t-bast added a commit to ACINQ/eclair that referenced this pull request Jul 6, 2021
lightning/bolts#834 recommends sending
warning messages instead of connection-level errors in some cases, which
avoids unnecessary channel closure.
Copy link
Collaborator

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I finally implemented this in eclair and tested with c-lightning, and we exchange warning messages correctly.

Just needs a rebase and it's an ACK for me!

t-bast added a commit to ACINQ/eclair that referenced this pull request Jul 6, 2021
lightning/bolts#834 recommends sending
warning messages instead of connection-level errors in some cases, which
avoids unnecessary channel closure.
@t-bast
Copy link
Collaborator

t-bast commented Sep 13, 2021

Needs rebase, otherwise ACK 4f2ec06

@Roasbeef
Copy link
Collaborator

@niftynei

In the general case, returning a warning to the caller with an explanation and re-starting the attempt is a robust solution; error coded response handling logic has a tendency to degrade robustness over time ime.

If the warning message has no context at all, then how is the caller meant to parse the message to restart the attempt? Totally disagree that error code response handling logic degrades over time, when was the last time you needed to redefine how you handle the HTTP 302 error code?

@t-bast
Copy link
Collaborator

t-bast commented Sep 13, 2021

If the warning message has no context at all, then how is the caller meant to parse the message to restart the attempt? Totally disagree that error code response handling logic degrades over time, when was the last time you needed to redefine how you handle the HTTP 302 error code?

I really don't understand what you mean.
Node operators have context about what they're doing, they know when they're opening or closing channels.
So it's really helpful for them to get something when it's rejected by the other side, and a string message is actually a very flexible way of doing this.

We've already found it very useful many times in the wild compared to "the remote peer just disconnected on me, so something must be wrong with what I sent them but I have no idea what".

@TheBlueMatt
Copy link
Collaborator

Wen rebase, lets land this :)

Copy link
Collaborator

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 5c745fd, let's land this!

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically there, want to come to come conclusion on all-0 errors.

01-messaging.md Outdated

#### Requirements

The channel is referred to by `channel_id`, unless `channel_id` is 0 (i.e. all bytes are 0), in which case it refers to all channels.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the last thing we need to resolve, I'm still highly unconvinced here - yes, some of the historical places all-0 channel ids have been used was nonsense, but the idea that a peer will have a channel with me and I don't want to talk to them anymore and I'm not able to tell them that seems really wrong? Implementing all the logic to stay connected to them, parse their messages for channel IDs, and then fail those channels seems substantially complicated, and probably not relevant in cases where you just want a peer to "go away". We can change how we communicate this if backwards compatibility is the concern, but it just seems very strange to me to not be able to communicate this at all (when we currently can).

- if it hasn't sent a `funding_locked` yet:
- MAY reply to a `shutdown` message with a `shutdown`
- once there are no outstanding updates on the peer, UNLESS it has already sent a `shutdown`:
- MUST reply to a `shutdown` message with a `shutdown`
- if both nodes advertised the `option_upfront_shutdown_script` feature, and the receiving node received a non-zero-length `shutdown_scriptpubkey` in `open_channel` or `accept_channel`, and that `shutdown_scriptpubkey` is not equal to `scriptpubkey`:
- MAY send a `warning`.
- MUST fail the connection.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we getting rid of the "fail the connection" terminology in this pr :)

- if the `scriptpubkey` is not in one of the above forms:
- SHOULD fail the connection.
- SHOULD send a `warning`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general (including here and the below warning conversion, but basically everywhere), we should probablyld probably have something that says "and ignore the message" or so. That would imply that the sender can re-send the shutdown with a different script (or whatever message in most contexts), which seems to be your goal here. Maybe it could be a general comment at the top of bolt 0 or 1 whatever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with extending the definition

@TheBlueMatt
Copy link
Collaborator

Successfully used this to debug an issue in a C-Lightning unmerged PR from LDK, so I'd call that cross-implementation-more-than-tested :)

@rustyrussell rustyrussell changed the title BOLT 1: introduce warning messages, reduce requirements to send (hard) errors, remove "all-channel" errors. BOLT 1: introduce warning messages, reduce requirements to send (hard) errors Dec 14, 2021
@rustyrussell
Copy link
Collaborator Author

Restored all-zero errors, and trivial rebase.

Copy link
Collaborator

@t-bast t-bast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 570d356, thanks! Let's finally land this!

Copy link
Contributor

@lightning-developer lightning-developer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK.

I pointed out a few stylistic things and asked a few clarification questions.

One thing that I realized towards the end there is often the phrase:

MUST send a warning and close the connection, or send an error and fail the channel.

I believe this is defined as things a node MUST do after sending warning or error so it seems very redundant to repeat this throughout the document.

@@ -132,6 +132,20 @@ See [BOLT #11: Invoice Protocol for Lightning Payments](11-payment-encoding.md)
* _See related: [closing transaction](#closing-transaction), [funding transaction](#funding-transaction), [penalty transaction](#penalty-transaction)_
* _See types: [revoked commitment transaction](#revoked-commitment-transaction)_

* #### *Fail the channel*:
* This is a forced close of the channel. Very early on (before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a contradiction to #942 ?

01-messaging.md Outdated
The 2-byte `len` field indicates the number of bytes in the immediately following field.
1. type: 1 (`warning`)
2. data:
* [`channel_id`:`channel_id`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification: This assumes / includes temporary_channel_id during funding of the channel?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

01-messaging.md Outdated
- SHOULD send `error` for protocol violations or internal errors that make channels unusable or that make further communication unusable.
- SHOULD send `error` with the unknown `channel_id` in reply to messages of type `32`-`255` related to unknown channels.
- when sending `error`:
- MUST fail the channel(s) referred to by the error message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: referred to by the error message (pending on: #941 )

the message has only one channel_id field. BOLT 1 says if the id is zero it refers to all channels between the peers. We ignore the situation where we have n channels and want to refer to m channels with 1 error?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd likely require multiple warning messages, one for each channel. Though it's hard to imagine a scenario where m < n and m > 1 channels need to close due to the same issue. The all-channel error was mostly intended to "there's something wrong but I can't identify the channel it's affecting" afaik, so "there's something wrong with one of these m channels" isn't much more useful IMHO.

01-messaging.md Outdated
- when sending `error`:
- MUST fail the channel(s) referred to by the error message.
- when sending `warning`:
- MAY set `channel_id` to all zero if the warning is not related to a specific channel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean the channel_id in warning and not the channel_id of the channels between the peers?

01-messaging.md Outdated
- MUST fail the channel(s) referred to by the error message.
- when sending `warning`:
- MAY set `channel_id` to all zero if the warning is not related to a specific channel.
- MAY close the connection after sending.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would that not hold for error? I understood error to be the case that that a force close should trigger. I see good reasons that after publishing the commitment transaction a peer might want to close the connection.

01-messaging.md Outdated
- if no existing channel is referred to by the message:
- MUST fail the channel referred to by `channel_id`, if that channel is with the sending node.
- upon receiving `warning`:
- SHOULD log the message for later diagnosis.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "log" mean? Keep in memory? Write to a log file? (IIRC we don't sepc logging in the bolts)

@@ -484,7 +486,7 @@ A non-funding node (fundee):
transaction after a timeout of 2016 blocks.

From the point of waiting for `funding_locked` onward, either node MAY
fail the channel if it does not receive a required response from the
send an `error` and fail the channel if it does not receive a required response from the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a contradiction here to add that it MAY fail the channel as for what a node MUST do after sending error is to fail the channel (BOLT 01 line 341)

- if the `scriptpubkey` is not in one of the above forms:
- SHOULD fail the connection.
- SHOULD send a `warning`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with extending the definition

@@ -647,7 +650,8 @@ The sending node:
The receiving node:
- if the `signature` is not valid for either variant of closing transaction
specified in [BOLT #3](03-transactions.md#closing-transaction) OR non-compliant with LOW-S-standard rule<sup>[LOWS](https://github.com/bitcoin/bitcoin/pull/6769)</sup>:
- MUST fail the connection.
- MUST send a `warning` and close the connection, or send an
`error` and fail the channel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant "fail the channel"

@@ -673,7 +677,8 @@ The receiving node:
- MUST propose a `fee_satoshis` in the overlap between received and (about-to-be) sent `fee_range`.
- otherwise, if `fee_satoshis` is not strictly between its last-sent `fee_satoshis`
and its previously-received `fee_satoshis`, UNLESS it has since reconnected:
- SHOULD fail the connection.
- SHOULD send a `warning` and close the connection, or send an
`error` and fail the channel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant "fail the channel"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that wording comes several times I stop pointing this out

01-messaging.md Outdated
- SHOULD send `error` for protocol violations or internal errors that make channels unusable or that make further communication unusable.
- SHOULD send `error` with the unknown `channel_id` in reply to messages of type `32`-`255` related to unknown channels.
- when sending `error`:
- MUST fail the channel(s) referred to by the error message.
- when sending `warning`:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were restoring all-0s? Looks like its still removed here, and further down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants