Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htlcswitch: send an error before marking a channel w/ local data loss #7312

Closed
wants to merge 1 commit into from

Conversation

Roasbeef
Copy link
Member

@Roasbeef Roasbeef commented Jan 12, 2023

Change Description

In this commit, we move to also send an error before we "fail" our channel when we realize that either we've lost data, or are recovering from some partial state.

This is the first half of the set of changes needed to fix: #6017. To start with, we'll ensure that the new lnd nodes always send the error once the receive a chan reest and determine that they're behind. In the second half of the fix, we'll start to hold off on force closing until the other party sends an error. We opt to do this in two parts as: there're many lnd nodes out there that still rely on the current behavior. Once we get another update cycle under our belt, we can start to catch up our behavior fully.

Steps to Test

Interop test with Eclair+CLN+LDK to ensure that the SCB flow still works.

The SCB flow in this PR should also still work with the existing lnd node versions as well.

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error paths) are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

Fixes #7301

In this commit, we move to also send an error before we "fail" our
channel when we realize that either we've lost data, or are recovering
from some partial state.
@Roasbeef Roasbeef added safety General label for issues/PRs related to the safety of using the software spec interop interop with other implementations SCB Related to static channel backup labels Jan 12, 2023
@yyforyongyu yyforyongyu self-requested a review January 12, 2023 11:44
// continue.
//
// TODO(roasbeef): use reliable sender here?
l.sendFatalError()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be before the call to MarkDataLoss? I don't believe MarkBorked is called in peer/brontide.go for this since PermanentFailure will be false. It also seems like an Error is already sent since ErrRecoveryError is used in the call to l.fail(...) below. Is the issue here that the p2p connection may go down after MarkDataLoss, but before we eventually call ShouldSendToPeer in handleLinkFailure?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I actually had this initially, but the rationale was that by the time we send the Error they might have up already, which would cause some of the logic below to short circuit.

Is the issue here that the p2p connection may go down after MarkDataLoss, but before we eventually call ShouldSendToPeer in handleLinkFailure?

My understanding going into this change was that other implementations started to mostly ignore what we send for chan reest, even if it's all very invalid. So we now need to send the Error as that's the only thing they'll force close off of.

IIUC, on restart, we'll also repeat this loop all over again, w/ the MarkDataLoss then being a noop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracing the pipeline a bit more, you're right that we'll eventually call handleLinkFailure in, which'll then fail since it's sending out ErrRecoveryError.

@Roasbeef
Copy link
Member Author

Ok, based on the above comment, this PR is unnecessary: we always send the error when we detect that we're behind and go into SCB or recovery mode. So then an open question is: what behavioral change did the other implementations make that caused the SCB flow to no longer work for some users.

@Roasbeef Roasbeef closed this Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interop interop with other implementations safety General label for issues/PRs related to the safety of using the software SCB Related to static channel backup spec
Projects
None yet
2 participants