[bug]: commitment transaction dips peer below chan reserve: internal error causing eclair to FC #7657
Here is the log from my side:
I just see that my node received an "internal error" from my peer and then force closed.

@Crypt-iQ Looks like another case of a link sending out "internal error" causing a force close.
@DerEwige could you check in your logs which was the latest. According to the spec your SENDING node should:

The RECEIVING node followed the spec in a way:

Definitely worth checking why eclair added the HTLC although it would force the peer to go below reserve. Whether lnd sends a warning and closes the connection, or sends an internal error, should also be reevaluated.
Unfortunately, I have most of my eclair running on log level “WARN”. This is due to the excessive amount of logs my rebalancing produces. So I don’t have any additional logs from my side. I can only see from the logs of my plugin that I tried to send out a payment through this channel:
This case applies if the sender falls below the reserve.

This one applies: my node should not have sent the HTLC. Somehow the spec does not say anything about how the receiver should act if accepting would dip them below the reserve? Did I miss that part?
From my perspective this is the same: when you add an HTLC to the remote commitment (sending an update_add_htlc), you as the sender need to make sure that you evaluate the transaction limits beforehand, meaning that adding this new output in the current fee environment should not let your peer fall below reserve. Your peer will just receive the amount and try to add it to the transaction at current fee limits; if that forces the peer to fall below its reserve, it should fail this attempt (this is also what happened).
Every channel has a synchronized feerate in sat/kweight which only the initiator can change via the update_fee message. So the feerate was definitely the same, but why eclair and lnd diverged in the final calculation of the transaction is strange; maybe some rounding issue. That's why I asked you to supply the amount, so that one could trace back whether the boundaries were breached or not, but without logs that is very hard to do.
From the spec PoV, the channel basically can't continue at this point. Similarly, if someone sends us an invalid pre-image, that's also an invalid action and the channel can't continue. This might ultimately be an interoperability issue: does the HTLC eclair sent actually dip below the reserve? If so, can we repro that exact scenario to dig into the accounting between the implementations? Is it an msat rounding thing?
I had a second force close with the same issue.

peer logs:

my logs:

I will also open an issue with eclair and link the two issues.
This is the important question. Eclair does check reserve requirements (for sender and receiver) before sending an HTLC: https://github.com/ACINQ/eclair/blob/77b333731f618395f33d69d1fe0aba2c86cdba58/eclair-core/src/main/scala/fr/acinq/eclair/channel/Commitments.scala#L440 @DerEwige your node has all the information we need to debug this. Can you:
We should be able to create a unit test based on that data, on both eclair and lnd, to see why the reserve calculation doesn't match.
channel_fc1.log
Unfortunately I run this logback config:
So the HTLC is not logged.

Thanks, that should be enough to replay the latest state and figure out why there is a mismatch between eclair and lnd. I'll investigate that state and share the contents of a unit test to try in both eclair and lnd to figure out where our balance calculation diverges.
I have been able to reproduce the flow of messages that leads to the two channel states shared by @DerEwige. First of all, an HTLC is added and cross-signed:

    Eclair                         LND
      |                             |
      | update_add_htlc             | htlc_id = 1_477_859
      |---------------------------->|
      | commit_sig                  |
      |---------------------------->|
      |              revoke_and_ack |
      |<----------------------------|
      |                  commit_sig |
      |<----------------------------|
      | revoke_and_ack              |
      |---------------------------->|

At that point, both commitments are synchronized and contain only that HTLC. Then:

    Eclair                         LND
      |                             |
      | update_add_htlc             | htlc_id = 1_477_861
      |---------------------------->|
      |             update_add_htlc | htlc_id = 120_896
      |<----------------------------|
      | commit_sig       commit_sig |
      |-----------+     +-----------|
      |            \   /            |
      |             \ /             |
      |             / \             |
      |            /   \            |
      |<----------+     +---------->|
      |              revoke_and_ack |
      |<----------------------------|
      | revoke_and_ack              |
      |---------------------------->|
      | commit_sig                  |
      |---------------------------->|
      |                       error |
      |<----------------------------|

Here are the details of the commitments at the point where lnd sends its error:
You can check that in the files shared by @DerEwige. The important thing to note is that there is nothing that makes either peer dip into their reserve.
It seems like lnd sends this error whenever something wrong happens on their side, regardless of whether the channel actually needs to be closed. We ignore it to avoid paying the cost of a channel force-close, it's up to them to broadcast their commitment if they wish. See lightningnetwork/lnd#7657 for example.
@TrezorHannes Could you provide more logs so we can understand what happened during those 90 seconds?

@yyforyongyu did you see the attached log in the first post? I'm afraid I don't have more, my debug level is info.

@TrezorHannes yep, that's where I got those two lines. Need more info to debug the issue.

@yyforyongyu I'm afraid I shared all I had. But perhaps @t-bast can help, who was able to reproduce the issue in a test environment.

I've already shared everything in my previous comment. This confirmed that this issue seems to be on the lnd side, I don't have much else I can add...

@t-bast thanks for that thorough analysis! We'll def prio this and look into what's going on.

I also had 2 more force closes due to this problem. Here is the log of one of the force closes.
@t-bast I was verifying your numbers and figured out that you kinda neglected the commitment fees the lnd node has to pay:

capacity is 3M sats: 34_367_703 + 2_756_274_986 + 117_941_049 + 80_040_806 + 11_375_456 = 3_000_000_000 msat

additional commitment fees: (3 * 173 + 1124) * 2500 sat/kweight = 4107.5 sats in fees, plus 660 sats (2 anchor values) = 4767.5 sats

Taking these costs into account, the local balance falls below the reserve:

to_local: 34_367_703 msat - 4_767_500 msat = 29_600_203 msat (which is below the reserve)
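The arithmetic above can be sketched as follows. The base weight (1124 wu), per-HTLC weight (173 wu), anchor value (330 sats each), and the 1% reserve are taken from this thread or are common defaults; they are assumptions, not values verified against either implementation's code.

```python
# Sketch of the reserve check from the comment above. Weights and the 1%
# reserve are assumptions taken from this thread, not from lnd/eclair code.

FEERATE = 2_500        # sat/kweight, last value set via update_fee
BASE_WEIGHT = 1_124    # anchor commitment base weight (wu)
HTLC_WEIGHT = 173      # per-HTLC weight used in the thread (wu)
ANCHOR_SATS = 330      # value of each of the two anchor outputs

def commit_cost_msat(num_htlcs: int) -> int:
    """Commitment fee plus anchor values, in msat, paid by the funder."""
    weight = BASE_WEIGHT + num_htlcs * HTLC_WEIGHT
    fee_msat = weight * FEERATE            # wu * sat/kweight == msat
    return fee_msat + 2 * ANCHOR_SATS * 1_000

to_local_msat = 34_367_703                 # funder balance before fees
reserve_msat = 3_000_000 * 1_000 // 100    # 1% of the 3M-sat capacity

remaining = to_local_msat - commit_cost_msat(3)
print(remaining)                  # 29600203 msat
print(remaining < reserve_msat)   # True: below the 30_000_000 msat reserve
```

With three HTLCs on the commitment, the fees plus anchors come to 4767.5 sats, leaving the funder's to_local about 400k msat short of the reserve, matching the numbers above.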
Actually, I think one cannot guarantee these situations will never happen. The problem is that the lnd node does not yet know, before adding its own HTLC, that the remote is adding one (an additional 173 wu * 2500 sat/kweight = 432_500 msat), which is exactly the difference that dips us below the reserve. The same happens for eclair when adding their output: they don't yet know that we are adding an incoming HTLC to them.
As a potential fix I would add a safety net for the peer who opened the channel: let's say we always keep a buffer of 10 * incoming-HTLC weight (173 wu) * commitment feerate (2500 sat/kweight) = 4325 sats (in this example, with a feerate of 2500) as additional buffer, so that we would not fall below reserve even if the peer added 10 incoming HTLCs in one revocation window.
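A minimal sketch of that safety net, using the thread's numbers; the function names and the pre-send check are hypothetical illustrations, not lnd's actual API.

```python
# Hypothetical safety-net check sketched from the proposal above: the channel
# funder keeps fee headroom for N concurrent incoming HTLCs. Names are
# illustrative, not lnd's real API.

HTLC_WEIGHT = 173   # wu per incoming HTLC (value used in this thread)
FEERATE = 2_500     # sat/kweight

def fee_buffer_msat(n_htlcs: int = 10) -> int:
    """Extra commit fees we must be able to absorb if the peer adds n_htlcs."""
    return n_htlcs * HTLC_WEIGHT * FEERATE   # wu * sat/kweight == msat

def ok_to_send(local_after_add_msat: int, reserve_msat: int) -> bool:
    # Only send the HTLC if we stay above reserve even after the buffer
    # is fully consumed by concurrent peer adds.
    return local_after_add_msat - fee_buffer_msat() >= reserve_msat

print(fee_buffer_msat())  # 4325000 msat, i.e. the 4325 sats from the proposal
```

The check simply refuses to send an outgoing HTLC whose resulting balance could not survive ten concurrent incoming adds at the current feerate.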
Nice, very good catch @ziggie1984, that explains it and the numbers match.
Right, in that case we cannot completely prevent this with the current commitment update protocol, since eclair and lnd concurrently add an HTLC, and each have to take into account the remote peer's HTLC (there's no way to "unadd") which forces dipping into the reserve. If we actually allowed dipping into the reserve here, that could be abused: instead of 1 concurrent HTLC in each direction, one of the peers could send many of them to dip arbitrarily low into the reserve, which creates bad incentives.
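A toy model of that race, with made-up balances: each peer's reserve check passes in isolation, but the combined commitment carries one extra HTLC output the funder could not anticipate, and its fee pushes the funder below reserve.

```python
# Toy model of the concurrent-add race described above. The balances are
# illustrative; the per-HTLC fee matches the 432_500 msat from this thread.

FEERATE = 2_500                       # sat/kweight
HTLC_WEIGHT = 173                     # wu per HTLC output
RESERVE_MSAT = 30_000_000
fee_per_htlc = HTLC_WEIGHT * FEERATE  # 432_500 msat of extra commit fee

# Funder balance after paying for its own outgoing add: still above reserve,
# so its pre-send check passes.
after_own_add = RESERVE_MSAT + 200_000

# The peer's concurrent add lands on the same commitment and costs the
# funder another fee_per_htlc, which neither side saw coming when they
# ran their own checks.
after_both = after_own_add - fee_per_htlc

print(after_own_add >= RESERVE_MSAT)  # True: each check passed in isolation
print(after_both >= RESERVE_MSAT)     # False: combined state violates reserve
```

Since both adds are already signed and the previous states revoked, neither side can "unadd", which is exactly why the protocol cannot fully prevent this without a buffer.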
I'm surprised you don't already do that as part of lightning/bolts#919? I thought every implementation had put this mitigation in place 2 years ago (otherwise lnd is potentially at risk of the kind of dust attack described in that PR). Eclair has that kind of mitigation, which should have avoided this situation if eclair had been the channel funder. I guess this issue is another good argument for lightning/bolts#867, which will remove all of those edge cases 😅

So we have a potential safety net here, but as you mentioned above, for concurrent updates this is not good enough, so I propose the fix above: always count a certain number of incoming HTLCs as a buffer. https://github.com/lightningnetwork/lnd/blob/master/lnwallet/channel.go#L5378-L5391

What I am wondering though: why is eclair signing this breach of the contract? It must also be known to eclair that the remote commitment dips below reserve; do you actually log this, even as a warning?

Hmm, we have a dust exposure check in place, but in the code base it only counts dust HTLCs.

Thanks for the hint, I need to look at this proposal. Sounds promising, hopefully not decreasing the speed too much :)
Eclair would indeed have complained at a later step and sent an
Sorry, that was the wrong "additional reserve", I meant this one: lightning/bolts#740

Could you maybe point me to the code part where eclair defines the additional buffer, just to get a feeling for the size of it :)

Sure, you can find that code here: https://github.com/ACINQ/eclair/blob/3547f87f664c5e956a6d13af530a3d1cb6fc1052/eclair-core/src/main/scala/fr/acinq/eclair/channel/Commitments.scala#L435 This function is called when trying to send an outgoing HTLC: if that HTLC violates some limit (such as this additional buffer), we simply don't send it (and try another channel with the same peer, or, if none works, propagate the failure to the corresponding upstream channel).
Basically eclair (I think also CLN) has a fee buffer of 2x the current commitment fee. (The additional-HTLC consideration will be consumed by the HTLC that will be sent out by the current payment request, so I do not count it here; we already do this.) Every new incoming HTLC adds approx. 20% in fees, so the buffer of 200% (2x the current commitment fee) buys us roughly 10 HTLC slots for concurrent HTLCs. This buffer increases when HTLCs are on the channel (the commitment tx is bigger), which acts as a nice HTLC limiter when our balance is very low. I like this approach as well and would implement the additional fee buffer, let me know what you guys think 👻 This would basically implement lightning/bolts#740.
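A rough sketch of that fee-spike-buffer idea (along the lines of lightning/bolts#740). The 2x multiplier follows the comment above, and the weights are this thread's working numbers; eclair's actual constants and formula may differ.

```python
# Sketch of a 2x-commitment-fee buffer as described above. Weights are the
# values used in this thread; the real implementations may use different
# constants (e.g. eclair's exact commitment weight formula).

HTLC_WEIGHT = 173     # wu per HTLC output
BASE_WEIGHT = 1_124   # anchor commitment base weight (wu)

def commit_fee_msat(num_htlcs: int, feerate: int) -> int:
    """Commitment fee in msat at the given feerate (sat/kweight)."""
    return (BASE_WEIGHT + num_htlcs * HTLC_WEIGHT) * feerate

def fee_spike_buffer_msat(num_htlcs: int, feerate: int) -> int:
    # Keep 2x the current commitment fee in reserve: as HTLCs pile onto
    # the commitment the buffer grows, naturally throttling new HTLCs
    # when the balance runs low.
    return 2 * commit_fee_msat(num_htlcs, feerate)

buffer = fee_spike_buffer_msat(num_htlcs=1, feerate=2_500)
slots = buffer // (HTLC_WEIGHT * 2_500)   # concurrent adds it can absorb
print(buffer, slots)
```

Note that the number of slots the buffer buys depends on the base weight assumed here, so it will not exactly match the "10 slots" figure above.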
Indeed, LND is missing the fee spike buffer (see #7721). We should implement that. The fee spike buffer helps, but IIUC it doesn't completely prevent concurrent HTLC problems. But even if an HTLC race occurs, why do we need to error and force close? If the next commitment would violate the channel reserve, can't we just disconnect to drop the proposed updates? Then on reconnect we can do exponential backoff to prevent the race from reoccurring indefinitely.

@morehouse hehe, I came here to link that very issue. AFAICT, it's an edge case where we send an HTLC (as the initiator) and can pay for that HTLC, but then the remote party sends another HTLC concurrently (they haven't seen ours yet); once that HTLC is ACK'd and a sig is ready to go out, we fail, as adding it would dip us below the reserve.

Right. So we could just disconnect to get out of that bad state (we should also send a warning). No need to send an error and cause a force close. Edit: Never mind, I think I see the problem now. The two commitments contain different HTLCs that are OK on their own, but combined they cause the channel reserve to be violated. And we can't go back to the previous state because it's already been revoked.
I think this race could still be detected and avoided without Rusty's simplified commitment update. When we receive a |
I think the issue is we don't know what to include in |
The case we've covered in

The case we should cover in

We should add a method

The case cannot be fixed atm,

There are a few mitigations for this case,
Hmm, I am not sure how this is possible. When we send an HTLC to the peer which they haven't acked, that does not mean we can forget about it (the signature is out and we need a revocation for it). They will eventually revoke_and_ack their state including the HTLCs we offered them, forcing us to include them in our commitment as well when verifying their signature (because they anticipate that we have it on our local commitment as well once they acked our offered HTLCs).

What do you mean by this? That we stop using the channel?

With the current protocol, where each node immediately (50ms) sends out a commit_sig after sending out the update_add_htlc, we would also make the channel unusable, because we need a revocation from the peer. So I think the fee buffer is our best shot at this problem? We just keep slots free for the peer to add some HTLCs, in case it happens concurrently?
Yes, but then this would be the third case. |
Just to add a little bit more color:
There are probably a bunch of things we could do here -- we could be more strict in |
Background

For an unknown reason (possibly a race condition), my lnd instance identified that a downstream HTLC commitment would have caused a channel peer balance to go negative, and below our channel reserve for that channel. That caused my lnd to send my peer an internal error. @DerEwige is running Eclair, which immediately goes on-chain when receiving an internal error.

Two things to validate:
- Why an internal error and not just a failing HTLC?

Your environment

- lnd: lnd version 0.16.2-beta commit=v0.16.2-beta
- uname -a (on *nix): Linux debian-nuc 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
- btcd, bitcoind, or other backend: Bitcoin Core version v24.0.1

Steps to reproduce

Log excerpt (full log grep of channel-id and pubkey attached)

Expected behaviour

Proposal above, details better evaluated by folks who're more accustomed to cross-implementation error relays.

Actual behaviour

Sending an internal-error causing eclair to force-close.

speedupln-fc.log