Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attributable errors #60

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

joostjager
Copy link
Contributor

@joostjager joostjager commented Nov 10, 2022

Error attribution is important to properly penalize nodes after a payment failure occurs. The goal of the penalty is to give the next attempt a better chance at succeeding. In the happy failure flow, the sender is able to determine the origin of the failure and penalizes a single node or pair of nodes.

Unfortunately it is possible for nodes on the route to hide themselves. If they return random data as the failure message, the sender won't know where the failure happened.

This PR updates the failure encryption mechanism so that the failure source can always be determined.

References:

crypto.go Outdated Show resolved Hide resolved
fat_error_decrypt.go Outdated Show resolved Hide resolved
fat_error_decrypt.go Outdated Show resolved Hide resolved
@joostjager joostjager force-pushed the fat-errors branch 3 times, most recently from a8fce5c to 444400b Compare February 8, 2023 12:18
@joostjager joostjager changed the title fat errors Attributable errors Feb 8, 2023
)

type AttributableErrorStructure struct {
HopCount int
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are u16,u8 etc in the BOLT, maybe use the same here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used ints here to prevent casting to a wider data type in the various calculations that use these values. I think you could argue that the byte that is specified in the bolt is a limitation on the wire level and that the code in this repo doesn't need to follow that limitation necessarily?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to have the types 1:1. How bad would the casting be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the casting overhead would be negligible in terms of affected code sites. Inheriting the proper int types here aides in readability IMO.

Also in most cases, you don't really need to cast for comparisons, as you can make generic funcs that use constraints. Integer or w/e.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also missing godoc comments here.

Is this an error struct (actually error type), or something else entirely? In either case, adding "Structure" to the end seems superflous.

Copy link
Contributor Author

@joostjager joostjager May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not an error struct, but a description of the actual structure of the onion failure. I think the addition Structure is appropriate in this case. Added godoc to explain.

Will leave this thread open and await movement on the spec side with regards to the data types. In the spec PR I chose byte for now, but perhaps we might go for something even more compact with just a few options for the structure (enum-like).

attributable_error_crypto.go Outdated Show resolved Hide resolved
attributable_error_crypto.go Outdated Show resolved Hide resolved
attributable_error_crypto.go Outdated Show resolved Hide resolved
fat_error_decrypt.go Outdated Show resolved Hide resolved
@Roasbeef
Copy link
Member

cc @bitromortac

@joostjager joostjager force-pushed the fat-errors branch 2 times, most recently from 52a8a08 to fccd040 Compare April 24, 2023 18:30
@joostjager joostjager force-pushed the fat-errors branch 2 times, most recently from 9afb8a0 to c0682ea Compare April 24, 2023 18:40
@lightninglabs-deploy
Copy link

@joostjager, remember to re-request review from reviewers when ready

@joostjager
Copy link
Contributor Author

!lightninglabs-deploy mute

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An initial pass to build up my mental model. Will sit with the spec for a bit before doing another pass here.

crypto.go Outdated Show resolved Hide resolved
}

// generateSharedSecret generates the shared secret by given ephemeral key.
func (r *Router) generateSharedSecret(dhKey *btcec.PublicKey) (Hash256, error) {
func (r *Router) GenerateSharedSecret(dhKey *btcec.PublicKey) (Hash256, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Godoc above not updated.

Is this just so the implemented interface only has public methods, or that this method is to be used elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Godoc updated.

The method is used in #7139. Access is needed to the shared secret so that the type of onion encryptor can be 'upgraded' after reading the error structure parameters from the onion tlv.

)

type AttributableErrorStructure struct {
HopCount int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the casting overhead would be negligible in terms of affected code sites. Inheriting the proper int types here aides in readability IMO.

Also in most cases, you don't really need to cast for comparisons, as you can make generic funcs that use constraints. Integer or w/e.

)

type AttributableErrorStructure struct {
HopCount int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also missing godoc comments here.

Is this an error struct (actually error type), or something else entirely? In either case, adding "Structure" to the end seems superflous.

}

func newAttributableErrorBase(
structure *AttributableErrorStructure) attributableErrorBase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onionError instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the legacy error an onion-routed error too?

attributable_error_crypto.go Outdated Show resolved Hide resolved
attributable_error_crypto.go Outdated Show resolved Hide resolved
hash := hmac.New(sha256.New, umKey[:])

// Include message.
_, _ = hash.Write(message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually an encrypted message right? So the ciphertext and not the plaintext.

If not, then we need to revise the proposal, as encrypt-then-mac is a must as we want to ensure integrity of the ciphertext. Otherwise the MAC may also give some information about the plaintext itself.

Might just have an out of date mental model tho....decided to take a look at the code before the spec to reaload some context.

Copy link
Contributor Author

@joostjager joostjager May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is hmac-then-encrypt. Legacy failures are also hmac-then-encrypt - at least for the error source because the other nodes only encrypt.

If the complete message including hmacs is encrypted as the final step before passing it back to the upstream node, how can information about the plaintext leak?

var hmacIdx = maxHops + position

// Iterate over all downstream nodes.
for j := 0; j < maxHops-position-1; j++ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the inclusion of positional/index information here confusing: how will a forwarding node actually derive its position? If this is given to it by the sender, ofc that breaks w/e privacy we may have w/ the current scheme.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do not need to derive their position. And this information also isn't given to them by the sender, because that would indeed break privacy.

The core idea of this proposal is to iterate over every possible position that the node may have in the path, and add a corresponding hmac for each of those positions.

// Extract the payload and exit with a nil message if it is invalid.
source, payload, err := o.extractPayload(payloads)
if sender == 0 {
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don't handle the error at all if the sender is known?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling consists of returning the current position to the caller.

Some invalid data was encountered because we couldn't extract the payload. This means that there is no actual message to return. With the position information, the caller can apply a selective penalty.

Allow for more flexible usage of the error encrypter. This is useful when upgrading an
existing legacy error encrypter to fat errors in lnd.
@joostjager joostjager force-pushed the fat-errors branch 3 times, most recently from e244f62 to 6649a30 Compare May 23, 2023 13:39
@joostjager joostjager requested a review from Roasbeef May 23, 2023 13:43
Copy link

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, quite a smart way to approach the problem!

Having HopCount is probably not ideal because it could be used for fingerprinting, but this ensures to have enough space for future extensions, right? This brings me to the question whether it's worthwhile to stay below the TCP MTU limit for latency reasons (similarly as for the onion size in update_add_htlc)?

If we have b bits (say 8000) available for the HMACs and security of s bits in the HMACs (HMAC chopped off after s bits, as you suggested out of band), we would get

s = b / (HopCount * (HopCount + 1) / 2)

HopCount |   s
--------------
27       |  21
20       |  38
18       |  47
16       |  59
14       |  76
12       | 102

For comparison, the Bitcoin network has a hash rate that manages s~80 in 10 min. In practice anything higher than that seems to be very high. The downside of having a fixed total message size below b is that the number of total hops is limited.


// AttributableErrorStructure contains the parameters that define the structure
// of the error message that is passed back.
type AttributableErrorStructure struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we would manage to reduce the size of the total error, would it be desirable to get rid of the HopCount as it may add a fingerprint vector? Together with FixedPayloadLen, this is mostly to allow flexibility concerning future usage like nodes communicating back different metadata?

Copy link
Contributor Author

@joostjager joostjager Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, FixedPayloadLen is to allow future flexibility without increasing the failure message until it is needed.

HopCount is slightly different. It gives senders some control over the failure message size, at the cost of limiting the maximum route length. If the size of the total error can be reduced so that HopCount can be hard-coded to the max route length of 27, that would be great. On the other hand, perhaps long routes are so uncommon that a lower constant could be acceptable too.

attributable_error_crypto.go Outdated Show resolved Hide resolved
_, _ = hash.Write(message)

// Include payloads including our own.
_, _ = hash.Write(payloads[:(o.maxHops()-position)*o.payloadLen()])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify the position parameter? Currently the comment states that this is the distance to the error source. If position=0, i.e, we are the erring node, we would only need to include our own payload, is it? If so, I would expect payloads[:(position+1)*o.payloadLen()].

Copy link
Contributor Author

@joostjager joostjager Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this parameter is indeed inverted. Updated the code to match the comment. In the spec pr, I also define position relative to the error source.


// Ensure the error message length is enough to contain the payloads and
// hmacs blocks. Otherwise blame the first hop.
if len(encryptedData) < minPaddedOnionErrorLength+o.hmacsAndPayloadsLen() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would an intermediate node react to another node forwarding a short message (probably out of scope for this PR)?

Copy link
Contributor Author

@joostjager joostjager Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good question.

In master what happens is that a random failure message is made up if the length is below the (legacy) minimum:
https://github.com/lightningnetwork/lnd/blob/e549096b8837a95e0515c8c63fa01c634714d913/htlcswitch/link.go#LL1889C2-L1889C2

For attributable errors though, it may pass this check. Then in the intermediate EncryptError, an error is returned.

I think currently in lightningnetwork/lnd#7139, the intermediate node ignores the error and passes back the failure message as is. This allows an attacker to get away with it.

What should happen instead is that the intermediate node creates a new, properly formatted failure to attribute blame to the connection between itself and its downstream peer. Added comment https://github.com/lightningnetwork/lnd/pull/7139/files#r1228037240

attributable_error_decrypt.go Outdated Show resolved Hide resolved
Comment on lines 166 to 176
// Clear first hmac slot. This slot is for the position farthest away
// from the error source. Because we are shifting, this cannot be
// relevant.
copy(hmacs[destIdx*sha256.Size:], zeroHMAC[:])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't convince myself whether clearing is needed.

Copy link
Contributor Author

@joostjager joostjager Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never be reading this field when decrypting the error. I left this in as a 'sanity clear'...

@@ -0,0 +1,25 @@
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to add a comment to explain the test, for example which payloads were used or what parameters were used? Payloads could also be added to the test vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is taken straight from the spec PR where the parameters are explained. I will do more polishing after we've decided that attr errors is going to happen and what variation of it exactly.

@joostjager
Copy link
Contributor Author

joostjager commented Jun 13, 2023

If we have b bits (say 8000) available for the HMACs and security of s bits in the HMACs (HMAC chopped off after s bits, as you suggested out of band)

Yes, I think this would be a useful optimization. You could argue that attributing failure isn't the most critical thing in lightning. Even with a relatively high chance of an attacker forging an hmac that is considered valid (let's say 21 bits, 1 in 2,000,000), game theory is still very much against them. If they guess wrong, they'll be penalized and will have to wait for their reputation to be restored before they'll receive new traffic from that sender.

@joostjager
Copy link
Contributor Author

Feedback from lnd dev call:

  • Hard-code max_hops to 27 to avoid fingerprinting. Perhaps it could be one or two hops less also because the max hops goes down when we signal attributable errors.
  • Reduce hmac bits to keep failure message reasonably small (bits tbd, maybe divisible by 8, maybe 32 bits?)

For the payload that each node adds, we could for now just go for the minimum size. No tlv also. Especially because we don't even have concrete ideas what to add besides the timestamp. If this changes in the future, we can invent another signal in the forward onion.

This would eliminate all parameters from the forward onion and leave just a zero-length tlv record to signal sender support for attributable errors.

@joostjager joostjager force-pushed the fat-errors branch 2 times, most recently from 9b2478b to 74fcef9 Compare June 16, 2023 10:53
@joostjager
Copy link
Contributor Author

Added a parameter in the attributable error structure struct for the size of the hmac.

The other changes mentioned above can happen in the lnd pr.

Copy link

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉! I compared the left/right-shift/HMAC operations to my notes and they are identical. Great to have the attributable errors completely parametrized such that we can define them for the protocol.


return &DecryptedAttrError{
DecryptedError: DecryptedError{
SenderIdx: 1,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to distinguish a malicious sender from the true sender of the error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we do that? If our peer returns a message that is too short, then they didn't do what do should have done. They should always return a message of the correct length that includes at least their valid hmacs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about distinguishing the happy path decoding (we can determine the true sender) from the unhappy one (isCorruptedSender bool), but I'm not sure how that would be used for penalization yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unhappy path is currently signaled through a nil DecryptedError.Message. But perhaps it should be more explicit? This is where I miss rust's expressiveness in data types.

attributable_error_crypto_test.go Show resolved Hide resolved
@joostjager joostjager force-pushed the fat-errors branch 2 times, most recently from fbef088 to 9530e97 Compare June 19, 2023 18:05
)
copy(dummySecret[:], bytes.Repeat([]byte{1}, 32))

// We'll iterate a constant amount of hops to ensure that we don't give
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the intention to guard against timing attacks by constant-hopping is admirable, it may be a case of over-engineering. Timing attacks can originate from numerous other sources, including differences in data processing times, network latency, and even hardware behavior. To truly protect against timing attacks, a comprehensive approach is needed that covers all potential sources, making constant-hopping only a part of the solution.

Moreover, adding complex measures such as this may introduce new, unforeseen security vulnerabilities or performance issues. It could also divert resources from tackling more likely and more dangerous threats. Therefore, a balanced, risk-based approach to security might be more appropriate than focusing too heavily on one specific, and potentially less probable, attack vector.

[So far for the AI]

In short, shall we just leave this out for the sake of simplicity?

@joostjager
Copy link
Contributor Author

Test vector updated to 20/4/4 structure in line with lightning/bolts#1044 (comment)

@Roasbeef Roasbeef removed their request for review November 11, 2024 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants