Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: Remove panics on failure to send IBC packets #876

Merged
merged 13 commits into from
May 22, 2023
12 changes: 8 additions & 4 deletions x/ccv/consumer/keeper/relay.go
Original file line number Diff line number Diff line change
Expand Up @@ -196,13 +196,17 @@ func (k Keeper) SendPackets(ctx sdk.Context) {
)
if err != nil {
if clienttypes.ErrClientNotActive.Is(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for if clienttypes.ErrClientNotActive.Is(err) is no longer needed since behavior is the same whether that statement is true or not. The type of error will be printed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior is a bit different. For expired clients we log an info message. For other errors, we log an error as something is wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But "error" is in the name, ErrClientNotActive. I see the logging behavior is slightly different for that error, but imo a semantic error should be logged as an error. Which also makes the code more simple/readable

k.Logger(ctx).Debug("IBC client is inactive, packet remains in queue", "type", p.Type.String())
// IBC client is expired!
// leave the packet data stored to be sent once the client is upgraded
k.Logger(ctx).Info("IBC client is expired, cannot send IBC packet; leaving packet data stored:", "type", p.Type.String())
return
}
// something went wrong when sending the packet
// TODO do not panic if the send fails
panic(fmt.Errorf("packet could not be sent over IBC: %w", err))
// Not able to send packet over IBC!
// Leave the packet data stored for the sent to be retried in the next block.
// Note that if VSCMaturedPackets are not sent for long enough, the provider
MSalopek marked this conversation as resolved.
Show resolved Hide resolved
// will remove the consumer anyway.
k.Logger(ctx).Error("cannot send IBC packet; leaving packet data stored:", "type", p.Type.String(), "err", err.Error())
return
}
}

Expand Down
13 changes: 9 additions & 4 deletions x/ccv/provider/keeper/relay.go
Original file line number Diff line number Diff line change
Expand Up @@ -194,12 +194,17 @@ func (k Keeper) SendVSCPacketsToChain(ctx sdk.Context, chainID, channelID string
// IBC client is expired!
// leave the packet data stored to be sent once the client is upgraded
// the client cannot expire during iteration (in the middle of a block)
k.Logger(ctx).Debug("IBC client is expired, cannot send VSC, leaving packet data stored:", "chainID", chainID, "vscid", data.ValsetUpdateId)
k.Logger(ctx).Info("IBC client is expired, cannot send VSC, leaving packet data stored:", "chainID", chainID, "vscid", data.ValsetUpdateId)
return
}
// TODO do not panic if the send fails
// https://github.com/cosmos/interchain-security/issues/649
panic(fmt.Errorf("packet could not be sent over IBC: %w", err))
// Not able to send packet over IBC!
k.Logger(ctx).Error("cannot send VSC, removing consumer:", "chainID", chainID, "vscid", data.ValsetUpdateId, "err", err.Error())
// If this happens, most likely the consumer is malicious; remove it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpoke it's not intuitive to me why a failed IBC send would always indicate a malicious consumer. The error being returned could be caused by corrupted IBC state on the provider, where stoping a consumer could make such a scenario worse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my analysis on the reasons for ccv.SendIBCPacket to return error on the provider side:

  • Channel not found – not possible as SendIBCPacket is called only for consumer with an established CCV channel and channels are not removed from the IBC module
  • Module does not own channel capability – not possible as the capability was already claimed in OnChanOpenTry, before the CCV channel was marked as established
  • The next sequence send is not found – if that happens, something is wrong with the IBC module store
  • The packet sent failed basic validation – if that happens, something is wrong with the code on the provider chain as the packet is constructed in SendIBCPacket
  • The channel’s state is CLOSED – an opened channel can be closed only by one of the following functions
    • channelKeeper.ChanCloseInit, which is called on the provider only when the consumer chain is removed (note that OnChanCloseInit returns an error, thus user-initiated channel closing are not allowed)
    • channelKeeper.ChanCloseConfirm, which entails that the consumer endpoint of the channel is CLOSED; this can happen only if the consumer receives a negative ack for a packet sent to the provider
      • Attack: A malicious consumer could send an invalid SlashPacket that fails ValidateSlashPacket, this would lead to a nack sent to the consumer, which results in channelKeeper.ChanCloseInit being called on the consumer. Finally, this would enable a relayer to send a MsgChannelCloseConfirm to the provider, which would move the channel’s state to CLOSED without updating the ChainToChannel state.
    • channelKeeper.TimeoutExecuted, which means that OnTimeoutPacket is first executed on the provider, which results in the consumer chain being removed (i.e., the ChainToChannel state is being updated).
  • The packet’s destination channelD and portID do not match the ones of the counterparty endpoint – not possible due to construction of the packet
  • There is no underlying connection – not possible as it was checked during the handshake
  • There is no underlying client state – not possible as it was checked during the handshake
  • The packet is already timed out on the consumer – not possible as the packet timeout is set in SendIBCPacket to the current block timestamp plus the timeout period and the latest known timestamp on the consumer cannot be larger than this packet timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more conservative, we could just log an error and leave the packets in the queue. Operators need to notice that VSCPackets are no longer being relayed, otherwise the unbondings in those packets will never complete. Note that the VSC timeout is set when the packet is sent, and not when it's queued.

So we run the risk of nobody noticing for a while, which will affect users. Once we notice, we need to either do an emergency upgrade where we migrate the state (difficult) or we submit a ConsumerRemoval proposal, which would take 2 extra weeks before delegators that unbonded can get their funds back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tough question. According to the untrusted consumer doctrine, we should not let consumers disrupt the provider. So that would lean towards just removing the consumer.

However, the unbonding pausing is somewhat of an exception that we allow, since it is limited by the VSC timeout. Since we already allow the consumer to disrupt the provider up to the VSC timeout just by not sending VSC acks, it would make sense to allow disruption up to the VSC timeout here as well.

However, if we don't stop the consumer, is it possible to recover from this state without a provider migration?

Finally, this would enable a relayer to send a MsgChannelCloseConfirm to the provider, which would move the channel’s state to CLOSED without updating the ChainToChannel state.

Is there any way to change this, so that a MsgChannelCloseConfirm would update the ChainToChannel state?

So in conclusion:

  • I support stopping the consumer here, but I want to make sure that this can only happen due to very incorrect code on the consumer, and is very unlikely to happen "by accident".
  • Why do we send a nack in response to an invalid slash packet? It seems like it would be better to just ignore those packets to avoid having the consumer stop in this code here.

err := k.StopConsumerChain(ctx, chainID, true)
if err != nil {
panic(fmt.Errorf("consumer chain failed to stop: %w", err))
}
return
}
// set the VSC send timestamp for this packet;
// note that the VSC send timestamp are set when the packets
Expand Down