Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panics on failure to send IBC packets #731

Closed
2 tasks
andrey-kuprianov opened this issue Feb 10, 2023 · 0 comments · Fixed by #876
Closed
2 tasks

Panics on failure to send IBC packets #731

andrey-kuprianov opened this issue Feb 10, 2023 · 0 comments · Fixed by #876
Assignees
Labels
source: audit To indicate an issue found during an audit. type: bug Issues that need priority attention -- something isn't working

Comments

@andrey-kuprianov
Copy link

Surfaced from @informalsystems audit on Interchain Security, at commit 463ec20

Problem

Currently, both provider and consumer chains would panic and halt if sending of IBC packets fails for the reason different than ErrClientNotActive. This problem is known to the team (see #649), and no correcting actions are needed before the ICS 1.0 release, but the problem needs to be alleviated before ICS 1.1 release (before the first consumer chain goes live).

Closing criteria

Do not panic on failure to send IBC packets.

Problem details

In the implementation under audit, namely in the functions SendVSCPacketsToChain() on provider, and SendPackets() on consumer, the code panics if sending of IBC packets fails. As these functions are called from EndBlockers, the respective chain will halt on error. This has indeed happened in a testnet, when sending of an IBC packet failed because of an expired LightClient (see issue ICS#435). A partial fix has been introduced, in which the packets are first queued, and only then sent, with errors about light client expiration are logged, and the unsent packets remain queued. The resulting code looks similar on both provider and consumer; here is the relevant part of the provider code:

		// send packet over IBC
		err := utils.SendIBCPacket(...)

		if err != nil {
			if clienttypes.ErrClientNotActive.Is(err) {
				// IBC client is expired!
				// leave the packet data stored to be sent once the client is upgraded
				// the client cannot expire during iteration (in the middle of a block)
				k.Logger(ctx).Debug("IBC client is expired, cannot send VSC, leaving packet data stored:", "chainID", chainID, "vscid", data.ValsetUpdateId)
				return
			}
			// TODO do not panic if the send fails
			// https://github.com/cosmos/interchain-security/issues/649
			panic(fmt.Errorf("packet could not be sent over IBC: %w", err))
		}

As can be seen, upon receiving an error from sending an IBC packet, if the error is of one specific type (ErrClientNotActive), the function stops processing; on any other returned error the function will panic, and thus halt the provider. The problem with the presented approach is that it introduces an implicit assumption on the behavior of code that sends IBC packets, namely that the only error that it may return under normal circumstances is ErrClientNotActive. But even if this is true currently (though normal circumstances is hard to define), it can be seen that the function ibc-go/modules/core/04-channel/keeper/packet.go:SendPacket(): a) contains at least 10 different errors it may return, and b) is updated relatively frequently, so even if the above condition holds now, it may seize to hold upon the next IBC update.

Problem Scenarios

If, under some combination of parameters, or upon an IBC update, the IBC packet send returns an unexpected error, the provider or consumer chain will halt. The crux of the problem is in the immediate and harmful reaction to a problem that may have transient character.

Recommendation

In general, we recommend following the modular approach outlined in the issue ICS#627. Considering this specific case, the situation is much simplified by the fact that all infrastructure for delayed actions is already in place: IBC packets are queued both on the provider and on the consumer, and even if the packet can't be sent now, it can be resent later. Thus, we recommend the following:

  • Remove the aforementioned panics on failure to send IBC packets
  • Remove the special handling of the error ErrClientNotActive
  • On any error, instead of panicking, issue an event, as well as a log message, describing the problem, and return from the function.
  • Implement the monitoring solution that would watch for such events or log entries, and notify the human operators (e.g. validators) of the problem that ICS is unable to solve.

In that way, no immediate harmful actions will be taken wrt. either provider or consumer, and the problem will be delegated to humans for further actions; the ICS packets will then simply wait in the respective queues until the correcting actions are taken.

TODOs

  • Labels have been added for issue
  • Issue has been added to the ICS project
@mpoke mpoke added the source: audit To indicate an issue found during an audit. label Feb 22, 2023
@mpoke mpoke added this to the ICS v1.1.0 milestone Feb 23, 2023
@mpoke mpoke added this to Cosmos Hub Feb 23, 2023
@github-project-automation github-project-automation bot moved this to 🩹 Triage in Cosmos Hub Feb 23, 2023
@mpoke mpoke moved this from 🩹 Triage to 📥 Todo in Cosmos Hub Feb 23, 2023
@mpoke mpoke modified the milestones: ICS v1.1.0, ICS next release Mar 5, 2023
@mpoke mpoke modified the milestones: ICS next release, Gaia v10.0.0 Mar 13, 2023
@mpoke mpoke self-assigned this Apr 22, 2023
@mpoke mpoke moved this from 📥 Todo to 🏗 In progress in Cosmos Hub Apr 22, 2023
@mpoke mpoke added the type: bug Issues that need priority attention -- something isn't working label Apr 24, 2023
@mpoke mpoke assigned shaspitz and mpoke and unassigned mpoke and shaspitz Apr 26, 2023
@mpoke mpoke assigned shaspitz and unassigned mpoke May 16, 2023
@mpoke mpoke moved this from 🏗 In progress to 👀 In review in Cosmos Hub May 22, 2023
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cosmos Hub May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
source: audit To indicate an issue found during an audit. type: bug Issues that need priority attention -- something isn't working
Projects
Status: ✅ Done
Development

Successfully merging a pull request may close this issue.

3 participants