Update client may cause "new header has a time from the future" chain error #1445

Closed

ancazamfir opened this issue Oct 11, 2021 · 16 comments · Fixed by #1477
Labels
I: dependencies Internal: related to dependencies I: logic Internal: related to the relaying logic I: rpc Internal: related to (g)RPC I: spec Internal: related to IBC specifications
Milestone
11.2021
ancazamfir (Collaborator) commented Oct 11, 2021

Crate

relayer

Summary of Bug

@mircea-c has reported that the following errors are seen once in a while when updating the sifchain client (07-tendermint-395) on cosmoshub:

Oct 11 09:13:03.057 DEBUG ThreadId(42) [cosmoshub-4] simulated gas: Err(
    0: GRPC call return error status status: InvalidArgument, message: "failed to execute message; message index: 0: cannot update client with ID 07-tendermint-395: failed to verify header: invalid header: new header has a time from the future 2021-10-11 13:12:56.673817244 +0000 UTC (now: 2021-10-11 13:12:51.047099028 +0000 UTC; max clock drift: 5s): invalid request", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc"} }

After looking at the sifchain configuration (client created with clock_drift = 5s) and comparing block times (5-6 sec on sifchain, 7-13 sec on the hub), the following seems to be the issue.
Here is an example (- is 1 sec, b are blocks on sifchain, B are blocks on the hub).

	   1:ev 2:clUpdate
           |    |  |
           |    | 3: fetch header (h)
           v    v  |
b----b-----b----b-----b----b      (sifchain 5/6 sec blocktimes)
                |  | 4: simulate/ send Tx
                |  v
B-------B------------B------B     (cosmoshub 7-13 sec)
        ^       |
        | 5: create context with latest header (H)
        |       |
        v       v
        <-------> 6: check that h.timestamp - drift < H.timestamp (aka now)

The steps are:
1 - an IBC event occurs that requires a clientUpdate
2 - hermes starts to build the client update message and
3 - fetches the header h from sifchain (this happens for the header at event height + 1)
4 - sends a simulate request to the hub (can happen on send_tx as well)
5 - the chain receives the request and creates a context with the latest header H on the hub
6 - the client handler is called with that context and the message. Among other things it checks that h.timestamp is not "in the future", i.e. that H.timestamp + client_state.clock_drift > h.timestamp

(one 13 sec example on the hub:
h:7969646 - ts:16:31:24
h:7969647 - ts:16:31:37
)
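
To make the check in step 6 concrete, here is a minimal sketch of the acceptance condition. This is a simplified model of the on-chain 07-tendermint check, not Hermes code; note that "now" here is the destination's latest header time (H.timestamp), not wall-clock time, and representing timestamps as nanoseconds since the epoch is purely an illustration choice:

```rust
use std::time::Duration;

// Simplified model of the step-6 check. The real check lives in the chain's
// 07-tendermint client handler; `dst_latest` plays the role of "now"
// (H.timestamp). Timestamps are nanoseconds since the Unix epoch.
fn header_not_from_future(h_timestamp: u128, dst_latest: u128, clock_drift: Duration) -> bool {
    h_timestamp < dst_latest + clock_drift.as_nanos()
}
```

With the values from the error above, H.timestamp + clock_drift = 13:12:51.05 + 5s = 13:12:56.05, which is earlier than h.timestamp = 13:12:56.67, so the header is rejected.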

One problem is that the clock_drift of 5 sec is not enough for clients of sifchain on the hub, though it might be appropriate for clients of sifchain on other chains. This in itself hints at the difficulty of sizing this parameter properly. Also, changing the configuration alone is not enough to fix the issue: it would require new clients, connections, and channels to be created, and applications to switch to them.

The other solution is to make hermes' update-client code more robust. This issue is to investigate and, if possible, fix this on the hermes side.
cc @adizere @romac
A number of ideas were mentioned: hermes could perform the H.timestamp + client_state.clock_drift > h.timestamp check itself when updating tendermint clients, or drop the tx on the specific tx simulate (or tx_send()) error, to be picked up by the clear-packet mechanism or by retries for the channel/connection handshake.

Version

all

Steps to Reproduce

We should be able to reproduce this with a local setup: create two chains with different block times, and create a client with a small clock_drift value.

Acceptance Criteria

Ideally this error should never be seen with correct headers. The impact on hermes operation should be minimal with "bad" headers, e.g. one with a timestamp far in the future.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
ValarDragon commented Oct 12, 2021

Jacob was reporting this same error (osmosis-labs/osmosis#520 (comment)) potentially also on a sifchain channel, not really sure.

Two questions:

  • Is there any safety issue with increasing max_clock_drift to say 15 seconds?
  • Is it possible to make the error print the counterparty chain's block height and chain ID, to ease debugging? (e.g. we can also go talk to that block proposer to fix their local time)

ancazamfir (Collaborator, Author) commented Oct 13, 2021

Jacob was reporting this same error (osmosis-labs/osmosis#520 (comment)) potentially also on a sifchain channel, not really sure.

Two questions:

  • Is there any safety issue with increasing max_clock_drift to say 15 seconds?

I think it's fine and advisable for new clients. Existing clients would need to be upgraded.

  • Is it possible to make the error print the counterparty chain's block height and chain ID, to ease debugging? (e.g. we can also go talk to that block proposer to fix their local time)

It's not straightforward at exactly the place where the error is issued, but we will find a way to print more info.
The problem, though, might not be an unsynchronized clock in this case.
I noticed that once in a while we have the same proposer for two consecutive blocks and when this happens the block time doubles. I gave one example in the description, here is another one on the hub:
7988809 - 08:37:19
7988810 - 08:37:31
In the few cases I checked the second block is an empty block.

In the example in the description, the third block 7988810 has client update txs.

When suggesting 5s as the default we only accounted for the actual clock drift and not the block time. Not sure if this is something the on-chain client handler should/could take into account when checking for the drift. cc @colin-axner

From the hermes side we should consider the following:

  • change the default
  • document that clock_drift should include a rough estimate of the block time (maybe x 2 to account for the longer block times we see with empty blocks)
  • on client update, verify that the latest header timestamp on the destination chain is recent enough that the update header will not be considered in the future; print a warning log and wait for the next block on the destination (see the sketch after this list)
  • also use the destination chain's clock_drift when creating the client (I think)
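
A minimal sketch of the pre-flight check in the third bullet, assuming hypothetical names (none of this is existing Hermes API):

```rust
use std::time::{Duration, SystemTime};

// If this returns Some(lag), the destination chain would reject the update
// header as "from the future"; hermes could then log a warning and wait for
// the next destination block before submitting. All names are illustrative.
fn future_header_lag(
    h_time: SystemTime,     // update header timestamp fetched from the source
    dst_latest: SystemTime, // latest header timestamp on the destination
    clock_drift: Duration,  // the client's clock drift parameter
) -> Option<Duration> {
    // Mirrors the on-chain acceptance condition: reject (Some) when
    // h.time >= H.time + clock_drift.
    h_time.duration_since(dst_latest + clock_drift).ok()
}
```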

ancazamfir (Collaborator, Author) commented:

Maybe the total clock_drift value used when creating a client needs to consider the worst case, i.e. it should be:
src_chain.clock_drift + dst_chain.clock_drift + dst_chain.block_time
This means we need a separate block_time configured per chain, unless we can estimate it via some query(?). A sketch of this computation is below.
cc @milosevic for help here.
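
As a rough illustration of the sum above (the struct and field names here are hypothetical, not Hermes' actual configuration types):

```rust
use std::time::Duration;

// Hypothetical per-chain settings; Hermes' real config may differ.
struct ChainSettings {
    clock_drift: Duration, // wall-clock drift tolerated for this chain's validators
    block_time: Duration,  // rough estimate of this chain's block interval
}

// Worst-case drift to bake into a client of `src` hosted on `dst`,
// per the formula above.
fn client_clock_drift(src: &ChainSettings, dst: &ChainSettings) -> Duration {
    src.clock_drift + dst.clock_drift + dst.block_time
}
```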

ancazamfir (Collaborator, Author) commented:

doh, just saw that empty block time is configurable via create_empty_blocks_interval. Does this mean that the max block time depends on the max value of create_empty_blocks_interval across validators? And is there any guarantee that create_empty_blocks = true everywhere?

colin-axner (Contributor) commented Oct 13, 2021

Not sure if this is something the on-chain client handler should/ could consider taking into account when checking for the drift

I'm not sure it is necessary to do this on chain. Taking into account the block time drift would require knowing the expected block time. We recently added an on-chain parameter for this (so relayers could use that value). But to take this into account on chain for the light client, we would need to add a new parameter to the ClientState, and that upgrade path is currently very difficult (it is IBC protocol breaking). In the short term, it is better to account for this in MaxClockDrift (although that makes the field misleading). cc @AdityaSripal


Re upgrading existing clients:

Currently I'd say this is infeasible. The only two options are:

  • chain upgrade + custom migration code
  • if AllowUpdateAfterExpiry = true, expire client - submit gov proposal - pass gov proposal

Regular client upgrades cannot be used since this is a custom field. I've been planning to open an issue/discussion around ignoring the Allow fields and allowing governance to override any client and any of its fields. Practically this seems useful.

AdityaSripal commented:

MaxClockDrift is fine to use here to mean: src_chain.clock_drift + dst_chain.clock_drift + dst_chain.block_time

We can just make it incumbent on the relayer who creates the client to set a MaxClockDrift that takes into account the total clock drift possible between the two chains.

Re: Upgrading existing clients

Yes, this is currently not possible. We explicitly prevent custom fields from changing after the fact, to prevent the security parameters of a client from changing after it has been created.

  1. We can update the default ClockDrift to be larger
  2. We can maybe introduce a governance proposal that allows the custom fields of a client to be changed

Regarding what Hermes can do immediately:

You can either perform the check client-side and only submit the header once it is no longer in the future.
You can add a static wait time before submitting the header to the counterparty. This may still occasionally run into problems and it's a bit brittle, but it is trivial to do and will work 99% of the time.

ancazamfir (Collaborator, Author) commented:

We recently added an on chain parameter for this (so relayers could use that value).

Could you please provide more details on this?

But to take this into account on chain for the light client, we would need to add a new parameter to the ClientState?

This would be the local chain's block time added to the client state drift:

```go
if !untrustedHeader.Time.Before(now.Add(maxClockDrift).Add(selfBlockTime)) {
```

assuming I read the code correctly and now is the latest block time on this chain.

You can add a static wait time before submitting the header to the counterparty. This may still occasionally run into problems and it's a bit brittle, but it is trivial to do and will work 99% of the time.

I think we may as well wait for the next block, since nothing can proceed if the client update fails, and it is as trivial as waiting for a fixed duration.

ancazamfir (Collaborator, Author) commented:

You can either perform the check client-side and only submit the header once it is no longer in the future

yup, that's the plan

colin-axner (Contributor) commented:

Could you please provide more details on this?

03-connection now has an on-chain parameter, MaxExpectedTimePerBlock. This is used to enforce block delays for packet processing. It is supposed to represent the maximum expected time it takes to produce a block (any duration longer than that should be an anomaly). It doesn't have a strong definition though; the value is fairly arbitrary to pick. The default is 30 seconds, which I expect most chains will use.
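
For illustration, a sketch of how such a parameter can translate a connection's time-based delay into a block-based delay; whether ibc-go rounds exactly this way is an assumption here:

```rust
// Converts a connection's time delay into a block delay using
// MaxExpectedTimePerBlock (both in nanoseconds). Ceiling division so the
// block delay always covers the full time delay.
fn block_delay(time_delay_ns: u64, max_expected_time_per_block_ns: u64) -> u64 {
    if max_expected_time_per_block_ns == 0 {
        return 0; // no expectation configured: no block delay enforced
    }
    time_delay_ns.div_ceil(max_expected_time_per_block_ns)
}
```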

faddat (Contributor) commented Oct 13, 2021

@ValarDragon this issue is basically all over.

I first saw it over a month ago with osmo-regen, but have since seen it nearly everywhere -- and regardless of the relayer software being used.

faddat (Contributor) commented Oct 13, 2021

@ancazamfir

The best place to see create_empty_blocks_interval in action is the e-money chain, which has variable block times. It is configured for a one minute heartbeat, and will otherwise make blocks on demand. Notional validates e-money from a Raspberry Pi.... because we can :P

@adizere adizere added I: dependencies Internal: related to dependencies I: logic Internal: related to the relaying logic I: rpc Internal: related to (g)RPC I: spec Internal: related to IBC specifications labels Oct 14, 2021
@adizere adizere added this to the 11.2021 milestone Oct 14, 2021
ancazamfir (Collaborator, Author) commented:

@ancazamfir

The best place to see create_empty_blocks_interval in action is the e-money chain, which has variable block times. It is configured for a one minute heartbeat, and will otherwise make blocks on demand. Notional validates e-money from a Raspberry Pi.... because we can :P

Thanks! We will use a similar local testnet for testing the fix.

milosevic (Contributor) commented:

Sorry for being late here. In the light client spec, clockDrift is a parameter dealing with only approximately synchronized clocks. It is used only to constrain what time from the future is allowed, i.e., an acceptable time is smaller than now + clockDrift. In the IBC case, we don't have time that is necessarily synced with some source (for example NTP time), as we use BFT time, so we could in theory have arbitrary clock drifts between chains.

If I am not wrong, @ancazamfir was considering the following scenario: we consider a top header h at chain A and a top header H at chain B, and we try to understand under which condition header h will be considered acceptable, from a time perspective, on chain B. I don't see another way of making this work in all cases unless the relayer checks the top header time on chain B and makes sure that the condition is satisfied; at least from a theoretical perspective, this is the only sound approach in my view, as we could have periods with no blocks on chain B.

In case you try to optimise for the normal case where blocks are continuously being produced on both chains, we need to take into account that, from the perspective of the time of the latest block on chain B, the time of the header on chain A must not be too far in the future. For example, in the worst case the latest header H on chain B was committed roughly at time t - chainB.average_block_time, where t is the time at which header h is committed on chain A (from a bird's-eye perspective), i.e., the latest block on chain B is lagging behind chain A. Then we need the following to be satisfied for h to be accepted:

h.time + clock_drift_A < h.time - clock_drift_B - chainB.average_block_time + clock_drift_B, i.e.,
clock_drift_A + chainB.average_block_time < 0, which is impossible as both are positive values.

My point here is that even if we assume that blocks on both chains are produced continuously (so we know the average block time), and BFT times on both chains are in sync, we can still be in this scenario where the destination chain hasn't yet produced a latest block whose time is close to now, while such a block has been produced on the source chain. In this case, no matter how we choose the parameters, the client update might fail. Does this make sense, or am I misunderstanding something?

ancazamfir (Collaborator, Author) commented:

Thanks Zarko!
It makes sense, although I have some doubts about the math; shouldn't it be:

h.time + clock_drift_A < H.time - clock_drift_B - chainB.average_block_time

which means that hermes could check before the update:

h.time + clock_drift_A + clock_drift_B + chainB.average_block_time < H.time

For this purpose I was proposing creating the client with a "clock_drift" of
clock_drift_A + clock_drift_B + chainB.average_block_time

  • note that clock_drift becomes a misleading name now, but we are stuck with it per the .proto file

So the check would be: h.time + client_state.clock_drift < H.time, exactly what the on-chain handler does.

Now the question is what we do in hermes if this check doesn't hold. The cause could be:

  • the drifts and block times were not estimated/configured properly
  • the header is in fact a bad header which is in the future

We could drop the header and return an error. In most cases the effect will be that the event that triggered the update will be re-processed and will result in an update with a header at a different height.

Another alternative is to instead create the client with drift clock_drift_A + clock_drift_B and, if the check fails:

  • wait for the next block H' on B (*)
  • check again h.time + client_state.clock_drift < H'.time and drop the header if it fails

This one is tempting at first look as it avoids adding yet another configuration parameter and the pain of coming up with a good value (a sketch is included at the end of this comment). But...
(*) the thread/worker may block for a long time if the chain doesn't have a heartbeat, which is probably fine from a functional point of view. We could also cap the wait at some large value, but in that case we would not be able to catch some cases of an in-the-future header.

In either implementation we would alert the user that an in-the-future header was seen so the proposer's clock can be checked.
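
A rough sketch of this alternative, assuming a hypothetical destination-chain interface (not an actual Hermes trait):

```rust
use std::time::{Duration, SystemTime};

// Hypothetical destination-chain handle, for illustration only.
trait DstChain {
    fn latest_header_time(&self) -> SystemTime;
    fn wait_for_next_block(&self);
}

// Wait-and-recheck loop described above: the client was created with drift
// clock_drift_A + clock_drift_B; on failure we wait for the next block H'
// on B and re-check, dropping the header if the deadline is reached.
fn header_acceptable(
    dst: &impl DstChain,
    h_time: SystemTime,           // update header timestamp from chain A
    client_clock_drift: Duration, // clock_drift_A + clock_drift_B
    max_wait: Duration,           // cap so a heartbeat-less chain cannot block forever
) -> bool {
    let deadline = SystemTime::now() + max_wait;
    loop {
        // The check above: h.time + client_state.clock_drift < H'.time.
        if h_time + client_clock_drift < dst.latest_header_time() {
            return true; // safe to submit the client update
        }
        if SystemTime::now() >= deadline {
            return false; // possibly an in-the-future header: drop it and warn
        }
        dst.wait_for_next_block();
    }
}
```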

ancazamfir (Collaborator, Author) commented Oct 18, 2021

@ancazamfir

The best place to see create_empty_blocks_interval in action is the e-money chain, which has variable block times. It is configured for a one minute heartbeat, and will otherwise make blocks on demand. Notional validates e-money from a Raspberry Pi.... because we can :P

@faddat Could you please paste a sample [consensus] section of your config.toml?

jackzampolin commented:

We are also seeing this issue with the golang relayer.
