bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) #11726

facundomedica · 2022-04-22T03:14:01Z

Summary of Bug

TL;DR: capabilitytypes.FwdCapabilityKey produces different results on darwin-arm64

A binary compiled for the Apple M1 chip (darwin-arm64) is not compatible with other architectures, causing consensus failure. It happens after a new IBC channel is created (I'm not sure if it's ibc.core.client.v1.MsgUpdateClient ibc.core.channel.v1.MsgChannelOpenTry causing it).

Failed to process committed block (1005:4FB45CC400ADE7B6F640EBE90C51C5EB58C0161D65C9F0B56771914C2F2E6A7D): wrong Block.Header.LastResultsHash.  Expected 520376AE1EC01E88011892D398C88ADA48EB941F6DC975DCA45125C4398A8D44, got E30709AEEC492763438DA9DB0F64E9C69EFDAF237AE10778FFA4C03009347904

This issue seems to affect only darwin-arm64 and not linux-arm64.

Version

v0.45.2

Steps to Reproduce

Start chain on a Linux machine (arm64 or amd64, doesn't matter)
Create an IBC channel
Join with a Mac, it will halt on the block in which the IBC channel was created

Debugging

I tracked down this issue to a mismatch in gas used caused by a "difference in encoding".

Check below the operations and the data they are working with.

My macbook: gas_used:177437
Server: gas_used:177371
Diff: 66

Mac
gas consumed, operation, key/value
1000 ReadFlat ibc/fwd/0x14000ffda48
63 ReadPerByte ibc/fwd/0x14000ffda48 <- +3
42 ReadPerByte ports/transfer
...
2000 WriteFlat ibc/fwd/0x140084adec8
630 WritePerByte ibc/fwd/0x140084adec8 <- +30
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
...
1000 ReadFlat transfer/fwd/0x140084adec8
78 ReadPerByte transfer/fwd/0x140084adec8 <- +3
0 ReadPerByte 
...
2000 WriteFlat transfer/fwd/0x140084adec8
780 WritePerByte transfer/fwd/0x140084adec8 <- +30
1380 WritePerByte capabilities/ports/transfer/channels/channel-0

Linux server
1000 ReadFlat ibc/fwd/0x40010081a8
60 ReadPerByte ibc/fwd/0x40010081a8
42 ReadPerByte ports/transfer
...
2000 WriteFlat ibc/fwd/0x4005b534a8
600 WritePerByte ibc/fwd/0x4005b534a8
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
...
1000 ReadFlat transfer/fwd/0x4005b534a8
75 ReadPerByte transfer/fwd/0x4005b534a8
0 ReadPerByte 
...
2000 WriteFlat transfer/fwd/0x4005b534a8
750 WritePerByte transfer/fwd/0x4005b534a8
1380 WritePerByte capabilities/ports/transfer/channels/channel-0

It's clear that the key that has an extra 1 at the beginning is causing the issue.

To quickly replicate

package main

import (
	"fmt"
	capabilitytypes "github.com/cosmos/cosmos-sdk/x/capability/types"
)

func main() {
	cap := &capabilitytypes.Capability{}
	fmt.Println(string(capabilitytypes.FwdCapabilityKey("ibc", cap)))
}

Run the code above in a linux-arm64 (or any other) and then in a darwin-arm64.

ibc/fwd/0x4000c68290 <- linux-arm64
ibc/fwd/0x140004a7f00 <- darwin-arm64

So I don't know enough about these things to make a fix suggestion (unless removing any leading chars that exceed the length is an acceptable suggestion lol). We should find the Go docs that explain this behavior and act accordingly, but I couldn't find those docs.

For Admin Use

Not duplicate issue
Appropriate labels applied
Appropriate contributors tagged
Contributor assigned/self-assigned

The text was updated successfully, but these errors were encountered:

alexanderbez · 2022-04-22T13:00:30Z

@facundomedica your debugging skills are absolutely unrivaled. Thanks for opening up this issue. Unfortunately, i think there are some cases where we need to use pointers, especially for IBC caps. However, the plus side is that pretty much all operators run in the cloud or even bare metal with standard amd64 architecture. Which is why we haven't seen this in mainet most likely.

I'm not really sure what to do. In the meantime, I'll do some digging in the golang repo to see if there are any similar issues

Tagging folks for visibility.

@faddat @marbar3778 @jackzampolin @ValarDragon

odeke-em · 2022-04-22T15:26:13Z

Kindly cc-ing an ARM expert @elias-orijtech

ValarDragon · 2022-04-22T18:59:21Z

It looks to me that the code is just straight-up unsafe.

https://github.com/cosmos/cosmos-sdk/blob/main/x/capability/types/keys.go#L41-L43

We just can't do %p in any state machine logic. This should not be expected to be consistent across machine architectures. (To be honest, I'm surprised its the same run after run on a single machine)

I don't understand why %p is used at all here to be honest. We should likely add a linter to prevent %p from entering any state machine logic.

ValarDragon · 2022-04-22T19:01:12Z

I feel like everything thats trying to be achieved with pointers, could be achieved with a global counter in the context, and putting counter values in the CapabilityTypes struct.

facundomedica · 2022-04-22T20:07:56Z

Another option would be to make it a fixed-length string with padding. So we make sure that never ever this happens again, at least in 64bit archs. It would increase the gas usage slightly, but it will give us peace of mind.
For this we wouldn't need a migration right? Given that is using the in-memory kv.

ValarDragon · 2022-04-22T20:28:35Z

Thats true, that would short-term work, and can be added pretty quickly. Is doing that something we could get in v0.46.0 ? (cc @marbar3778 @AmauryM , though IDK whose the right person to tag for that question)

I do think the pointer formatting should be aimed to be removed though (perhaps in a separate issue), as imo its code smell tha may cause confusing problems in the future with debugging across any two distinct systems. (I imagine it alters by golang, and OS version as well)

tac0turtle · 2022-04-23T09:27:13Z

Yes lets get this in!! @facundomedica would you like to open a pr?

facundomedica · 2022-04-23T12:08:47Z

BTW, couldn't we just use the capability index instead of its pointer? According to the docs: The index provided to a Capability must be globally unique. (Idk 100% how this code works, just checking)

ValarDragon · 2022-04-23T20:03:07Z

oh, if that already exists that seems far superior to use

facundomedica · 2022-04-24T00:53:13Z

In that case, it's a trivial fix. But I'm afraid it was done like that on purpose, at least that's what I get from reading https://github.com/cosmos/cosmos-sdk/blob/main/docs/architecture/adr-003-dynamic-capability-store.md . Although I don't fully comprehend these docs 😅

ValarDragon · 2022-04-24T01:26:06Z

Yeah, reading through that it feels like they were aware of the problem and added an index, but then also didn't consistently use the index?

Definitely seems to me like an index should work, the whole point of the capmap is theres a bijection between the pointer and the index. This definitely feels like a bug to me.

The only reason I could see for it being claimed to not be a bug would be that "the randomness of a pointer is a feature", which seems just wrong. If thats wanted, just get a random 64 bit padded number as the index. I suspect this is not the case, and its just an accident that slipped through review.

cc @AdityaSripal @colin-axner

AdityaSripal · 2022-04-25T09:36:27Z

Hmm I don't understand how the forward mapping itself could cause an apphash mismatch.

The forward key is stored in the memstore which is not committed in the final apphash. https://github.com/cosmos/cosmos-sdk/blob/main/store/mem/store.go#L56

The fact that the two machines have different pointers is expected behavior. It would be true even if they were 2 Linux machines.

the randomness of a pointer is a feature

This is the intention. See from the ADR

Object-capability keys are currently pointers (memory addresses) of StoreKey structs created at application initialisation in app.go (example) and passed to Keepers as fixed arguments (example). Keepers cannot create or store capability keys during transaction execution — although they could call NewKVStoreKey and take the memory address of the returned struct, storing this in the Merklised store would result in a consensus fault, since the memory address will be different on each machine (this is intentional — were this not the case, the keys would be predictable and couldn't serve as object capabilities).

@ValarDragon:

We just can't do %p in any state machine logic. This should not be expected to be consistent across machine architectures. (To be honest, I'm surprised its the same run after run on a single machine)

The pointers are only being stored in the local node's state. They are not expected to be the same across nodes. Nor is it expected to be the same between different runs (ie. if the node stops and restarts). In fact, it is intentionally part of the design that different nodes would have different pointers to the capability.

What is expected is that while the node is running, the memory location of the capability does not change from the time it was last initialized by the state machine. This possibly has changed with M1???

The reason it was designed this way, was explicitly to keep the capability outside of the state machine.

The reasoning was if the capability was an object in state, then a malicious module or user could simply copy that capability and pass it in.

Instead, the capability key was the in-memory pointer itself. Each node would generate a capability and store it in memory.

In order to authenticate an action that requires a given capability, a caller must pass in the capability (the exact same pointer), rather than just a copy of the capability. Each node will then check this pointer against their stored pointer value.

The caller can get the exact capability on the given node by requesting it from the capability keeper who will authenticate that the caller is authorized to get the original capability pointer.

Since every node is retrieving in-memory capabilities from its own local store, and verifying these capabilities against its own local store. The authentication is not happening directly in the state machine. Each node is independently doing its own in-memory capability check and then returning to the state machine whether the authentication passed.

Arguably this is overengineering given the current SDK security model between modules, but it was a requirement put into the ICS specification.

I'd be interested in seeing the exact tx response that occurs during the channel handshake on the Mac. It would be the first handshake msg to be executed on the chain (Either INIT or TRY). You should be able to retrieve it @facundomedica by querying the txhash that got committed in the channel handshake.

My guess is that the Linux machine is authenticating the capability locally correctly. But the Mac is not authenticating the capability correctly.

This would imply to me that the capability got stored in some memory location during initialization and then the Mac changed the memory location while the node was still running. So that when it was later retrieved the new pointer did not match the stored pointer. I'm not sure if that's possible but it would certainly break capabilities.

Thanks @facundomedica for an excellent breakdown and investigation!

facundomedica · 2022-04-25T11:07:51Z

@AdityaSripal thanks for the detailed explanation! 🙏
About why the hash is different: It's indirectly caused by the FwdKey, because on a Mac it will be higher, when you convert it to hex you will get an extra byte stored in memory KV. That extra byte counts towards gas_used, which is used to calculate the hash.
We have this issue on a live testnet: https://explorer.umeemania-1.network.umee.cc/umee/blocks/1004 query there anything you need and lmk if you would like to put some log statements and I can do that locally :)

alexanderbez · 2022-04-25T13:32:02Z

What @AdityaSripal explained is correct -- we used %p for a reason.

AdityaSripal · 2022-04-25T14:42:14Z

Ahh interesting, so the issue is that the the memory store is metering gas.

Yea. we could standardize pointer length as a fix (pad or truncate to a standard length).

It shouldn't even be breaking, because as mentioned above, it is not committed to state

alexanderbez · 2022-04-25T14:43:23Z

Yeah, but the in-mem store is gas metered and that affects gas_used which does affect the merkle-ized state :-/

So it's an indirect state change

AdityaSripal · 2022-04-25T14:46:24Z

Ahh ok, yea I just saw the discussion on the PR. Yes that's annoying.

At least if you standardize to the same length that linux machines are currently using, It will be technically breaking, but effectively nodes could upgrade independently.

Agreed that we should consider if we should meter gas on memstores, but that's a larger discussion that probably doesn't need to block a fix here

ValarDragon · 2022-04-26T02:52:22Z

I don't think the pointer decision should be deemed ok. To achieve the targetted goal (which I think is in a very weird model as you noted), instead of a global counter, it should use a randomly selected nonce and use that. Basically randomly sample a (say) 64 bit number, from a system seeded rng (random module works). Can easily resample on collisions, so no soundness error applies. (Though again, this threat model is weird, since it assumes an attacker has no memory access to this map... which is still at a fixed memory address)

The literal pointer value provides very little security guarantees, and has very few guarantees from golang itself. Golang doesn't do ASLR https://rain-1.github.io/golang-aslr.html, so it should be trivial to get the underlying pointer value from any occasion where you can do code insertion.

I've viewed the ocap model we use as providing protection, assuming "nicely" written modules, and protecting against accidental bad API usage. If thats still the general view we have of it, then both the counter or random sampling suffice, dependent on the desired level of protection

alexanderbez · 2022-04-26T13:56:17Z

@ValarDragon that's a good point. A PRNG could equally work as well...I think. @AdityaSripal do you have any thoughts on this?

facundomedica · 2022-04-27T15:39:04Z

PRNG could be the answer to all of this, we could even just truncate the number to match the current key length, that way we won't have the gas usage issue (the current fix changes that and increases it)

alexanderbez · 2022-04-27T16:04:07Z

Let's do it. @facundomedica something you wanna tackle? No need for a migration either.

facundomedica · 2022-04-27T16:11:51Z

Sure, should I revert the previous fix or just go at it in a new PR?

ValarDragon · 2022-04-27T16:20:53Z

I'd recommend going at it in a new PR. If something weird happens and new PR is blocked for w/e reason, at least we already have something that can go in v0.46 to solve the consensus divergence

Making it match current key length is a great idea :)

facundomedica · 2022-05-02T14:14:41Z

Couldn't dedicate much time to this, but these are my findings so far about using a PRNG:

We would need to store a link between the capability index and the random number (causing extra reads and writes, causing the gas usage to be higher)
Collisions will cause a consensus failure, given that we would need to try again to find another random number that hasn't been used (again increasing gas usage)

Some other options to consider:

Use base64 to encode the pointer instead of hex. <- This sounds quite solid IMO, so we just do the same thing we are doing now but the base64 allows us to encode bigger numbers with less chars (with just 12 bytes we could encode MAX UINT64).
hashing the pointer + the capability index? Using SHA256 and then using a fix number of chars of the hex representation?

alexanderbez · 2022-05-02T14:37:04Z

Why do you think a PRNG would not suffice? I think a bit extra gas costs should be fine IMO. Also, where would there be anything related to consensus failure on collisions as all of the cap index stuff is local ephemeral in-mem anyway. When you ask x/cap for a index and there is a collision, just try again?

I like your idea of augmenting the current pointer approach, but I think @ValarDragon pointed out limitations about using pointers. Maybe that's moot though.

facundomedica · 2022-05-02T15:20:05Z

I think that collisions would cause the same issue that's described here, in which for some nodes they'll get a random number that's not being used on the first time, and some others will have to retry. And retrying will get them to spend extra gas, which will result in a different gas usage than other nodes (because we are counting even the "exist" queries in the mem store).

alexanderbez · 2022-05-02T15:26:10Z

And retrying will get them to spend extra gas

Why does regenerating another number cost more gas? AFAIK, only the CRUD operations (get, set, etc..) cost gas. So picking a PRN, should be zero cost.

facundomedica · 2022-05-02T17:32:45Z

(@alexanderbez I was talking about using the memstore KV, which counts gas)

After a while of looking at the code, I think I'm not sure how to implement this fix tbh. I have some questions and would love some extra direction here, the solutions I'm thinking of are not addressing the issue.

FwdCapabilityKey is getting the pointer from the capability, if we want to use a random number, should we add another field to the capability? But wouldn't that defeat the entire purpose of having a random number if you could just pass any number you want?

@ValarDragon said the following, which I understand is referring to changing the Index for a random number. But this index global counter is being stored in the persistent KV as far as I can see.

To achieve the targetted goal (which I think is in a very weird model as you noted), instead of a global counter, it should use a randomly selected nonce and use that.

I was pretty confident that this was a straightforward fix (and maybe it is) but I got tangled up in the problem 😅

alexanderbez · 2022-05-02T20:30:38Z

Ahhh, wait you're totally right. In order to determine if there is a collision, you have to perform a read lol.

@ValarDragon what exactly is your suggestion?

I mean technically the solution as it exists today works, it's just not safe cross-platform, so if we could devise the key to not use %p but perhaps some fixed-length encoding of it, then that would suffice.

ValarDragon · 2022-05-03T03:10:11Z

This concrete problem could be solved by an in memory map, thats outside of whats metered by stores. (A map usedNonces map[int64]struct{})

I just think its bad form for us to rely on pointers as a source of RNG. This becomes the sort of thing that has edge cases / odditiess that haunt us far into the future.

I also question the entire threat model here though. I'd be in favor of just removing this entire pointer logic in the IBC spec. (As I noted in my post -- pointers in golang do almost nothing for security over a counter here)

Anyone have context on who was originally in favor of it? Want to make sure we get their perspective / can talk about what they perceive as the benefit.

alexanderbez · 2022-05-03T13:52:03Z

I don't want to yak-shed @ValarDragon. If you think the current threat model is moot and not needed, let's open a separate issue/discussion for that 🙌

@facundomedica, I like @ValarDragon's suggestion. We can use a go-native map outside the gas metered memstore to hold the PRNs and thus avoid the gas costs. Can you go with that?

facundomedica · 2022-05-03T20:57:17Z

I think I'm missing something (or everything lol) here, the answer is not obvious to me at all 😅
At the moment we use the capability's pointer to authenticate it. We would need to add a new field to it adding the PRN. So would adding an exported field (easily modifiable) helps us in any way?

alexanderbez · 2022-05-03T21:37:37Z

The PRN would replace the pointer. So the cap store knows the mapping from capability to PRN.

facundomedica · 2022-05-04T14:05:08Z

After some internal discussion with @alexanderbez we've come to the conclusion that this is not solvable with PRNs given that we would need to attach it somehow to the Capability object making it useless as a security feature. So for now #11737 is enough (fixed-length pointer encoding).

facundomedica changed the title ~~bug: Can't sync darwin-arm64 after creating IBC channel (different gas used)~~ bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) Apr 22, 2022

facundomedica mentioned this issue Apr 23, 2022

fix: Use fixed length hex for pointer at FwdCapabilityKey #11737

Merged

19 tasks

julienrbrt closed this as completed in #11737 Apr 25, 2022

julienrbrt reopened this Apr 25, 2022

facundomedica mentioned this issue May 4, 2022

chore: add missing changelog entry for #11737 #11870

Merged

19 tasks

facundomedica closed this as completed in #11870 May 4, 2022

RafilxTenfen mentioned this issue Jun 21, 2022

Mac M1 binary won't sync (darwin-arm64) umee-network/umee#826

Open

4 tasks

facundomedica mentioned this issue Jul 18, 2022

Cosmos-SDK wrong Block.Header.LastResultsHash #12583

Closed

4 tasks

colin-axner mentioned this issue Aug 8, 2022

fix: Use fixed length hex for pointer at FwdCapabilityKey (backport #11737) #12818

Merged

19 tasks

tac0turtle mentioned this issue Feb 1, 2023

Unit tests fail on Mac M1 using v0.45.12 #14868

Closed

kjessec mentioned this issue Aug 30, 2023

Backport FwdCapabilityKey Patch osmosis-labs/cosmos-sdk#484

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) #11726

bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) #11726

facundomedica commented Apr 22, 2022

alexanderbez commented Apr 22, 2022 •

edited

Loading

odeke-em commented Apr 22, 2022

ValarDragon commented Apr 22, 2022

ValarDragon commented Apr 22, 2022 •

edited

Loading

facundomedica commented Apr 22, 2022

ValarDragon commented Apr 22, 2022

tac0turtle commented Apr 23, 2022

facundomedica commented Apr 23, 2022

ValarDragon commented Apr 23, 2022

facundomedica commented Apr 24, 2022

ValarDragon commented Apr 24, 2022

AdityaSripal commented Apr 25, 2022 •

edited

Loading

facundomedica commented Apr 25, 2022

alexanderbez commented Apr 25, 2022

AdityaSripal commented Apr 25, 2022

alexanderbez commented Apr 25, 2022

AdityaSripal commented Apr 25, 2022

ValarDragon commented Apr 26, 2022 •

edited

Loading

alexanderbez commented Apr 26, 2022

facundomedica commented Apr 27, 2022

alexanderbez commented Apr 27, 2022

facundomedica commented Apr 27, 2022

ValarDragon commented Apr 27, 2022 •

edited

Loading

facundomedica commented May 2, 2022

alexanderbez commented May 2, 2022 •

edited

Loading

facundomedica commented May 2, 2022

alexanderbez commented May 2, 2022

facundomedica commented May 2, 2022

alexanderbez commented May 2, 2022

ValarDragon commented May 3, 2022

alexanderbez commented May 3, 2022

facundomedica commented May 3, 2022

alexanderbez commented May 3, 2022

facundomedica commented May 4, 2022

bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) #11726

bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) #11726

Comments

facundomedica commented Apr 22, 2022

Summary of Bug

Version

Steps to Reproduce

Debugging

To quickly replicate

For Admin Use

alexanderbez commented Apr 22, 2022 • edited Loading

odeke-em commented Apr 22, 2022

ValarDragon commented Apr 22, 2022

ValarDragon commented Apr 22, 2022 • edited Loading

facundomedica commented Apr 22, 2022

ValarDragon commented Apr 22, 2022

tac0turtle commented Apr 23, 2022

facundomedica commented Apr 23, 2022

ValarDragon commented Apr 23, 2022

facundomedica commented Apr 24, 2022

ValarDragon commented Apr 24, 2022

AdityaSripal commented Apr 25, 2022 • edited Loading

facundomedica commented Apr 25, 2022

alexanderbez commented Apr 25, 2022

AdityaSripal commented Apr 25, 2022

alexanderbez commented Apr 25, 2022

AdityaSripal commented Apr 25, 2022

ValarDragon commented Apr 26, 2022 • edited Loading

alexanderbez commented Apr 26, 2022

facundomedica commented Apr 27, 2022

alexanderbez commented Apr 27, 2022

facundomedica commented Apr 27, 2022

ValarDragon commented Apr 27, 2022 • edited Loading

facundomedica commented May 2, 2022

alexanderbez commented May 2, 2022 • edited Loading

facundomedica commented May 2, 2022

alexanderbez commented May 2, 2022

facundomedica commented May 2, 2022

alexanderbez commented May 2, 2022

ValarDragon commented May 3, 2022

alexanderbez commented May 3, 2022

facundomedica commented May 3, 2022

alexanderbez commented May 3, 2022

facundomedica commented May 4, 2022

alexanderbez commented Apr 22, 2022 •

edited

Loading

ValarDragon commented Apr 22, 2022 •

edited

Loading

AdityaSripal commented Apr 25, 2022 •

edited

Loading

ValarDragon commented Apr 26, 2022 •

edited

Loading

ValarDragon commented Apr 27, 2022 •

edited

Loading

alexanderbez commented May 2, 2022 •

edited

Loading