Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) #11726

Closed
4 tasks
facundomedica opened this issue Apr 22, 2022 · 34 comments · Fixed by #11737 or #11870

Comments

@facundomedica
Copy link
Member

Summary of Bug

TL;DR: capabilitytypes.FwdCapabilityKey produces different results on darwin-arm64

A binary compiled for the Apple M1 chip (darwin-arm64) is not compatible with other architectures, causing consensus failure. It happens after a new IBC channel is created (I'm not sure if it's ibc.core.client.v1.MsgUpdateClient ibc.core.channel.v1.MsgChannelOpenTry causing it).

Failed to process committed block (1005:4FB45CC400ADE7B6F640EBE90C51C5EB58C0161D65C9F0B56771914C2F2E6A7D): wrong Block.Header.LastResultsHash.  Expected 520376AE1EC01E88011892D398C88ADA48EB941F6DC975DCA45125C4398A8D44, got E30709AEEC492763438DA9DB0F64E9C69EFDAF237AE10778FFA4C03009347904

This issue seems to affect only darwin-arm64 and not linux-arm64.

Version

v0.45.2

Steps to Reproduce

  1. Start chain on a Linux machine (arm64 or amd64, doesn't matter)
  2. Create an IBC channel
  3. Join with a Mac, it will halt on the block in which the IBC channel was created

Debugging

I tracked down this issue to a mismatch in gas used caused by a "difference in encoding".

Check below the operations and the data they are working with.

My macbook: gas_used:177437
Server: gas_used:177371
Diff: 66
Mac
gas consumed, operation, key/value
1000 ReadFlat ibc/fwd/0x14000ffda48
63 ReadPerByte ibc/fwd/0x14000ffda48 <- +3
42 ReadPerByte ports/transfer
...
2000 WriteFlat ibc/fwd/0x140084adec8
630 WritePerByte ibc/fwd/0x140084adec8 <- +30
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
...
1000 ReadFlat transfer/fwd/0x140084adec8
78 ReadPerByte transfer/fwd/0x140084adec8 <- +3
0 ReadPerByte 
...
2000 WriteFlat transfer/fwd/0x140084adec8
780 WritePerByte transfer/fwd/0x140084adec8 <- +30
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
Linux server
1000 ReadFlat ibc/fwd/0x40010081a8
60 ReadPerByte ibc/fwd/0x40010081a8
42 ReadPerByte ports/transfer
...
2000 WriteFlat ibc/fwd/0x4005b534a8
600 WritePerByte ibc/fwd/0x4005b534a8
1380 WritePerByte capabilities/ports/transfer/channels/channel-0
...
1000 ReadFlat transfer/fwd/0x4005b534a8
75 ReadPerByte transfer/fwd/0x4005b534a8
0 ReadPerByte 
...
2000 WriteFlat transfer/fwd/0x4005b534a8
750 WritePerByte transfer/fwd/0x4005b534a8
1380 WritePerByte capabilities/ports/transfer/channels/channel-0

It's clear that the key that has an extra 1 at the beginning is causing the issue.

To quickly replicate

package main

import (
	"fmt"
	capabilitytypes "github.com/cosmos/cosmos-sdk/x/capability/types"
)

func main() {
	cap := &capabilitytypes.Capability{}
	fmt.Println(string(capabilitytypes.FwdCapabilityKey("ibc", cap)))
}

Run the code above in a linux-arm64 (or any other) and then in a darwin-arm64.

ibc/fwd/0x4000c68290 <- linux-arm64
ibc/fwd/0x140004a7f00 <- darwin-arm64

So I don't know enough about these things to make a fix suggestion (unless removing any leading chars that exceed the length is an acceptable suggestion lol). We should find the Go docs that explain this behavior and act accordingly, but I couldn't find those docs.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@facundomedica facundomedica changed the title bug: Can't sync darwin-arm64 after creating IBC channel (different gas used) bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) Apr 22, 2022
@alexanderbez
Copy link
Contributor

alexanderbez commented Apr 22, 2022

@facundomedica your debugging skills are absolutely unrivaled. Thanks for opening up this issue. Unfortunately, i think there are some cases where we need to use pointers, especially for IBC caps. However, the plus side is that pretty much all operators run in the cloud or even bare metal with standard amd64 architecture. Which is why we haven't seen this in mainet most likely.

I'm not really sure what to do. In the meantime, I'll do some digging in the golang repo to see if there are any similar issues

Tagging folks for visibility.

@faddat @marbar3778 @jackzampolin @ValarDragon

@odeke-em
Copy link
Collaborator

Kindly cc-ing an ARM expert @elias-orijtech

@ValarDragon
Copy link
Contributor

It looks to me that the code is just straight-up unsafe.

https://github.com/cosmos/cosmos-sdk/blob/main/x/capability/types/keys.go#L41-L43

We just can't do %p in any state machine logic. This should not be expected to be consistent across machine architectures. (To be honest, I'm surprised its the same run after run on a single machine)

I don't understand why %p is used at all here to be honest. We should likely add a linter to prevent %p from entering any state machine logic.

@ValarDragon
Copy link
Contributor

ValarDragon commented Apr 22, 2022

I feel like everything thats trying to be achieved with pointers, could be achieved with a global counter in the context, and putting counter values in the CapabilityTypes struct.

@facundomedica
Copy link
Member Author

Another option would be to make it a fixed-length string with padding. So we make sure that never ever this happens again, at least in 64bit archs. It would increase the gas usage slightly, but it will give us peace of mind.
For this we wouldn't need a migration right? Given that is using the in-memory kv.

@ValarDragon
Copy link
Contributor

Thats true, that would short-term work, and can be added pretty quickly. Is doing that something we could get in v0.46.0 ? (cc @marbar3778 @AmauryM , though IDK whose the right person to tag for that question)

I do think the pointer formatting should be aimed to be removed though (perhaps in a separate issue), as imo its code smell tha may cause confusing problems in the future with debugging across any two distinct systems. (I imagine it alters by golang, and OS version as well)

@tac0turtle
Copy link
Member

Yes lets get this in!! @facundomedica would you like to open a pr?

@facundomedica
Copy link
Member Author

BTW, couldn't we just use the capability index instead of its pointer? According to the docs: The index provided to a Capability must be globally unique. (Idk 100% how this code works, just checking)

@ValarDragon
Copy link
Contributor

oh, if that already exists that seems far superior to use

@facundomedica
Copy link
Member Author

In that case, it's a trivial fix. But I'm afraid it was done like that on purpose, at least that's what I get from reading https://github.com/cosmos/cosmos-sdk/blob/main/docs/architecture/adr-003-dynamic-capability-store.md . Although I don't fully comprehend these docs 😅

@ValarDragon
Copy link
Contributor

Yeah, reading through that it feels like they were aware of the problem and added an index, but then also didn't consistently use the index?

Definitely seems to me like an index should work, the whole point of the capmap is theres a bijection between the pointer and the index. This definitely feels like a bug to me.

The only reason I could see for it being claimed to not be a bug would be that "the randomness of a pointer is a feature", which seems just wrong. If thats wanted, just get a random 64 bit padded number as the index. I suspect this is not the case, and its just an accident that slipped through review.

cc @AdityaSripal @colin-axner

@AdityaSripal
Copy link
Member

AdityaSripal commented Apr 25, 2022

Hmm I don't understand how the forward mapping itself could cause an apphash mismatch.

The forward key is stored in the memstore which is not committed in the final apphash. https://github.com/cosmos/cosmos-sdk/blob/main/store/mem/store.go#L56

The fact that the two machines have different pointers is expected behavior. It would be true even if they were 2 Linux machines.


the randomness of a pointer is a feature

This is the intention. See from the ADR

Object-capability keys are currently pointers (memory addresses) of StoreKey structs created at application initialisation in app.go (example) and passed to Keepers as fixed arguments (example). Keepers cannot create or store capability keys during transaction execution — although they could call NewKVStoreKey and take the memory address of the returned struct, storing this in the Merklised store would result in a consensus fault, since the memory address will be different on each machine (this is intentional — were this not the case, the keys would be predictable and couldn't serve as object capabilities).

@ValarDragon:

We just can't do %p in any state machine logic. This should not be expected to be consistent across machine architectures. (To be honest, I'm surprised its the same run after run on a single machine)

The pointers are only being stored in the local node's state. They are not expected to be the same across nodes. Nor is it expected to be the same between different runs (ie. if the node stops and restarts). In fact, it is intentionally part of the design that different nodes would have different pointers to the capability.

What is expected is that while the node is running, the memory location of the capability does not change from the time it was last initialized by the state machine. This possibly has changed with M1???

The reason it was designed this way, was explicitly to keep the capability outside of the state machine.

The reasoning was if the capability was an object in state, then a malicious module or user could simply copy that capability and pass it in.

Instead, the capability key was the in-memory pointer itself. Each node would generate a capability and store it in memory.

In order to authenticate an action that requires a given capability, a caller must pass in the capability (the exact same pointer), rather than just a copy of the capability. Each node will then check this pointer against their stored pointer value.

The caller can get the exact capability on the given node by requesting it from the capability keeper who will authenticate that the caller is authorized to get the original capability pointer.

Since every node is retrieving in-memory capabilities from its own local store, and verifying these capabilities against its own local store. The authentication is not happening directly in the state machine. Each node is independently doing its own in-memory capability check and then returning to the state machine whether the authentication passed.

Arguably this is overengineering given the current SDK security model between modules, but it was a requirement put into the ICS specification.


I'd be interested in seeing the exact tx response that occurs during the channel handshake on the Mac. It would be the first handshake msg to be executed on the chain (Either INIT or TRY). You should be able to retrieve it @facundomedica by querying the txhash that got committed in the channel handshake.

My guess is that the Linux machine is authenticating the capability locally correctly. But the Mac is not authenticating the capability correctly.

This would imply to me that the capability got stored in some memory location during initialization and then the Mac changed the memory location while the node was still running. So that when it was later retrieved the new pointer did not match the stored pointer. I'm not sure if that's possible but it would certainly break capabilities.


Thanks @facundomedica for an excellent breakdown and investigation!

@facundomedica
Copy link
Member Author

@AdityaSripal thanks for the detailed explanation! 🙏
About why the hash is different: It's indirectly caused by the FwdKey, because on a Mac it will be higher, when you convert it to hex you will get an extra byte stored in memory KV. That extra byte counts towards gas_used, which is used to calculate the hash.
We have this issue on a live testnet: https://explorer.umeemania-1.network.umee.cc/umee/blocks/1004 query there anything you need and lmk if you would like to put some log statements and I can do that locally :)

@alexanderbez
Copy link
Contributor

What @AdityaSripal explained is correct -- we used %p for a reason.

@AdityaSripal
Copy link
Member

Ahh interesting, so the issue is that the the memory store is metering gas.

Yea. we could standardize pointer length as a fix (pad or truncate to a standard length).

It shouldn't even be breaking, because as mentioned above, it is not committed to state

@alexanderbez
Copy link
Contributor

Yeah, but the in-mem store is gas metered and that affects gas_used which does affect the merkle-ized state :-/

So it's an indirect state change

@AdityaSripal
Copy link
Member

Ahh ok, yea I just saw the discussion on the PR. Yes that's annoying.

At least if you standardize to the same length that linux machines are currently using, It will be technically breaking, but effectively nodes could upgrade independently.

Agreed that we should consider if we should meter gas on memstores, but that's a larger discussion that probably doesn't need to block a fix here

@ValarDragon
Copy link
Contributor

ValarDragon commented Apr 26, 2022

I don't think the pointer decision should be deemed ok. To achieve the targetted goal (which I think is in a very weird model as you noted), instead of a global counter, it should use a randomly selected nonce and use that. Basically randomly sample a (say) 64 bit number, from a system seeded rng (random module works). Can easily resample on collisions, so no soundness error applies. (Though again, this threat model is weird, since it assumes an attacker has no memory access to this map... which is still at a fixed memory address)

The literal pointer value provides very little security guarantees, and has very few guarantees from golang itself. Golang doesn't do ASLR https://rain-1.github.io/golang-aslr.html, so it should be trivial to get the underlying pointer value from any occasion where you can do code insertion.

I've viewed the ocap model we use as providing protection, assuming "nicely" written modules, and protecting against accidental bad API usage. If thats still the general view we have of it, then both the counter or random sampling suffice, dependent on the desired level of protection

@alexanderbez
Copy link
Contributor

@ValarDragon that's a good point. A PRNG could equally work as well...I think. @AdityaSripal do you have any thoughts on this?

@facundomedica
Copy link
Member Author

PRNG could be the answer to all of this, we could even just truncate the number to match the current key length, that way we won't have the gas usage issue (the current fix changes that and increases it)

@alexanderbez
Copy link
Contributor

Let's do it. @facundomedica something you wanna tackle? No need for a migration either.

@facundomedica
Copy link
Member Author

Sure, should I revert the previous fix or just go at it in a new PR?

@ValarDragon
Copy link
Contributor

ValarDragon commented Apr 27, 2022

I'd recommend going at it in a new PR. If something weird happens and new PR is blocked for w/e reason, at least we already have something that can go in v0.46 to solve the consensus divergence

Making it match current key length is a great idea :)

@facundomedica
Copy link
Member Author

Couldn't dedicate much time to this, but these are my findings so far about using a PRNG:

  • We would need to store a link between the capability index and the random number (causing extra reads and writes, causing the gas usage to be higher)
  • Collisions will cause a consensus failure, given that we would need to try again to find another random number that hasn't been used (again increasing gas usage)

Some other options to consider:

  • Use base64 to encode the pointer instead of hex. <- This sounds quite solid IMO, so we just do the same thing we are doing now but the base64 allows us to encode bigger numbers with less chars (with just 12 bytes we could encode MAX UINT64).
  • hashing the pointer + the capability index? Using SHA256 and then using a fix number of chars of the hex representation?

@alexanderbez
Copy link
Contributor

alexanderbez commented May 2, 2022

Why do you think a PRNG would not suffice? I think a bit extra gas costs should be fine IMO. Also, where would there be anything related to consensus failure on collisions as all of the cap index stuff is local ephemeral in-mem anyway. When you ask x/cap for a index and there is a collision, just try again?

I like your idea of augmenting the current pointer approach, but I think @ValarDragon pointed out limitations about using pointers. Maybe that's moot though.

@facundomedica
Copy link
Member Author

I think that collisions would cause the same issue that's described here, in which for some nodes they'll get a random number that's not being used on the first time, and some others will have to retry. And retrying will get them to spend extra gas, which will result in a different gas usage than other nodes (because we are counting even the "exist" queries in the mem store).

@alexanderbez
Copy link
Contributor

And retrying will get them to spend extra gas

Why does regenerating another number cost more gas? AFAIK, only the CRUD operations (get, set, etc..) cost gas. So picking a PRN, should be zero cost.

@facundomedica
Copy link
Member Author

(@alexanderbez I was talking about using the memstore KV, which counts gas)

After a while of looking at the code, I think I'm not sure how to implement this fix tbh. I have some questions and would love some extra direction here, the solutions I'm thinking of are not addressing the issue.

FwdCapabilityKey is getting the pointer from the capability, if we want to use a random number, should we add another field to the capability? But wouldn't that defeat the entire purpose of having a random number if you could just pass any number you want?

@ValarDragon said the following, which I understand is referring to changing the Index for a random number. But this index global counter is being stored in the persistent KV as far as I can see.

To achieve the targetted goal (which I think is in a very weird model as you noted), instead of a global counter, it should use a randomly selected nonce and use that.

I was pretty confident that this was a straightforward fix (and maybe it is) but I got tangled up in the problem 😅

@alexanderbez
Copy link
Contributor

Ahhh, wait you're totally right. In order to determine if there is a collision, you have to perform a read lol.

@ValarDragon what exactly is your suggestion?

I mean technically the solution as it exists today works, it's just not safe cross-platform, so if we could devise the key to not use %p but perhaps some fixed-length encoding of it, then that would suffice.

@ValarDragon
Copy link
Contributor

This concrete problem could be solved by an in memory map, thats outside of whats metered by stores. (A map usedNonces map[int64]struct{})

I just think its bad form for us to rely on pointers as a source of RNG. This becomes the sort of thing that has edge cases / odditiess that haunt us far into the future.

I also question the entire threat model here though. I'd be in favor of just removing this entire pointer logic in the IBC spec. (As I noted in my post -- pointers in golang do almost nothing for security over a counter here)

Anyone have context on who was originally in favor of it? Want to make sure we get their perspective / can talk about what they perceive as the benefit.

@alexanderbez
Copy link
Contributor

I don't want to yak-shed @ValarDragon. If you think the current threat model is moot and not needed, let's open a separate issue/discussion for that 🙌

@facundomedica, I like @ValarDragon's suggestion. We can use a go-native map outside the gas metered memstore to hold the PRNs and thus avoid the gas costs. Can you go with that?

@facundomedica
Copy link
Member Author

I think I'm missing something (or everything lol) here, the answer is not obvious to me at all 😅
At the moment we use the capability's pointer to authenticate it. We would need to add a new field to it adding the PRN. So would adding an exported field (easily modifiable) helps us in any way?

@alexanderbez
Copy link
Contributor

The PRN would replace the pointer. So the cap store knows the mapping from capability to PRN.

@facundomedica
Copy link
Member Author

After some internal discussion with @alexanderbez we've come to the conclusion that this is not solvable with PRNs given that we would need to attach it somehow to the Capability object making it useless as a security feature. So for now #11737 is enough (fixed-length pointer encoding).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants