bug: Can't sync darwin-arm64 after creating IBC channel (different gas used / pointer addresses are different) #11726
Comments
@facundomedica your debugging skills are absolutely unrivaled. Thanks for opening up this issue. Unfortunately, I think there are some cases where we need to use pointers, especially for IBC caps. However, the plus side is that pretty much all operators run in the cloud or even on bare metal with standard amd64 architecture, which is most likely why we haven't seen this on mainnet. I'm not really sure what to do. In the meantime, I'll do some digging in the golang repo to see if there are any similar issues. Tagging folks for visibility. |
Kindly cc-ing an ARM expert @elias-orijtech |
It looks to me that the code is just straight-up unsafe. https://github.com/cosmos/cosmos-sdk/blob/main/x/capability/types/keys.go#L41-L43 We just can't do I don't understand why %p is used at all here to be honest. We should likely add a linter to prevent |
I feel like everything that's trying to be achieved with pointers could be achieved with a global counter in the context, and putting counter values in the |
Another option would be to make it a fixed-length string with padding. So we make sure that never ever this happens again, at least in 64bit archs. It would increase the gas usage slightly, but it will give us peace of mind. |
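A minimal sketch of the fixed-length idea, for illustration only. `fixedWidthKey` is a hypothetical helper, not the SDK's code, and the two addresses are made-up values chosen so that one has an extra leading hex digit. It takes the address as a plain `uint64` so the output is reproducible; real code would derive it from the capability pointer.

```go
package main

import "fmt"

// fixedWidthKey is a hypothetical helper (not the SDK implementation).
// It renders the address as exactly 16 zero-padded hex digits, so the
// key length no longer depends on how large the address happens to be.
func fixedWidthKey(module string, addr uint64) string {
	return fmt.Sprintf("%s/fwd/0x%016x", module, addr)
}

func main() {
	// Illustrative addresses only: the second has one more hex digit.
	short := fixedWidthKey("ibc", 0xc000123450)
	long := fixedWidthKey("ibc", 0x14000123450)

	fmt.Println(short, len(short))
	fmt.Println(long, len(long))
}
```

Both keys come out at the same length, which is the property the padding is meant to guarantee on 64-bit architectures.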
That's true, that would work short-term and can be added pretty quickly. Is that something we could get into v0.46.0? (cc @marbar3778 @AmauryM, though IDK who's the right person to tag for that question) I do think the pointer formatting should eventually be removed though (perhaps in a separate issue), as IMO it's a code smell that may cause confusing problems in the future when debugging across any two distinct systems. (I imagine it varies by golang and OS version as well) |
Yes, let's get this in!! @facundomedica would you like to open a PR? |
BTW, couldn't we just use the capability index instead of its pointer? According to the docs: |
oh, if that already exists that seems far superior to use |
In that case, it's a trivial fix. But I'm afraid it was done like that on purpose, at least that's what I get from reading https://github.com/cosmos/cosmos-sdk/blob/main/docs/architecture/adr-003-dynamic-capability-store.md . Although I don't fully comprehend these docs 😅 |
Yeah, reading through that it feels like they were aware of the problem and added an index, but then also didn't consistently use the index? Definitely seems to me like an index should work, the whole point of the The only reason I could see for it being claimed to not be a bug would be that "the randomness of a pointer is a feature", which seems just wrong. If that's wanted, just get a random 64-bit padded number as the index. I suspect this is not the case, and it's just an accident that slipped through review. |
Hmm I don't understand how the forward mapping itself could cause an apphash mismatch. The forward key is stored in the memstore which is not committed in the final apphash. https://github.com/cosmos/cosmos-sdk/blob/main/store/mem/store.go#L56 The fact that the two machines have different pointers is expected behavior. It would be true even if they were 2 Linux machines.
This is the intention. See from the ADR
The pointers are only being stored in the local node's state. They are not expected to be the same across nodes, nor between different runs (i.e. if the node stops and restarts). In fact, it is intentionally part of the design that different nodes would have different pointers to the capability. What is expected is that while the node is running, the memory location of the capability does not change from the time it was last initialized by the state machine. This possibly has changed with M1???
The reason it was designed this way was explicitly to keep the capability outside of the state machine. The reasoning was that if the capability was an object in state, then a malicious module or user could simply copy that capability and pass it in. Instead, the capability key was the in-memory pointer itself. Each node would generate a capability and store it in memory. In order to authenticate an action that requires a given capability, a caller must pass in the capability (the exact same pointer), rather than just a copy of the capability. Each node will then check this pointer against its stored pointer value. The caller can get the exact capability on the given node by requesting it from the capability keeper, which will authenticate that the caller is authorized to get the original capability pointer.
Since every node is retrieving in-memory capabilities from its own local store and verifying these capabilities against its own local store, the authentication is not happening directly in the state machine. Each node is independently doing its own in-memory capability check and then returning to the state machine whether the authentication passed.
Arguably this is overengineering given the current SDK security model between modules, but it was a requirement put into the ICS specification.
I'd be interested in seeing the exact tx response that occurs during the channel handshake on the Mac.
It would be the first handshake msg to be executed on the chain (Either INIT or TRY). You should be able to retrieve it @facundomedica by querying the txhash that got committed in the channel handshake. My guess is that the Linux machine is authenticating the capability locally correctly. But the Mac is not authenticating the capability correctly. This would imply to me that the capability got stored in some memory location during initialization and then the Mac changed the memory location while the node was still running. So that when it was later retrieved the new pointer did not match the stored pointer. I'm not sure if that's possible but it would certainly break capabilities. Thanks @facundomedica for an excellent breakdown and investigation! |
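The pointer-identity check described in this comment can be sketched roughly as follows. This is a simplified illustration, not the x/capability keeper: the `keeper` type, its methods, and the capability name are hypothetical. It only shows that a copy of a capability carries the same field values but a different address, so it fails authentication.

```go
package main

import "fmt"

// Capability is a stand-in type; only its address matters for the check.
// It has a field so that distinct values get distinct addresses.
type Capability struct{ Index uint64 }

// keeper is a hypothetical, heavily simplified capability keeper.
type keeper struct {
	owned map[*Capability]string // in-memory pointer -> capability name
}

// NewCapability allocates a capability and remembers its pointer.
func (k *keeper) NewCapability(name string, index uint64) *Capability {
	c := &Capability{Index: index}
	k.owned[c] = name
	return c
}

// Authenticate succeeds only for the exact pointer handed out earlier.
func (k *keeper) Authenticate(c *Capability, name string) bool {
	got, ok := k.owned[c]
	return ok && got == name
}

func main() {
	k := &keeper{owned: map[*Capability]string{}}
	original := k.NewCapability("ports/transfer", 1)

	// A forged capability with identical contents lives at another address.
	forged := &Capability{Index: original.Index}

	fmt.Println(k.Authenticate(original, "ports/transfer")) // true
	fmt.Println(k.Authenticate(forged, "ports/transfer"))   // false
}
```

Because the map is keyed by address, the pointer values never need to match across nodes or restarts; each node only compares against what it stored itself.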
@AdityaSripal thanks for the detailed explanation! 🙏 |
What @AdityaSripal explained is correct -- we used |
Ahh interesting, so the issue is that the memory store is metering gas. Yeah, we could standardize pointer length as a fix (pad or truncate to a standard length). It shouldn't even be breaking because, as mentioned above, it is not committed to state. |
Yeah, but the in-mem store is gas metered and that affects gas_used which does affect the merkle-ized state :-/ So it's an indirect state change |
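To make the indirect effect concrete, here is a small sketch of length-dependent write metering. The flat and per-byte costs are assumed to mirror the SDK's default KV gas config and should be treated as illustrative; `writeGas` is a hypothetical helper, and the two keys are made-up examples that differ by one character.

```go
package main

import "fmt"

// Costs assumed to mirror the SDK's default KV gas config; illustrative only.
const (
	writeCostFlat    = 2000
	writeCostPerByte = 30
)

// writeGas is a hypothetical helper that charges a flat fee plus a
// per-byte fee for both key and value.
func writeGas(key, value []byte) uint64 {
	return writeCostFlat + writeCostPerByte*uint64(len(key)+len(value))
}

func main() {
	value := []byte("ports/transfer")

	// Illustrative keys: the second pointer string is one character longer.
	keyShort := []byte("ibc/fwd/0xc000123450")
	keyLong := []byte("ibc/fwd/0x14000123450")

	fmt.Println(writeGas(keyShort, value)) // 3020
	fmt.Println(writeGas(keyLong, value))  // 3050
}
```

A single extra byte in an uncommitted key is enough to change gas_used, and gas_used does end up in committed state.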
Ahh ok, yeah, I just saw the discussion on the PR. Yes, that's annoying. At least if you standardize to the same length that Linux machines are currently using, it will be technically breaking, but effectively nodes could upgrade independently. Agreed that we should consider whether we should meter gas on memstores, but that's a larger discussion that probably doesn't need to block a fix here. |
I don't think the pointer decision should be deemed OK. To achieve the targeted goal (which I think is in a very weird model, as you noted), instead of a global counter, it should use a randomly selected nonce. Basically randomly sample a (say) 64-bit number from a system-seeded RNG ( The literal pointer value provides very few security guarantees, and has very few guarantees from golang itself. Golang doesn't do ASLR https://rain-1.github.io/golang-aslr.html, so it should be trivial to get the underlying pointer value from any occasion where you can do code insertion. I've viewed the ocap model we use as providing protection assuming "nicely" written modules, and protecting against accidental bad API usage. If that's still the general view we have of it, then either the counter or random sampling suffices, depending on the desired level of protection. |
@ValarDragon that's a good point. A PRNG could equally work as well...I think. @AdityaSripal do you have any thoughts on this? |
PRNG could be the answer to all of this, we could even just truncate the number to match the current key length, that way we won't have the gas usage issue (the current fix changes that and increases it) |
Let's do it. @facundomedica something you wanna tackle? No need for a migration either. |
Sure, should I revert the previous fix or just go at it in a new PR? |
I'd recommend going at it in a new PR. If something weird happens and the new PR is blocked for w/e reason, at least we already have something that can go in v0.46 to solve the consensus divergence. Making it match the current key length is a great idea :) |
Couldn't dedicate much time to this, but these are my findings so far about using a PRNG:
Some other options to consider:
|
Why do you think a PRNG would not suffice? I think a bit of extra gas cost should be fine IMO. Also, where would there be anything related to consensus failure on collisions, as all of the cap index stuff is local, ephemeral, and in-mem anyway? When you ask x/cap for an index and there is a collision, just try again? I like your idea of augmenting the current pointer approach, but I think @ValarDragon pointed out limitations about using pointers. Maybe that's moot though. |
I think that collisions would cause the same issue that's described here: some nodes will get an unused random number on the first try, and others will have to retry. Retrying makes them spend extra gas, which results in a different gas usage than other nodes (because we are counting even the "exist" queries in the mem store). |
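The collision concern can be illustrated with a small deterministic sketch. Everything here is hypothetical: `pickIndex` stands in for "draw random candidates until one is free", the candidate lists stand in for what two nodes' RNGs might produce, and `hasCost` is an assumed flat charge per existence check in a metered store.

```go
package main

import "fmt"

// hasCost is an assumed flat gas charge per existence check.
const hasCost = 1000

// pickIndex is a hypothetical helper. It walks the candidates in order,
// charges hasCost for every existence check, and returns the first
// candidate that is not already in use.
func pickIndex(candidates []uint64, used map[uint64]bool) (index, gas uint64, ok bool) {
	for _, c := range candidates {
		gas += hasCost
		if !used[c] {
			return c, gas, true
		}
	}
	return 0, gas, false
}

func main() {
	used := map[uint64]bool{42: true}

	// Node A draws an unused number on its first try.
	_, gasA, _ := pickIndex([]uint64{7}, used)
	// Node B first draws a number that collides, then retries.
	_, gasB, _ := pickIndex([]uint64{42, 7}, used)

	fmt.Println(gasA, gasB) // 1000 2000
}
```

Both nodes end up with a valid index, but they consumed different amounts of gas for the same transaction, which is the same class of divergence as the key-length mismatch.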
Why does regenerating another number cost more gas? AFAIK, only the CRUD operations (get, set, etc..) cost gas. So picking a PRN, should be zero cost. |
(@alexanderbez I was talking about using the memstore KV, which counts gas) After a while of looking at the code, I think I'm not sure how to implement this fix tbh. I have some questions and would love some extra direction here, the solutions I'm thinking of are not addressing the issue.
@ValarDragon said the following, which I understand is referring to changing the Index for a random number. But this index global counter is being stored in the persistent KV as far as I can see.
I was pretty confident that this was a straightforward fix (and maybe it is) but I got tangled up in the problem 😅 |
Ahhh, wait you're totally right. In order to determine if there is a collision, you have to perform a read lol. @ValarDragon what exactly is your suggestion? I mean technically the solution as it exists today works, it's just not safe cross-platform, so if we could devise the key to not use |
This concrete problem could be solved by an in-memory map that's outside of what's metered by stores. (A map I just think it's bad form for us to rely on pointers as a source of RNG. This becomes the sort of thing that has edge cases / oddities that haunt us far into the future. I also question the entire threat model here though. I'd be in favor of just removing this entire pointer logic in the IBC spec. (As I noted in my post -- pointers in golang do almost nothing for security over a counter here) Anyone have context on who was originally in favor of it? Want to make sure we get their perspective / can talk about what they perceive as the benefit. |
I don't want to yak-shed @ValarDragon. If you think the current threat model is moot and not needed, let's open a separate issue/discussion for that 🙌 @facundomedica, I like @ValarDragon's suggestion. We can use a go-native map outside the gas metered memstore to hold the PRNs and thus avoid the gas costs. Can you go with that? |
I think I'm missing something (or everything lol) here, the answer is not obvious to me at all 😅 |
The PRN would replace the pointer. So the cap store knows the mapping from capability to PRN. |
After some internal discussion with @alexanderbez we've come to the conclusion that this is not solvable with PRNs given that we would need to attach it somehow to the Capability object making it useless as a security feature. So for now #11737 is enough (fixed-length pointer encoding). |
Summary of Bug
TL;DR: capabilitytypes.FwdCapabilityKey produces different results on darwin-arm64.
A binary compiled for the Apple M1 chip (darwin-arm64) is not compatible with other architectures, causing consensus failure. It happens after a new IBC channel is created (I'm not sure if it's ibc.core.client.v1.MsgUpdateClient or ibc.core.channel.v1.MsgChannelOpenTry causing it).
This issue seems to affect only darwin-arm64 and not linux-arm64.
Version
v0.45.2
Steps to Reproduce
Debugging
I tracked down this issue to a mismatch in gas used caused by a "difference in encoding".
Check below the operations and the data they are working with.
It's clear that the key that has an extra 1 at the beginning is causing the issue.
To quickly replicate
Run the code above in a linux-arm64 (or any other) and then in a darwin-arm64.
So I don't know enough about these things to make a fix suggestion (unless removing any leading chars that exceed the length is an acceptable suggestion lol). We should find the Go docs that explain this behavior and act accordingly, but I couldn't find those docs.
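For readers who want to observe the effect themselves, the following is a hypothetical stand-in, not the snippet from the original report. It assumes the forward key is built by formatting the capability pointer with %p after a module prefix; `fwdKey` and the local `Capability` type are illustrative names. Running it on different OS/architecture combinations and comparing the printed lengths should show whether the pointer string differs in width.

```go
package main

import "fmt"

// Capability is a local stand-in type for illustration.
type Capability struct{ Index uint64 }

// fwdKey imitates, as an assumption, a forward key that embeds a
// %p-formatted pointer after the module name.
func fwdKey(module string, c *Capability) []byte {
	return []byte(fmt.Sprintf("%s/fwd/%p", module, c))
}

func main() {
	c := &Capability{Index: 1}
	key := fwdKey("ibc", c)

	// The hex digits, and possibly their count, depend on where the
	// runtime places the allocation on this platform.
	fmt.Printf("%s (len %d)\n", key, len(key))
}
```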