Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: update StateMap to take borrowee of keys #644

Closed
wants to merge 5 commits into from

Conversation

vlopes11
Copy link
Contributor

Description

This commit updates the API of StateMap to behave as common map implementations in Rust: the argument of a get method will receive a value that can be borrowed from the concrete key type.

This is useful for recurrent situations such as when we want to perform a get from a map of Vec<u8> without cloning the key index, by using a slice.

The focus of the PR is to touch the public API of the map and trait, without changing too much the internals. A new issue is introduced to propose an upgrade: #643

It also updates the public interface of the storage to take references by default, instead of owned values. It will allow the user to perform storage queries without necessarily cloning the values.

Linked Issues

Testing

An example of the new API is included as doctest

This commit updates the API of `StateMap` to behave as common map
implementations in Rust: the argument of a get method will receive a
value that can be borrowed from the concrete key type.

This is useful for recurrent situations such as when we want to perform
a get from a map of `Vec<u8>` without cloning the key index, by using a
slice.

The focus of the PR is to touch the public API of the map and trait,
without changing too much the internals. A new issue is introduced to
propose an upgrade: #643

It also updates the public interface of the storage to take references
by default, instead of owned values. It will allow the user to perform
storage queries without necessarily cloning the values.

Resolves #427
pub fn as_cache_key(&self) -> &CacheKey {
// Safety: they are currently equivalent
// TODO https://github.com/Sovereign-Labs/sovereign-sdk/issues/643
unsafe { core::mem::transmute(self) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't use #[repr(transparent)], I think rustc is allowed to use whatever layout it wants for storage keys; so while they're likely the same as cache keys in practice we can't guarantee that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there all just wrapper around arc, wouldn't cloning arc just work?

Copy link
Contributor Author

@vlopes11 vlopes11 Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@preston-evans98 I think it's ok for this case since the type declaration is the same. The transparent would be required when we convert from struct to enum, for example. In that case, the compiler would, maybe, use a different layout.

@citizen-stig strictly speaking, cloning the arc into an owned cache key would be just fine as internally it would just increment the reference count to the vec. I'll dive into more details of what I had in mind, but first, a preliminary snippet:

pub fn as_cache_key(&self) -> &CacheKey {
    &CacheKey { key: self.key }
}

The snippet above doesn't work because CacheKey is created inside the function, even tho it contains just a pointer that will be valid for the lifetime of StorageKey.

The idea of having references to the storage key instead of owned storage key has two main pillars:

  • The assumption of storage key being a pointer is an implementation detail of the struct; otherwise, it would implement Copy
  • According to the snippet below, we need a reference to the Arc to satisfy the API constraints
use std::{borrow::Borrow, marker::PhantomData, sync::Arc};

#[derive(Default)]
pub struct Map<K, V> {
    _data: PhantomData<(K, V)>,
}

impl<K, V> Map<K, V> {
    pub fn get<Q>(&self, _k: &Q) -> Option<V>
    where
        Q: ?Sized,
        K: Borrow<Q>,
    {
        todo!()
    }
}

Map::<Vec<u8>, u8>::default().get(&[0][..]); // works fine
Map::<Arc<Vec<u8>>, u8>::default().get(&Arc::new(vec![0u8])); // works too since arc borrows
                                                              // whatever T borrows
Map::<Arc<Vec<u8>>, u8>::default().get(Arc::new(vec![0u8])); // doesn't work since we expect a
                                                             // reference to a borrowable type

The idea of making StorageKey always a reference challenges the assumption passed to the user that cloning this key is fine (in reality it is, but as mentioned above, it is an implementation detail of the struct).

Also, the idea of making StorageKey always a reference is more ergonomic to work with map keys that will, by default, expect references of generics.

Wdyt of this approach?


There is also a rather questionable point with the PR. The previous API was using as_cache_key to convert owned -> owned. Even tho it is a sort of free conversion (pointer to pointer), it is still owned to owned and maybe it should be to_* 🤔

Copy link
Contributor Author

@vlopes11 vlopes11 Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, this solution is meant to be a cheap one until we decide what to do with #643 . If we unify the types, we could even implement copy for the struct and treat it as a pointer.

The intent was to preserve a change scope separation to keep the changeset of the PR minimal, but we can easily integrate the solution of #643 into this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok for this case since the type declaration is the same. The transparent would be required when we convert from struct to enum, for example. In that case, the compiler would, maybe, use a different layout.

Logically I agree, espeically that there's only single inner attribute.

But I still think that this reasoning might be not 100% sound, because instead of relying on contract with compiler, we rely on our understanding of compiler.
Maybe let's just add repr(transparent) to be on the safe side?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I added the transparent directive and a test-only assertion that will pass only if the two structs are exactly equal

7aae44d

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To recap, two options were proposed here:

  1. Clone the Arc.
  2. std::mem::transmute.

I appreciate the the safeguards we added here, and I'm actually convinced it's safe (MIRI agrees), but we're still relying on the internals of the two structs as an implementation detail during the transmute. The first option seems simpler to understand, 100% safe, and quite intuitive. It relies on an implementation detail, which is unfortunate, but the same can be said about this unsafe.

I'd be for dropping this unsafe and clone the Arc instead, which is a safer stopgap until #643 lands.

@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Merging #644 (7aae44d) into nightly (3cc6083) will increase coverage by 0.4%.
The diff coverage is 95.1%.

Files Changed Coverage Δ
...m/utils/sov-first-read-last-write-cache/src/lib.rs 20.0% <ø> (ø)
module-system/sov-state/src/internal_cache.rs 64.7% <88.8%> (ø)
module-system/sov-state/src/scratchpad.rs 93.3% <90.0%> (-0.1%) ⬇️
module-system/sov-state/src/storage.rs 97.2% <90.0%> (-1.2%) ⬇️
module-system/sov-state/src/map.rs 98.4% <97.7%> (+0.8%) ⬆️
module-system/sov-state/src/prover_storage.rs 84.6% <100.0%> (+0.2%) ⬆️
module-system/sov-state/src/zk_storage.rs 57.6% <100.0%> (ø)

... and 4 files with indirect coverage changes

/// {
/// map.get(&key[..], ws)
/// }
/// ```
Copy link
Member

@bkolad bkolad Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an important hidden assumption we need to address. The standard requirement for Borrow is" Eq, Ord, and Hash must be equivalent for borrowed and owned values We are adding another requirement here: serialized version of borrowed and owned value must be equivalent. So it could happen that we have 2 types, A that can be borrowed as B, everything works for some codec, but if someone decides to change the codec to something else via #648 and it will stop working because for example vec and &[] have different representation.

Is it even true for borsh?
Like Vec<_> can be borrowed as slice but it looks like they have a different serialization logic for ZST:

@citizen-stig @preston-evans98 @neysofu WDT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are adding another requirement here: serialized version of borrowed and owned value must be equivalent.

Yes, by adding this requirement we allow to reduce cloning.
Question is, can we express this restriction in the code, so users won't get hit by

Is it even true for borsh?

But what if you have ZST as key in the first place? Would it suppose to break?

So I have this test and it looks like this;

   use sov_state::{DefaultStorageSpec, Prefix, ProverStorage, StateMap};
    use super::*;

    #[test]
    fn check_something() {
        let tmpdir = tempfile::tempdir().unwrap();
        let mut working_set =
            WorkingSet::new(ProverStorage::<DefaultStorageSpec>::with_path(tmpdir.path()).unwrap());

        let prefix = Prefix::new("test".as_bytes().to_vec());
        let map_1: StateMap<Vec<()>, u8> = StateMap::new(prefix.clone());

        let d1 = [(); 2 ^ 64].to_vec();

        map_1.set(&d1, &3, &mut working_set);
        let a = map_1.get(&d1, &mut working_set);
        println!("A: {:?}", a);

        let map_2: StateMap<&'static [()], u8> = StateMap::new(prefix);

        let d2 = [(); 2 ^ 64];
        map_2.set(&&d2[..], &3, &mut working_set);

        let b = map_2.get(&&d2[..], &mut working_set);
        println!("B: {:?}", b);

        assert!(a.is_some());
        assert!(b.is_some());
        assert_eq!(a, b);
    }

Nothing blew up. But probably I misunderstood question.

I hope we can agree that there's usability problem. Often you have only reference to slice of bytes and you need to use it as key, so you have to clone it. Then during set it is cloned again:

impl StorageKey {
    pub fn new<K: BorshSerialize>(prefix: &Prefix, key: &K) -> Self {
        let encoded_key = key.try_to_vec().unwrap();

So we trade off API purity for double cloning. We need to understand that.
And in current scenario, we push burden to our customers. So I think it is in our interest to find solution to this problem. Maybe we need to add some constraints to codec type, so it won't be easy to use coded which has different serialization. Maybe each codec have to implement those kind of checks for types that can be keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is not strong reason, but how likely it is that someone will use ZST for keys?

Also, can we limit CacheKey current implementation not to all AsRef but only to Vec and slice of bytes. That might be option, since that will be most probable use case.

Copy link
Member

@bkolad bkolad Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@citizen-stig
So this

map.get(&vec[()]) 

panics while

map.get(&[();1]);

won't. And in the test probably we could get out of memory when we try to deserialize Vec<()> values which are serialized as very big slices instead of keys.

So it is inconsistent.
The ZST and slice vs vec is just an example of a bigger problem. We added another requirement to Borrow, and all the serialization formats in cartes.io don't care about it (like for bincode something else might break). To use it safely a user has to examine all types which can be borrowed and check if the implementation of the sterilization format they choose doesn't break the assumption (and won't break it in the future). We already saw that it is a broken event in the simplest case.

We could have another trait OurBorrow with the additional requirement documented but still, 3rd party serialization crates won't care.

For HashMap this kind of API works, because everyone in the rust ecosystem agreed on Eq, Ord, and Hash, must be equivalent for borrowed and owned value but this is just a convention. You can easily implement Borrow & Eq or Hash for your type which will break the HashMap. There is no way to express it in type system.

Copy link
Member

@neysofu neysofu Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@citizen-stig your example gave me an idea. You tested Vec<()> against &[()], which have the same encodings in Borsh, so it succeeds. What about arrays, though? Arrays don't have a length marker because there's no need for those, yet they Borrow to slices. This test fails:

#[test]
fn check_something() {
    use super::*;
    use crate::{DefaultStorageSpec, Prefix, ProverStorage};

    let tmpdir = tempfile::tempdir().unwrap();
    let mut working_set =
        WorkingSet::new(ProverStorage::<DefaultStorageSpec>::with_path(tmpdir.path()).unwrap());

    let prefix = Prefix::new("test".as_bytes().to_vec());

    type Key = [u8; 4];
    let map: StateMap<Key, u32> = StateMap::new(prefix.clone());

    let key = [0u8; 4];
    let key_ref_array: &[u8; 4] = <Key as Borrow<[u8; 4]>>::borrow(&key);
    let key_ref_slice: &[u8] = <Key as Borrow<[u8]>>::borrow(&key);

    // We set with &[u8; 4]...
    map.set(key_ref_array, &3, &mut working_set);
    // ...getting with &[u8; 4] works, obviously...
    assert!(map.get(key_ref_array, &mut working_set).is_some());
    // ...but changing the target of the Borrow to &[u8] doesn't.
    assert!(map.get(key_ref_slice, &mut working_set).is_some());
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkolad I didn't see your comment before posting mine, but yeah basically +1.

Copy link
Member

@neysofu neysofu Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility would be to extend StateKeyCodec and shift the responsibility of ensuring that K and Q serialize to the same byte sequence to the codec author, instead of the caller.

pub trait StateKeyCodec<K, Q = K> {
    type KeyError: std::fmt::Debug;

    fn encode_key(&self, key: &Q) -> Vec<u8>;

    fn try_decode_key(&self, bytes: &[u8]) -> Result<K, Self::KeyError>;

    // ...
}

Effectively, the default type parameter Q = K would be an opt-in mechanism. We'd have to modify this PR to allow only the types Q for which the codec is valid: impl<K, V, C, Q> StateMap<K, V, C> where C: StateKeyCodec<K, Q> + StateValueCodec<V>.

Blanket implementations like the following won't support the borrowee pattern, and it'll be up to us to choose when and for which types to add support:

impl<K> StateKeyCodec<K> for BorshCodec
where
    K: borsh::BorshSerialize + borsh::BorshDeserialize,
{
   // ...
}

I'd have to think it more through, but I think it would work. Whether or not it's worth it is another story.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkolad

So

map.get(&vec[()]) 

// panics while

Does not panic for me:

        type Key = Vec<()>;
        let map: StateMap<Key, u32> = StateMap::new(prefix.clone());

        let a = map.get(&vec![()], &mut working_set);
        assert!(a.is_none())

@neysofu , thank you for your example, I can confirm that it does not fetch key correctly =(.

I just want to reiterate my point. I am not trying to prove that ZST types suppose to work. Or that serialization adds extra constraints.

I just want to empathize, that there's undeniably usability problem for people in a simple case of &Vec<u8> or &String. And we have 2 options:

  1. "Sorry devs, that's life, deal with it".
  2. Try to find some practical solution, that will improve usability and bring value to the users.

In my opinion second option is preferable, even if it is harder.
Yes, people can brake HashMap, and that's fine, because everyday usage is more important, than edge cases.
I think the same way if we push for this convention, it can be easier to use our SDK everyday.

So to use it safely a user has to examine all types which can be borrowed and check if the implementation of the sterilization format they choose doesn't break the assumption (and won't break it in the future). We already saw that it is a broken event in the simplest case.

Can you please provide another example, besides ZST, which can brake it?

all types which can be borrowed

All types that are used as keys.

(and won't break it in the future)

Breaking change in serialization library is a big thing on itself. If you have running system and serialization for Vec has changed, you cannot just upgrade.

Anyway.

Other options that we can use:

  • some automatic way to check code base, that anything used in a key, should have same serialization with given coded. We can even add macro that will generate those tests.
  • Update our documentation to make sure that if people select something special for the case, that they should be carefule.

Copy link
Member

@bkolad bkolad Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@citizen-stig we are on the old version of borsh, the check and panic is in the latest master.
I am not against improving it if we can I just don't see any way.
If @neysofu solution works and we don't complicate the code much then ok but I would avoid relaying on documentation for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try putting something together 👍

Comment on lines +47 to 49
pub fn to_cache_key(self) -> CacheKey {
CacheKey { key: self.key }
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vlopes11 I agree with your previous point here that this method is a bit of a misnomer. I think it should be named into_cache_key because it's an owned-to-owned conversion: https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv

pub fn as_cache_key(&self) -> &CacheKey {
// Safety: they are currently equivalent
// TODO https://github.com/Sovereign-Labs/sovereign-sdk/issues/643
unsafe { core::mem::transmute(self) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To recap, two options were proposed here:

  1. Clone the Arc.
  2. std::mem::transmute.

I appreciate the the safeguards we added here, and I'm actually convinced it's safe (MIRI agrees), but we're still relying on the internals of the two structs as an implementation detail during the transmute. The first option seems simpler to understand, 100% safe, and quite intuitive. It relies on an implementation detail, which is unfortunate, but the same can be said about this unsafe.

I'd be for dropping this unsafe and clone the Arc instead, which is a safer stopgap until #643 lands.

@neysofu
Copy link
Member

neysofu commented Aug 23, 2023

Closing in favor of #718 🎉

@neysofu neysofu closed this Aug 23, 2023
@vlopes11 vlopes11 deleted the vlopes11/feature/state-map-borrow branch September 29, 2023 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow SovState use Borrow to behave similar to std::collections::HashMap
5 participants