Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Rust lazy reader #77

Merged
merged 15 commits into from
Apr 3, 2024

Conversation

XuJiandong
Copy link
Collaborator

@XuJiandong XuJiandong commented Aug 11, 2023

In the current implementation, the molecule requires that all data be loaded into memory before deserialization. This is a significant limitation in smart contracts as ckb-vm only has 4M memory. This PR aims to resolve this issue by implementing a lazy reader.

If we examine the Molecule Spec, we can retrieve specific data by navigating through "hops". By reading only the header, we can estimate where to navigate and avoid reading the rest of the data. In many scenarios where only certain parts of the data are required, a lazy reader mechanism can be utilized.

Reading data from a data source via syscall is costly. It would be advantageous to read more data for future use with each syscall. Currently, there is support for caching every reading. For further information, refer to read_at (in Rust).

Main changes:

  1. Add a new plugin, named "rust-lazy-reader"
  2. Add test cases
  3. Add fuzzing test

Finally, it is ported from original repo: https://github.com/XuJiandong/moleculec-c2

@XuJiandong XuJiandong marked this pull request as ready for review August 17, 2023 03:00
bindings/rust/src/lazy_reader.rs Outdated Show resolved Hide resolved
bindings/rust/src/lazy_reader.rs Outdated Show resolved Hide resolved
bindings/rust/src/lazy_reader.rs Outdated Show resolved Hide resolved
bindings/rust/src/lazy_reader.rs Show resolved Hide resolved
@quake
Copy link
Member

quake commented Aug 22, 2023

In the existing rust reader code, we usually use XxxReader::from_compatible_slice to deserialize the data and verify the molecule format:

let xxx_reader = XxxReader::from_compatible_slice(raw_data.as_slice()).map_err(|_| ...)

But in the new lazy reader code, I can't find a similar method to do this, could you please give an example of how to do this?

@quake
Copy link
Member

quake commented Aug 22, 2023

could you implement the iterator for lazy reader:

pub fn iter<'t>(&'t self) -> #reader_iterator<'t, 'r> {

it's useful to write code like:

let tx = lazy_load_tx();
for input in tx.inputs().iter() {
   ...
}

@XuJiandong
Copy link
Collaborator Author

In the existing rust reader code, we usually use XxxReader::from_compatible_slice to deserialize the data and verify the molecule format:

let xxx_reader = XxxReader::from_compatible_slice(raw_data.as_slice()).map_err(|_| ...)

But in the new lazy reader code, I can't find a similar method to do this, could you please give an example of how to do this?

If a slice data is already loaded into memory, it's not the scenario to lazy reader.

@quake
Copy link
Member

quake commented Aug 23, 2023

If a slice data is already loaded into memory, it's not the scenario to lazy reader.

I mean the lazy reader lacks a way to verify the data format of the slice after lazy loading.

@XuJiandong
Copy link
Collaborator Author

XuJiandong commented Aug 23, 2023

If a slice data is already loaded into memory, it's not the scenario to lazy reader.

I mean the lazy reader lacks a way to verify the data format of the slice after lazy loading.

We can verify format like this:

fn verify(slice: &[u8], compatible: bool) -> molecule::error::VerificationResult<()> {

bindings/rust/src/lazy_reader.rs Outdated Show resolved Hide resolved
bindings/rust/src/lazy_reader.rs Outdated Show resolved Hide resolved
@XuJiandong
Copy link
Collaborator Author

XuJiandong commented Aug 25, 2023

In the existing rust reader code, we usually use XxxReader::from_compatible_slice to deserialize the data and verify the molecule format:

let xxx_reader = XxxReader::from_compatible_slice(raw_data.as_slice()).map_err(|_| ...)

But in the new lazy reader code, I can't find a similar method to do this, could you please give an example of how to do this?

Find the example here: https://github.com/XuJiandong/molecule/blob/b23d1f92df8944d1286dd6f7c93a218edf66d98e/bindings/rust/src/lazy_reader.rs#L485

With cursor, we can construct data structures directly or via Into trait.

bindings/rust/src/lazy_reader.rs Outdated Show resolved Hide resolved
}

impl LazyReaderGenerator for ast::Array {
fn gen_rust<W: io::Write>(&self, output: &mut W) -> io::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For array of bytes, please add an additional function so one can read the full data directly into a memory buffer, see this for an example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way of solving this problem, is that we can add Iterator support for fixvec as well.

`out of bound` will occur when `reader` try to read the data beyond that.
reader: interface to read underlying data
*/
pub fn new(total_size: usize, reader: Box<dyn Read>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think about this API: it suits the case when you are initializing a Cursor from a Vec<u8> structure, but what if you are initializing in a smart contract, from a syscall?

We would need to first build a reader implementing Read trait by reading syscalls, however, to initialize a Cursor, you would also need total_size, which is doable by... doing a syscall.

Thinking it another way, a DataSource can well be initialized by a single syscall, which loads enough data for the cache field, and also gets the total size of data in IO.

By requiring total_size here, we first need a syscall to get the total size, then DataSource requires another syscall to fill in the cache part. We are wasting one syscall here.

This is worth thinking, when refactoring the code related to DataSource.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The total_size is required for safety/security reason: we can always check its bounds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure on this: why total_size enhances security?

bindings/rust/src/lazy_reader.rs Show resolved Hide resolved
Ok(())
}

pub fn validate(&self) -> Result<(), Error> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a good idea to revisit here, and distinguish functions that are used internally, and externally used functions. It really seems to me that this validate function, is used internally in the Cursor? In that sense there is not need to put that pub here.

And there might also be other methods in this structure that are only required internally, so we might want to think about those.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the comment below, I don't really see any operation that cannot be covered by a slicing operation. We can provide a single slicing operation for cutting a cursor into any desirable range. Then the only place we need to do validation, is within the slicing operation.

)))
} else {
let offset = calculate_offset(item_size, item_index, NUMBER_SIZE)?;
cur2.add_offset(offset)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a lot of patterns like this, where we clone a cursor, add something to its offset, then manipulate the size by directly changing it, or via sub_size. Are we really doing slicing here? Can we fix those code by a single slicing operation(and also removing both add_offset and sub_size)? That will simplify the code a lot and reduce the attack vector

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me a slicing operation is all you need here. Depending on the exact API of slicing operation, we can do one of the following 2:

mol2_slice_by_length(cursor, offset, cursor->length() - offset);

Or

mol2_slice_by_end(cursor, offset, cursor->end());

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you check "calculate_offset" carefully: they shows in code 4 times and shares nothing common.
We can't get any benefit doing so. note: there are slice_by_offset, slice_by_index already

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agre with this assessment. We are mixing 2 independent things together:

  • How do I safely get a slice of the data from a cursor, given an offset and either a length, or and end value?
  • Molecule uses indices to maintain molecule offsets, how do I calculate the offsets of a certain field?

The fact that we have slice_by_offset, slice_by_index and other functions all working on a single cursor type feels extremely messy to me. If I were doing this, I would design 2 separate components:

  • A cursor only has a slicing operation that takes an offset and a length, internally, it checks if the offset and length are vaild, if not, an error would be thrown. If they are indeed valid, a new cursor will be returned.
  • An offset calculation function that only calculates the offset for dynvec or a table. If you really want it, you can have slice_by_index function that first calculates the offset, then calls the slicing operation above to build the final cursor. I personally find it messy and possible dangerous to have a dynvec_slice_by_index function, that directly manipulates a cursor, this is not a proper abstraction. When the cursor logic requires altering, we will need to touch a ton of functions here, ensuring they all obey cursor's validation rules. This is a nightmare to maintain.

}

pub fn slice_by_offset(&self, offset: usize, size: usize) -> Result<Cursor, Error> {
let mut cur2 = self.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code reads better if we simply calculate the new offset, then create a new cursor using new offset, size, and data source.

@yangby-cryptape
Copy link
Collaborator

yangby-cryptape commented Jan 16, 2024

@quake @xxuejie @eval-exec @XuJiandong

Hi, folks, I don't use this feature, so I don't have opinions on this PR, so would you reach agreement on this PR.
We can merge it in next version (0.9.x).

@XuJiandong
Copy link
Collaborator Author

@quake @xxuejie @eval-exec @XuJiandong

Hi, folks, I don't use this feature, so I don't have opinions on this PR, so would you reach agreement on this PR. We can merge it in next version (0.9.x).

I will address the issues raised by @xxuejie when I have the time. Currently, I'm busy with other matters.

XuJiandong and others added 3 commits January 31, 2024 16:01
* Support iterator
* Change read_at as method
* sub_size returns Result
* Check alignment on get_item_count
* Snake names
* Some typos
* Add verify methods
* Add iterator test cases
* Make lazy_reader.rs panic free
* Add impl_cursor_primitive
* Format error strings
* support custom unicode id
* Array get item return Vec<u8>
* union uses Enum
@XuJiandong
Copy link
Collaborator Author

XuJiandong commented Jan 31, 2024

In the latest commit, We have improved following:

  • Support custom union id
  • Array is returned as [u8; N]
  • Use Enum as union id

quake and others added 4 commits March 18, 2024 17:59
* Refactor errors
Remove strings because it takes more space in contracts.
* Check read_len == 0 on lazy_reader.rs: read_at
* Add test case for zero byte union
* Add clone
* fix length on read
* add slice_by_start

---------

Co-authored-by: joii2020 <87224197+joii2020@users.noreply.github.com>
@XuJiandong XuJiandong requested a review from joii2020 April 3, 2024 02:46
@yangby-cryptape yangby-cryptape merged commit 9190849 into nervosnetwork:master Apr 3, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants