Improved C plugin for the molecule serialization system

XuJiandong/moleculec-c2

moleculec-c2

An improved C/Rust plugin for the molecule serialization system. The name reads as "molecule c c2": the first "c" stands for compiler, and "c2" is the code name. The existing moleculec is still used here to produce the input for moleculec-c2.

How to use

  • Install "moleculec"
make install-tools
  • Compile Rust code to binary
cargo build --release
  • Generate C/Rust files with moleculec-c2 and moleculec
# generate intermediate json file
moleculec --language - --schema-file mol/blockchain.mol --format json > mol/blockchain.json
# generate C
target/release/moleculec-c2 --input mol/blockchain.json | clang-format -style=Google > tests/blockchain/blockchain-api2.h
# generate Rust
target/release/moleculec-c2 --rust --input mol/blockchain.json | rustfmt > tests/blockchain_rust/src/blockchain.rs
  • Include the generated file in your source file

The json file is an intermediate file.
Piping the output through clang-format -style=Google or rustfmt is optional if you don't care about coding style.


The following areas are improved compared to the old C/Rust API:

Strong type for C

In code that uses the old molecule API, mol_seg_t is everywhere: it behaves like a weak type in a dynamic language (Python, Lua). The C compiler's type system cannot check whether the API is used correctly. With the new API, the type system helps reduce the chance of bugs, checks that the code is written consistently, and gives hints while coding. Here is an example usage of blockchain. You can also browse the generated API for blockchain.
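To make the difference concrete, here is a minimal sketch. The names `seg_t`, `ScriptType`, `HeaderType` and the functions are hypothetical and are not the generated API. They only illustrate how one generic segment type hides mistakes that distinct struct types expose to the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Old style: one generic segment type stands for every schema type.
 * Hypothetical stand-in for mol_seg_t. */
typedef struct {
  const uint8_t *ptr;
  uint32_t size;
} seg_t;

/* New style: one distinct C type per schema type (hypothetical names). */
typedef struct { seg_t seg; } ScriptType;
typedef struct { seg_t seg; } HeaderType;

/* Accepts any segment: the compiler cannot tell a script from a header. */
static uint32_t weak_header_size(seg_t header) { return header.size; }

/* Accepts only a HeaderType pointer. */
static uint32_t strong_header_size(const HeaderType *header) {
  return header->seg.size;
}

static uint32_t demo_strong_types(void) {
  static const uint8_t bytes[8] = {0};
  seg_t raw = {bytes, sizeof(bytes)};
  ScriptType script = {.seg = raw};
  HeaderType header = {.seg = raw};

  /* Compiles silently even though a script is passed as a header. */
  uint32_t weak = weak_header_size(script.seg);

  /* strong_header_size(&script) would be diagnosed by the compiler
   * as an incompatible pointer type. */
  uint32_t strong = strong_header_size(&header);
  return weak + strong;
}
```

The wrapper structs cost nothing at run time; they exist only so that the compiler can tell the schema types apart.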

Extra support for known types

From the Encoding Spec, we know that molecule has no built-in integer types: byte is the only primitive. For example, we can find the following definitions in molecule:

array Uint32 [byte; 4];
array Uint64 [byte; 8];

Suppose a field "version" has type "Uint32". With the old molecule API, the accessor still returns "uint8_t*" instead of "uint32_t".

Now the following type names are reserved for integer types:

  • Uint8, Int8
  • Uint16, Int16
  • Uint32, Int32
  • Uint64, Int64

When they appear in a schema file, they are automatically converted to the corresponding types in the generated files. Here is the mapping list:

| Molecule type | Type name | C Type | Rust Type |
|---|---|---|---|
| byte | / | uint8_t | u8 |
| [byte; 1] | int8 | int8_t | i8 |
| [byte; 1] | uint8 | uint8_t | u8 |
| [byte; 2] | int16 | int16_t | i16 |
| [byte; 2] | uint16 | uint16_t | u16 |
| [byte; 4] | int32 | int32_t | i32 |
| [byte; 4] | uint32 | uint32_t | u32 |
| [byte; 8] | int64 | int64_t | i64 |
| [byte; 8] | uint64 | uint64_t | u64 |
| [byte; N] | / | mol2_cursor_t | Cursor |
| <byte> | / | mol2_cursor_t | Cursor |
| option | / | / | Option<_> |

The type name is case-insensitive. For example, int8, Int8, INT8 are all mapped to int8_t.
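As a rough sketch of what such a conversion involves, the function below turns a `[byte; 4]` value into a `uint32_t`. It assumes the bytes are stored in little-endian order, which is the order Molecule uses for its own header numbers. The name `read_uint32_le` is hypothetical and is not part of the generated API.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 4-byte array as an unsigned integer.
 * Assumption: the schema's Uint32 stores its bytes in little-endian order. */
static uint32_t read_uint32_le(const uint8_t bytes[4]) {
  return (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) |
         ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
}
```

Building the value with shifts rather than a pointer cast keeps the result independent of the host's byte order and alignment rules.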

Load memory on demand

mol_seg_t is the most important data structure in the old molecule API:

typedef struct {
    uint8_t                     *ptr;               // Pointer
    mol_num_t                   size;               // Full size
} mol_seg_t;

It comes with an assumption: the data has already been loaded into memory. That is a poor fit for a system like CKB-VM, which has very limited memory (4M).

As the Molecule Spec shows, when only part of the data is needed, it can be reached through a few "hops": read the header only, work out where to hop next, and skip the remaining data. For the many scenarios that need only part of the data, a load-on-demand mechanism is possible.
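The hops can be sketched for a Molecule table, whose header holds the total size followed by one offset per field, all as little-endian 32-bit numbers. The code below is an illustration under that layout, not the generated reader: `read_fn`, `mem_ctx`, `mem_read` and `table_field_range` are hypothetical names, and the in-memory source only stands in for a real one so that the number of bytes touched can be counted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reader: copies up to `len` bytes starting at `offset`
 * into `buf` and returns the number of bytes copied. */
typedef uint32_t (*read_fn)(void *ctx, uint8_t *buf, uint32_t len,
                            uint32_t offset);

typedef struct {
  const uint8_t *data;
  uint32_t len;
  uint32_t bytes_read; /* counts how much was actually loaded */
} mem_ctx;

static uint32_t mem_read(void *ctx, uint8_t *buf, uint32_t len,
                         uint32_t offset) {
  mem_ctx *m = (mem_ctx *)ctx;
  if (offset >= m->len) return 0;
  uint32_t n = m->len - offset < len ? m->len - offset : len;
  memcpy(buf, m->data + offset, n);
  m->bytes_read += n;
  return n;
}

/* Read one little-endian uint32 at `offset`; returns 0 on success. */
static int read_u32(read_fn read, void *ctx, uint32_t offset, uint32_t *out) {
  uint8_t b[4];
  if (read(ctx, b, 4, offset) != 4) return -1;
  *out = (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16) |
         ((uint32_t)b[3] << 24);
  return 0;
}

/* Locate field `index` of a table by reading header words only.
 * Layout: [total size][offset of field 0][offset of field 1]... */
static int table_field_range(read_fn read, void *ctx, uint32_t index,
                             uint32_t *start, uint32_t *size) {
  uint32_t total, first, end;
  if (read_u32(read, ctx, 0, &total) != 0 || total < 8) return -1;
  if (read_u32(read, ctx, 4, &first) != 0 || first < 8 || first % 4 != 0)
    return -1;
  uint32_t count = first / 4 - 1; /* number of fields in the table */
  if (index >= count) return -1;
  if (read_u32(read, ctx, 4 + 4 * index, start) != 0) return -1;
  if (index + 1 < count) {
    if (read_u32(read, ctx, 4 + 4 * (index + 1), &end) != 0) return -1;
  } else {
    end = total; /* the last field runs to the end of the table */
  }
  if (*start > end || end > total) return -1;
  *size = end - *start;
  return 0;
}
```

Finding a field costs three or four 4-byte reads regardless of how large the table is; the field body is read only if the caller asks for it.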

This load-on-demand mechanism is introduced by the following data structure:

typedef struct mol2_cursor_t {
  uint32_t offset;  // offset of slice
  uint32_t size;    // size of slice
  mol2_data_source_t *data_source;
} mol2_cursor_t;

We have a very simple implementation of "read" field over memory:

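// Data source over a memory buffer: args[0] is the base address and args[1]
// is the buffer length. Copies up to `len` bytes starting at `offset` into
// `ptr` and returns the number of bytes actually copied.
uint32_t mol2_source_memory(uintptr_t args[], uint8_t *ptr, uint32_t len,
                            uint32_t offset) {
  uint32_t mem_len = (uint32_t)args[1];
  ASSERT(offset < mem_len);
  uint32_t remaining_len = mem_len - offset;

  // Clamp the request to what is left in the buffer.
  uint32_t min_len = MIN(remaining_len, len);
  uint8_t *start_mem = (uint8_t *)args[0];
  ASSERT((offset + min_len) <= mem_len);
  memcpy(ptr, start_mem + offset, min_len);
  return min_len;
}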

Another implementation can be built on top of a syscall.
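A syscall-backed source could look roughly like the sketch below, which keeps the same signature as the memory version. Everything here is an assumption for illustration: `load_fn` models a partial-loading call that receives the buffer capacity in `*len` and writes back the number of bytes available from `offset`, and `fake_load` replaces the real syscall so that the sketch runs anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical partial-loading call, standing in for a real syscall.
 * In: *len is the buffer capacity. Out: *len is the number of bytes
 * available from `offset`, which may exceed what was copied. */
typedef int (*load_fn)(void *addr, uint64_t *len, uint64_t offset);

/* Same shape as mol2_source_memory; args[0] points at the loader. */
static uint32_t source_from_loader(uintptr_t args[], uint8_t *ptr,
                                   uint32_t len, uint32_t offset) {
  load_fn load = *(load_fn *)args[0];
  uint64_t avail = len;
  if (load(ptr, &avail, offset) != 0) return 0;
  /* Report only the bytes that actually landed in `ptr`. */
  return avail < len ? (uint32_t)avail : len;
}

/* Stand-in loader backed by a static buffer. */
static const uint8_t fake_data[6] = {1, 2, 3, 4, 5, 6};

static int fake_load(void *addr, uint64_t *len, uint64_t offset) {
  if (offset > sizeof(fake_data)) return -1;
  uint64_t avail = sizeof(fake_data) - offset;
  uint64_t n = avail < *len ? avail : *len;
  memcpy(addr, fake_data + offset, (size_t)n);
  *len = avail;
  return 0;
}
```

Only the requested window is copied on each call, so the full data never has to sit in memory at once.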

When "mol2_cursor_t" is returned from a function, no memory is accessed. As the name "cursor" suggests, it is only a cursor. We can access memory on demand with "mol2_read_at", for example:

    mol2_cursor_t witness_cur = witnesses.tbl->at(&witnesses, 0);
    uint8_t witness[witness_cur.size];
    mol2_read_at(&witness_cur, witness, witness_cur.size);
    assert(witness_cur.size == 3 && witness[0] == 0x12 && witness[1] == 0x34);

The Rust version is much simpler:

impl Read for Vec<u8> {
    fn read(&self, buf: &mut [u8], offset: usize) -> Result<usize, Error> {
        let mem_len = self.len();
        if offset >= mem_len {
            return Err(Error::OutOfBound);
        }

        let remaining_len = mem_len - offset;
        let min_len = min(remaining_len, buf.len());

        if (offset + min_len) > mem_len {
            return Err(Error::OutOfBound);
        }
        buf[0..min_len].copy_from_slice(&self.as_slice()[offset..offset + min_len]);
        Ok(min_len)
    }
}

// same as `make_cursor_from_memory` in C
impl From<Vec<u8>> for Cursor {
    fn from(mem: Vec<u8>) -> Self {
        Cursor::new(MAX_CACHE_SIZE, mem.len(), Box::new(mem))
    }
}

Cache support

When data is read from the data source via a syscall, every syscall is expensive. It helps to read extra data on each syscall for future use, so every read now goes through a cache. See mol2_read_at (in C) or read_at (in Rust) for more information.
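The idea can be sketched as a generic read-ahead cache. This is not how mol2_read_at is implemented; `slow_source`, `cached_reader`, `CACHE_SIZE` and the functions are hypothetical names, and the call counter only stands in for the cost of a syscall.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 64 /* hypothetical cache size */

typedef struct {
  const uint8_t *data; /* stands in for data behind an expensive call */
  uint32_t len;
  uint32_t calls;      /* number of times the source was hit */
} slow_source;

typedef struct {
  slow_source *src;
  uint8_t buf[CACHE_SIZE];
  uint32_t start;  /* offset of buf[0] in the source */
  uint32_t filled; /* number of valid bytes in buf */
} cached_reader;

/* One "expensive" read from the underlying source. */
static uint32_t slow_read(slow_source *s, uint8_t *out, uint32_t len,
                          uint32_t offset) {
  s->calls++;
  if (offset >= s->len) return 0;
  uint32_t n = s->len - offset < len ? s->len - offset : len;
  memcpy(out, s->data + offset, n);
  return n;
}

/* Serve from the cache when the range is already loaded; otherwise refill
 * the cache starting at `offset`, reading ahead up to CACHE_SIZE bytes. */
static uint32_t cached_read(cached_reader *r, uint8_t *out, uint32_t len,
                            uint32_t offset) {
  /* Requests larger than the cache bypass it. */
  if (len > CACHE_SIZE) return slow_read(r->src, out, len, offset);
  int hit = offset >= r->start && offset - r->start <= r->filled &&
            len <= r->filled - (offset - r->start);
  if (!hit) {
    r->start = offset;
    r->filled = slow_read(r->src, r->buf, CACHE_SIZE, offset);
  }
  uint32_t avail = r->filled - (offset - r->start);
  uint32_t n = avail < len ? avail : len;
  memcpy(out, r->buf + (offset - r->start), n);
  return n;
}
```

Several small reads that fall inside the cached window then cost a single call to the source.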


Split declaration and definition for C

The generated header file contains definitions as well as declarations, so by default it can only be included in a single source file. If you need it in multiple source files, split declaration and definition with the following steps:

  1. Define the macro "MOLECULEC_C2_DECLARATION_ONLY" and include the header file
#define MOLECULEC_C2_DECLARATION_ONLY
#include "sample-api2.h"

See here. It can be repeated in every source file if needed.

  2. Include the header file fully in another source file (.c)
#include "sample-api2.h"

See here. It can only be done once.
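The two steps work because of a guard of roughly the following shape. This is a minimal sketch of the pattern, not the generated header: `sample.h` and `sample_add` are hypothetical, and only the macro name comes from the steps above.

```c
/* sample.h (hypothetical): layout of a header that supports both modes. */
#ifndef SAMPLE_H_
#define SAMPLE_H_

#include <stdint.h>

/* Declarations: always visible, safe to include from many source files. */
uint32_t sample_add(uint32_t a, uint32_t b);

#ifndef MOLECULEC_C2_DECLARATION_ONLY
/* Definitions: compiled only where the macro is NOT defined. That must be
 * exactly one source file, or the linker reports duplicate symbols. */
uint32_t sample_add(uint32_t a, uint32_t b) { return a + b; }
#endif /* MOLECULEC_C2_DECLARATION_ONLY */

#endif /* SAMPLE_H_ */
```

Source files that define the macro before the include see only the prototypes, while the one file that includes the header without the macro supplies the function bodies for the whole program.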

For CKB developer

An already generated file, blockchain-api2.h, is available together with molecule2_reader.h; both can be included in a source file directly.

The original mol file is here.
