Specs/ipfs storage engine #8

Merged
merged 4 commits into from
May 24, 2023
108 changes: 108 additions & 0 deletions specs/docs/distributed-storage-engine/ipfs.md
@@ -0,0 +1,108 @@
# IPFS

IPFS is a distributed protocol that lets you replicate data across a network. You can put data into IPFS and get it back as long as the data has not lost liveness, that is, as long as some node still holds it. Data is stored as blocks, and each block is identified by its digest.

## PeerID

PeerID is a unique identifier of a node in the network. It is a hash of the node's public key. The libp2p key pair is handled by its keychain. You can get the PeerID with:

```ts
import { createLibp2p } from "libp2p";

const libp2p = await createLibp2p({});
libp2p.peerId.toString();
```

## CID

A CID is a unique fingerprint of a piece of data: you can access the data as long as you know its exact CID. The CID is calculated with a hash function, but it is not the digest of the data as a whole. Instead, it is calculated from the digests of the data's blocks.

That digest is combined with codec information about the block using multiformats:

- Multihash for information on the algorithm used to hash the data.
- Multicodec for information on how to interpret the hashed data after it has been fetched.
- Multibase for information on how the hashed data is encoded. Multibase is only used in the string representation of the CID.

In our implementation we use CID v1 with `SHA256` + `base58`. I suppose that `poseidon` could be better in the long term, so we need to make a poseidon proposal to `multihash`.
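
To make the multiformats layering above concrete, here is a minimal sketch of how the bytes of a CID v1 could be assembled for a single raw block and then rendered with a `base58` multibase prefix. It assumes the `raw` multicodec (`0x55`) and covers one block only, not a multi-block file. The helper names `base58btcEncode`, `rawCidV1` and `cidToString` are hypothetical; in practice the `multiformats` package does this work.

```typescript
import { createHash } from "node:crypto";

// Bitcoin base58 alphabet, used by the multibase "base58btc" encoding.
const BASE58_ALPHABET =
  "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

// Encode bytes as base58btc by repeated base conversion.
function base58btcEncode(bytes: Uint8Array): string {
  const digits: number[] = []; // base58 digits, least significant first
  for (let i = 0; i < bytes.length; i += 1) {
    let carry = bytes[i];
    for (let j = 0; j < digits.length; j += 1) {
      carry += digits[j] * 256;
      digits[j] = carry % 58;
      carry = Math.floor(carry / 58);
    }
    while (carry > 0) {
      digits.push(carry % 58);
      carry = Math.floor(carry / 58);
    }
  }
  let out = "";
  // Each leading zero byte is written as the first alphabet character.
  for (let i = 0; i < bytes.length && bytes[i] === 0; i += 1) out += "1";
  for (let i = digits.length - 1; i >= 0; i -= 1) {
    out += BASE58_ALPHABET[digits[i]];
  }
  return out;
}

// Binary CID v1 of one raw block:
// <version 0x01><multicodec raw 0x55><multihash: sha2-256 0x12, length 0x20, digest>
function rawCidV1(block: Uint8Array): Uint8Array {
  const digest = createHash("sha256").update(block).digest();
  const cid = new Uint8Array(4 + digest.length);
  cid.set([0x01, 0x55, 0x12, 0x20], 0);
  cid.set(digest, 4);
  return cid;
}

// String form: multibase prefix "z" (base58btc) followed by the encoded bytes.
function cidToString(cid: Uint8Array): string {
  return "z" + base58btcEncode(cid);
}

const exampleCid = rawCidV1(new TextEncoder().encode("hello world"));
console.log(cidToString(exampleCid));
```

The multibase prefix only appears in the string form; the binary CID starts directly with the version byte.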

## IPNS

Each DAG node is immutable, since its CID is derived from its content. In practice, we want the pointer to the data to stay the same even when the data changes. [IPNS](https://docs.ipfs.tech/concepts/ipns/) solves this by providing a permanent pointer (in fact, it is a hash of a public key).

## Merkle DAG

A Merkle DAG is a DAG where each node has an identifier, and this is the result of hashing the node's contents — any opaque payload carried by the node and the list of identifiers of its children — using a cryptographic hash function like SHA256. This brings some important considerations.

Our data will be stored in a sub-Merkle DAG. Every time we alter a leaf, the sub-Merkle DAG node changes as well, and its CID has to be recomputed. This affects our implementation, since we need a metadata file to keep track of CIDs and their children.

We can perform a lookup on a merkle DAG by using the CID of the root node. We can also perform a lookup on a sub-merkle DAG by using the CID of the root node of the sub-merkle DAG. DAG traversal is a recursive process that starts at the root node and ends when the desired node is found. This process is cheap and fast, since it only requires the node identifier.
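
The effect of altering a leaf can be sketched with a toy Merkle DAG. This is only an illustration: the `DagNode` shape and the `nodeId` helper are hypothetical, and the identifier is a plain hex SHA-256 digest rather than a real CID.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory DAG node: an opaque payload plus child links.
interface DagNode {
  payload: string;
  children: DagNode[];
}

// Identifier = SHA-256 over the payload and the identifiers of the children.
function nodeId(node: DagNode): string {
  const links = node.children.map((child) => nodeId(child));
  return createHash("sha256")
    .update(JSON.stringify({ payload: node.payload, links }))
    .digest("hex");
}

const leafA: DagNode = { payload: "a", children: [] };
const leafB: DagNode = { payload: "b", children: [] };
const root: DagNode = { payload: "root", children: [leafA, leafB] };

const before = nodeId(root);
leafB.payload = "b (edited)"; // altering one leaf...
const after = nodeId(root); // ...changes the identifier of every ancestor

console.log("root changed:", before !== after);
console.log("untouched leaf id:", nodeId(leafA));
```

Only the nodes on the path from the edited leaf to the root get new identifiers; the untouched sibling keeps its identifier, which is why a metadata file that tracks parents and children is enough to recompute the affected CIDs.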

## JavaScript IPFS

[js-ipfs](https://github.com/ipfs/js-ipfs) paves the way for the Browser implementation of the IPFS protocol. Written entirely in JavaScript, it runs in a Browser, a Service Worker, a Web Extension and Node.js, opening the door to a world of possibilities.

We switched to [Helia](https://github.com/ipfs/helia) because `js-ipfs` has been discontinued.

## libp2p

libp2p provides the building blocks for p2p applications; together with its modules, it handles everything related to the p2p network. [libp2p](https://github.com/libp2p/js-libp2p) is flexible to use and develop with. To configure and work with libp2p you need to define:

- Transport:
  - [TCP](https://github.com/libp2p/js-libp2p-tcp): The TCP transport module manages connections between nodes natively. TCP operates at the transport layer (layer 4), which makes it more efficient at maintaining connections, but it only works in the `Node.js` runtime.
  - [WebSockets](https://github.com/libp2p/js-libp2p-websockets): In contrast to TCP, WebSockets operate at the application layer (layer 7), which makes them less efficient at maintaining connections, but they work in both `Node.js` and the browser.
- Encryption: [noise](https://github.com/ChainSafe/js-libp2p-noise). We had no other option, since TLS had no implementation for JS.
- Multiplexer:
  - [mplex](https://github.com/libp2p/js-libp2p-mplex): `mplex` is a simple stream multiplexer that was designed in the early days of libp2p. It is a simple protocol that does not provide many features offered by other stream multiplexers. Notably, `mplex` does not provide flow control, a feature which is now considered critical for a stream multiplexer. `mplex` runs over a reliable, ordered pipe between two peers, such as a TCP connection. Peers can open, write to, close, and reset a stream. `mplex` uses a message-based framing layer like yamux, enabling it to multiplex different data streams, including stream-oriented data and other types of messages.
  - [yamux](https://github.com/ChainSafe/js-libp2p-yamux): Yamux (Yet another Multiplexer) is a powerful stream multiplexer used in libp2p. It was initially developed by Hashicorp for Go, and is now implemented in Rust, JavaScript and other languages. It enables multiple parallel streams on a single TCP connection. The design was inspired by SPDY (which later became the basis for HTTP/2), but it is not compatible with it. One of the key features of Yamux is its support for flow control through backpressure. This mechanism helps to prevent data from being sent faster than it can be processed. It allows the receiver to specify an offset up to which the sender can send data, and this offset increases as the receiver processes the data. This helps prevent the sender from overwhelming the receiver, especially when the receiver has limited resources or needs to process complex data. _**Note**: Yamux should be used over mplex in libp2p, as mplex doesn’t provide a mechanism to apply backpressure on the stream level._
- Node discovery: [KAD DHT](https://github.com/libp2p/js-libp2p-kad-dht). The Kademlia Distributed Hash Table (DHT), or Kad-DHT, is a distributed hash table that is designed for P2P networks. Kad-DHT in libp2p is a subsystem based on the [Kademlia whitepaper](https://docs.libp2p.io/concepts/discovery-routing/kaddht/#:~:text=based%20on%20the-,Kademlia%20whitepaper,-.). Kad-DHT offers a way to find nodes and data on the network by using a [routing table](https://docs.libp2p.io/concepts/discovery-routing/kaddht/#:~:text=by%20using%20a-,routing%20table,-that%20organizes%20peers) that organizes peers based on how similar their keys are.

_**Note:** KAD DHT bootstrap didn't work as expected, which is why I connect to the bootstrap nodes directly during construction._

```ts
import { createLibp2p } from "libp2p";
import { multiaddr } from "@multiformats/multiaddr";

// `config` is the libp2p configuration (transports, encryption, multiplexers, DHT) defined elsewhere.
const nodeP2p = await createLibp2p(config);
// Manual patch for node bootstrap: dial the public bootstrap nodes directly.
const addresses = [
  "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
  "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
  "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
  "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
].map((e) => multiaddr(e));
for (let i = 0; i < addresses.length; i += 1) {
  await nodeP2p.dial(addresses[i]);
}
await nodeP2p.start();
```
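
The Kad-DHT entry in the list above says that the routing table organizes peers by how similar their keys are. In Kademlia that similarity is the XOR distance between keys. The following is a minimal sketch of the idea, not the `js-libp2p-kad-dht` API: the helper names are hypothetical, and hashing a plain string label with SHA-256 stands in for the way a real node derives keys from peer IDs.

```typescript
import { createHash } from "node:crypto";

// Map an identifier into the hashed key space (assumption: SHA-256 of a string label).
function kadKey(id: string): Uint8Array {
  return createHash("sha256").update(id).digest();
}

// XOR distance between two equal-length keys, as a byte array.
function xorDistance(a: Uint8Array, b: Uint8Array): Uint8Array {
  const out = new Uint8Array(a.length);
  for (let i = 0; i < a.length; i += 1) out[i] = a[i] ^ b[i];
  return out;
}

// Compare two distances as big-endian unsigned integers.
function compareDistance(a: Uint8Array, b: Uint8Array): number {
  for (let i = 0; i < a.length; i += 1) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Sort candidate peers from closest to farthest relative to a target key.
function closestPeers(target: string, peers: string[]): string[] {
  const targetKey = kadKey(target);
  return peers.slice().sort((p, q) =>
    compareDistance(
      xorDistance(kadKey(p), targetKey),
      xorDistance(kadKey(q), targetKey),
    ),
  );
}

console.log(closestPeers("some-content-key", ["peer-a", "peer-b", "peer-c"]));
```

A lookup repeatedly asks the closest known peers for peers that are even closer to the target, which is why the node needs at least a few reachable peers (here, the bootstrap nodes) to begin with.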

## Helia

[Helia](https://github.com/ipfs/helia) is a new project that handles `ipfs` in a modular manner. You can construct a new instance of `Helia` on top of libp2p.

```ts
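import { createHelia } from "helia";
import { FsBlockstore } from "blockstore-fs";

const heliaNode = await createHelia({
  blockstore: new FsBlockstore("./local-storage"),
  libp2p,
});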
```

Because Helia takes a libp2p instance, its networking is highly configurable.

## UnixFS

To handle file I/O, we use [UnixFS](https://github.com/ipfs/helia-unixfs). It is constructed in the same way as `Helia`, but it takes a `Helia` instance instead of `libp2p`.

```ts
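import { unixfs } from "@helia/unixfs";
import { CID } from "multiformats/cid";

// `heliaNode` is the Helia instance constructed earlier.
const fs = unixfs(heliaNode);
let text = "";
const decoder = new TextDecoder();

// CID.parse throws on invalid input, so no extra validity check is needed.
const testCID = CID.parse("QmdASJKc1koDd9YczZwAbYWzUKbJU73g6YcxCnDzgxWtp3");
console.log("Read:", testCID.toString());
// Stream the file chunk by chunk and decode it as text.
for await (const chunk of fs.cat(testCID)) {
  text += decoder.decode(chunk, {
    stream: true,
  });
}
console.log(text);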
```

After researching `libp2p` and `ipfs`, we introduced `StorageEngineIPFS`, which handles `ipfs` I/O. The details are given in the [specs](./storage-engine.md). In our implementation, we use `datastore-fs` and `blockstore-fs` to persist changes.
74 changes: 74 additions & 0 deletions specs/docs/distributed-storage-engine/storage-engine.md
@@ -0,0 +1,74 @@
## Storage Engine

The Storage Engine handles file storage and the local caching process. It also indexes files for later access.

### IPFS Storage Engine

IPFS Storage Engine is a distributed storage engine based on [IPFS](https://ipfs.tech/). `StorageEngineIPFS` is an implementation of `IFileSystem` and `IFileIndex` that handles all I/O operations and indexing.

```ts
/**
 * Interface of a file engine. The concrete file engine
 * may differ depending on the environment.
 */
export interface IFileSystem<S, T, R> {
  writeBytes(_data: R): Promise<T>;
  write(_filename: S, _data: R): Promise<T>;
  read(_filename: S): Promise<R>;
  remove(_filename: S): Promise<boolean>;
}

/**
 * Methods that perform file indexing and lookup
 */
export interface IFileIndex<S, T, R> {
  publish(_contentID: T): Promise<R>;
  republish(): void;
  resolve(_peerID?: S): Promise<T>;
}

/**
 * IPFS file system
 */
export type TIPFSFileSystem = IFileSystem<string, CID, Uint8Array>;

/**
 * IPFS file index
 */
export type TIPFSFileIndex = IFileIndex<PeerId, CID, IPNSEntry>;
```

The relationship between `StorageEngineIPFS` and other classes/interfaces is shown below:

```mermaid
classDiagram
  LibP2pNode -- StorageEngineIPFS
  Helia -- StorageEngineIPFS
  UnixFS -- StorageEngineIPFS
  IPNS -- StorageEngineIPFS
  IFileSystem <|-- StorageEngineIPFS
  IFileIndex <|-- StorageEngineIPFS
  IFileSystem : writeBytes(data Uint8Array) CID
  IFileSystem : write(filename string, data Uint8Array) CID
  IFileSystem : read(filename string) Uint8Array
  IFileSystem : remove(filename string) boolean
  IFileIndex : publish(contentID CID) IPNSEntry
  IFileIndex : republish() void
  IFileIndex : resolve(peerID PeerId) CID
  StorageEngineIPFS : static getInstance(basePath, config)
```
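
As a rough illustration of the `IFileSystem` contract, the sketch below implements it with in-memory maps. It is not `StorageEngineIPFS`: the class name `InMemoryFileSystem` is hypothetical, and content IDs are hex SHA-256 digests standing in for real `CID` values.

```typescript
import { createHash } from "node:crypto";

// Same shape as the IFileSystem interface in the spec.
interface IFileSystem<S, T, R> {
  writeBytes(_data: R): Promise<T>;
  write(_filename: S, _data: R): Promise<T>;
  read(_filename: S): Promise<R>;
  remove(_filename: S): Promise<boolean>;
}

// Hypothetical stand-in: content IDs are hex SHA-256 digests, not real CIDs.
class InMemoryFileSystem implements IFileSystem<string, string, Uint8Array> {
  private blocks = new Map<string, Uint8Array>(); // content ID -> bytes
  private names = new Map<string, string>(); // filename -> content ID

  // Store raw bytes and return their content-derived identifier.
  async writeBytes(data: Uint8Array): Promise<string> {
    const id = createHash("sha256").update(data).digest("hex");
    this.blocks.set(id, data);
    return id;
  }

  // Store bytes and remember which content ID the filename points to.
  async write(filename: string, data: Uint8Array): Promise<string> {
    const id = await this.writeBytes(data);
    this.names.set(filename, id);
    return id;
  }

  async read(filename: string): Promise<Uint8Array> {
    const id = this.names.get(filename);
    const data = id === undefined ? undefined : this.blocks.get(id);
    if (data === undefined) throw new Error(`File not found: ${filename}`);
    return data;
  }

  // Only the name is removed; the content-addressed block stays in place.
  async remove(filename: string): Promise<boolean> {
    return this.names.delete(filename);
  }
}

// Usage: write a file by name, then read it back.
async function demo(): Promise<string> {
  const storage = new InMemoryFileSystem();
  await storage.write("doc.bson", new TextEncoder().encode("hello"));
  return new TextDecoder().decode(await storage.read("doc.bson"));
}
```

Writing the same bytes under two filenames yields the same content ID, which mirrors how content addressing deduplicates data.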

In our implementation, we use `datastore-fs` and `blockstore-fs` to persist changes to local files. For now, the browser lacks the performance to handle connections and I/O, so the best available solution is to provide a local node that handles all I/O and connections.

#### File mutability

DAG nodes are immutable, so every change produces a new `CID`, and we cannot distribute an updated `CID` every time. `IPNS` is used instead: it creates a record that binds a `CID` to a `PeerID`. Since the `PeerID` does not change, other people can get the `CID` of the zkDatabase as long as we keep the `IPNSEntry` up to date.
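
The idea of a stable name pointing at a changing `CID` can be sketched as follows. This is a simplified model, not the IPNS protocol: `NameRecord` and `NameRegistry` are hypothetical, records here are unsigned, and only the value and a sequence number are modeled.

```typescript
// Hypothetical record shape; real IPNS records are signed and carry more fields.
interface NameRecord {
  value: string; // the CID the name currently points to
  sequence: number; // increases on every publish so that newer records win
}

class NameRegistry {
  private records = new Map<string, NameRecord>(); // name (PeerID) -> record

  // Point the name at a new CID and bump the sequence number.
  publish(peerId: string, cid: string): NameRecord {
    const previous = this.records.get(peerId);
    const record: NameRecord = {
      value: cid,
      sequence: previous === undefined ? 0 : previous.sequence + 1,
    };
    this.records.set(peerId, record);
    return record;
  }

  // Return the CID the name currently points to, if any.
  resolve(peerId: string): string | undefined {
    const record = this.records.get(peerId);
    return record === undefined ? undefined : record.value;
  }
}

const registry = new NameRegistry();
registry.publish("example-peer-id", "cid-of-version-1");
registry.publish("example-peer-id", "cid-of-version-2");
console.log(registry.resolve("example-peer-id"));
```

Readers only need to know the `PeerID`; each `publish` replaces what it resolves to.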

#### Metadata

The metadata file holds a mapping from the data's poseidon hash to its `CID`, which allows us to retrieve the data from IPFS. It is also used to reconstruct the Merkle tree. Metadata is stored on IPFS, and we also keep a copy on the local file system.
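
A minimal sketch of such a mapping is shown below. The on-disk format is an assumption (a JSON array of pairs), the poseidon hashes are treated as opaque strings, and `serializeMetadata` and `deserializeMetadata` are hypothetical helpers rather than part of the implementation.

```typescript
// Hypothetical metadata shape: poseidon hash of the data (string) -> CID (string).
type Metadata = Map<string, string>;

// Serialize with sorted keys so that the same mapping always yields the same bytes.
function serializeMetadata(metadata: Metadata): string {
  const entries = Array.from(metadata.entries()).sort((x, y) =>
    x[0] < y[0] ? -1 : x[0] > y[0] ? 1 : 0,
  );
  return JSON.stringify(entries);
}

// Rebuild the mapping from its serialized form.
function deserializeMetadata(json: string): Metadata {
  return new Map(JSON.parse(json) as [string, string][]);
}

const metadata: Metadata = new Map();
metadata.set("poseidon-hash-of-document-1", "cid-of-document-1");
metadata.set("poseidon-hash-of-document-2", "cid-of-document-2");

const restored = deserializeMetadata(serializeMetadata(metadata));
console.log(restored.get("poseidon-hash-of-document-1"));
```

Deterministic serialization matters here because the metadata file is itself stored on IPFS: identical mappings then produce identical bytes.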

#### BSON Document

BSON, or Binary JSON, is the data format that we use to encode and decode documents. Documents are categorized into collections.
3 changes: 0 additions & 3 deletions zkdb/src/storage/index.ts

This file was deleted.

192 changes: 0 additions & 192 deletions zkdb/src/storage/ipfs-storage.ts

This file was deleted.
