# ADR #011: Block Data Sync Overhaul: Part I - Storage

## Changelog

  • 23.08.22: Initial unfinished draft
  • 14.09.22: The first finished version

## Authors

  • @Wondertan

I am starting to like writing ADRs step-by-step. There is also a trick that helps: imagine you are talking to a dev who has just joined the team and needs to be onboarded.

## Glossary

## Context

### Status Quo

Current block data synchronization is done over Bitswap by traversing the NMT trees of the rows and columns of the data square quadrants. We know from empirical evidence that it takes more than 200 seconds (~65,000 network requests) to download a 4 MB block of 256-byte shares (erasure coding extends its ~16k original shares to ~65k, each fetched with a request), which is unacceptable and must be much lower than the block time (15/30 seconds).

DASing, on the other hand, shows acceptable metrics for the block sizes we are aiming for initially. For the same block, a DAS operation takes 8 (technically 9) blocking requests of ~50ms each, which is ~400ms in the ideal scenario (excluding disk IO).

Getting data by namespace also needs to be improved. Its current latency lies between BlockSync and DASing: more data means more requests and more time to fulfill them.

### Mini Node Offsite 2022 Berlin

To facilitate and speed up the resolution of the problem, we decided to gather the team in Berlin for four days. With preliminary preparation by @Wondertan and guest @willscott, we were able to find a solution within two days that matches the following requirements:

  • Sync time less than block time (ideally sub-second)
  • Data by namespace less than block time (ideally sub-second)
  • Pragmatic timeframe
    • We need this done before the incentivized testnet
    • We don't have time to redesign the protocol from scratch
  • Keep Bitswap, as it suffices for DAS and solves the data withholding attack
    • Existing Bitswap logic is kept as a fallback mechanism for the case of reconstruction from light nodes
  • Keeping random hash-addressed access to shares for Bitswap to work

## Decision

This ADR intends to outline the design decisions for block data storage. In a nutshell, the decision is to use the CAR format and DAGStore for extended block storage, and a custom p2p Req/Resp protocol for block data syncing (whole blocks and data by namespace ID) in the happy path. The p2p portion will come in the subsequent Part II document.

### Key Design Decisions

  • FNs/BNs store EDSes serialized as CAR files. CAR format provides an efficient way to store Merkle DAG data, like EDS with NMT. It packs such DAG data into a single blob which can be read sequentially in one read and transferred over the wire. Additionally, CARv2 introduces pluggable indexes over the blob allowing efficient random access to shares and NMT Proofs in one read (if the index is cached in memory).

  • FNs/BNs manage a top-level index for hash to CAR block file mapping. Current DASing for LNs requires FNs/BNs to serve simple hash to data requests. The top-level index maps any Share/NMT Proof hash to any block CARv1 file so that FNs/BNs can quickly serve DA requests.

  • FNs/BNs address EDSes by DataRoot.Hash. The only alternative is by height; however, it does not allow block data deduplication in case EDSes are equal and couples the Data layer/pkg with the Header layer/pkg.

  • FNs/BNs run a single instance of DAGStore to manage CAR block files. DAGStore provides the top-level indexing and CARv2-based indexing per each CAR file. In essence, it's an engine for managing any CAR files with indexing, convenient abstractions, tools, recovery mechanisms, etc.

    • EDSes as CARv1 files over CARv2. CARv2 encodes indexes into the file, while DAGStore maintains CARv2-based indexing. Usage of CARv1 keeps only one copy of the index, stores/transfers less metadata per EDS, and simplifies reading EDS from a file.
  • LNs DASing remains untouched. The networking protocol and storage for LNs remain intact as it fulfills the requirements. Bitswap usage as the backbone protocol for requesting samples and global Badger KVStore remain unaffected.

## Detailed Design

All the comments on the API definitions should be preserved and potentially improved by implementations.

The new block storage design is solely additive. All the existing storage-related components and functionality are kept with additional components introduced. Altogether, existing and new components will be recomposed to serve as the foundation of our improved block storage subsystem.

The central data structure representing Celestia block data is the EDS (rsmt2d.ExtendedDataSquare), and the new storage design is focused on storing entire EDSes as a whole rather than as a set of individual chunks, so that the storage subsystem can handle storing and streaming/serving blocks of 4 MB and more.

### EDS Serde

Storing EDS as a whole requires EDS (de)serialization. For this, the CAR format is chosen.

#### share.WriteEDS

To write an EDS into a stream/file, WriteEDS is introduced. Internally, it:

  • Re-imports the EDS similarly to ipld.ImportShares
    • Using a Blockservice with an offline exchange and an in-memory Blockstore
    • With a NodeVisitor that saves only NMT Merkle proofs (no shares) to the Blockstore. NOTE: len(node.Links()) == 2
      • Actual shares are written later, in the particular order explained below
  • Creates and writes the CARv1Header
    • Fills up the Roots field with the EDS.RowRoots/EDS.ColRoots roots converted into CIDs
  • Iterates over shares in quadrant-by-quadrant order via EDS.GetCell (see the iteration sketch below)
    • Writes the shares in row-by-row order within each quadrant
  • Iterates over the in-memory Blockstore and writes the NMT Merkle proofs stored in it

NOTES:

  • CAR provides a utility to serialize any DAG into a file, and there is a way to serialize an EDS into a DAG (share/ipld.ImportShares). This approach is the simplest and traverses shares and Merkle Proofs in a depth-first manner, packing them into a CAR file. However, it is incompatible with the requirement of being able to truncate the CAR file and read only the first quadrant out of it without NMT proofs, so the serialization must differ from that utility to support this.
  • Alternatively to WriteEDS, an EDSReader could be introduced to make EDS-to-stream handling more idiomatic and efficient in some cases, with the cost of more complex implementation.

```go
// WriteEDS writes the whole EDS into the given io.Writer as a CARv1 file,
// serializing all its shares and the recomputed NMT proofs.
func WriteEDS(context.Context, *rsmt2d.ExtendedDataSquare, io.Writer) error
```
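
To make the quadrant-by-quadrant, row-by-row ordering concrete, here is a hedged sketch of the share iteration WriteEDS performs. It assumes the standard EDS layout with the original data in the top-left quadrant; `w` is the target io.Writer and `writeShare` is a hypothetical helper that appends a share as the next CARv1 block.

```go
// Sketch of WriteEDS's share-iteration order (writeShare is a hypothetical helper).
width := int(eds.Width()) // extended width, i.e. 2x the original square
for _, q := range [][2]int{
	{0, 0},                 // Q1: original data (the only quadrant ReadEDS consumes)
	{0, width / 2},         // Q2
	{width / 2, 0},         // Q3
	{width / 2, width / 2}, // Q4
} {
	for row := 0; row < width/2; row++ {
		for col := 0; col < width/2; col++ {
			share := eds.GetCell(uint(q[0]+row), uint(q[1]+col))
			if err := writeShare(w, share); err != nil {
				return err
			}
		}
	}
}
```
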
#### share.ReadEDS

To read an EDS out of a stream/file, ReadEDS is introduced. Internally, it:

  • Imports an EDS with an empty pre-allocated slice. NOTE: The size can be taken from the DataRoot Row/Col length
  • Wraps the given io.Reader with a BlockReader
  • Reads out blocks one by one and fills up the EDS quadrant via EDS.SetCell
    • In total, there should be as many reads as there are shares in a quadrant
  • Recomputes and validates the EDS via EDS.Repair

```go
// ReadEDS reads an EDS quadrant (1/4) from an io.Reader containing a CAR file.
// It strictly expects the first EDS quadrant (top left).
// The returned EDS is guaranteed to be full and valid against the DataRoot; otherwise, ReadEDS errors.
func ReadEDS(context.Context, io.Reader, DataRoot) (*rsmt2d.ExtendedDataSquare, error)
```
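
A hedged round-trip sketch (e.g. in a unit test) showing how the two functions are expected to compose; `eds` and `root` are assumed fixtures, with `root` being the DataRoot computed for `eds`:

```go
// Serialize the EDS, then read it back and let ReadEDS repair/verify it against the DataRoot.
var buf bytes.Buffer
if err := WriteEDS(ctx, eds, &buf); err != nil {
	t.Fatal(err)
}

got, err := ReadEDS(ctx, &buf, root)
if err != nil {
	t.Fatal(err)
}
// got is a full, valid EDS reconstructed from the first quadrant only.
_ = got
```
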

### share.EDSStore

FNs/BNs keep an EDSStore to manage every EDS on the disk. The EDSStore type is introduced in the share pkg. Each EDS together with its Merkle Proofs serializes into a CARv1 file. All the serialized CARv1 file blobs are mounted on DAGStore via Local FS Mounts and registered as Shards.

The introduced EDSStore also maintains (via DAGStore) a top-level index enabling granular and efficient random access to every share and/or Merkle proof over every registered CARv1 file. The EDSStore provides a custom Blockstore interface implementation to achieve access. The main use-case is randomized sampling over the whole chain of EDS block data and getting data by namespace.

```go
type EDSStore struct {
	basepath string
	dgstr    dagstore.DAGStore
	topIdx   index.Inverted
	carIdx   index.FullIndexRepo
	mounts   *mount.Registry
	...
}
```

```go
// NewEDSStore constructs an EDSStore over an OS directory to store CARv1 files of EDSes and indices for them.
// The Datastore is used to keep the inverted/top-level index.
func NewEDSStore(basepath string, ds datastore.Batching) *EDSStore {
	topIdx := index.NewInverted(ds)
	carIdx := index.NewFSRepo(basepath + "/index")
	mounts := mount.NewRegistry()
	err := mounts.Register("fs", &mount.FSMount{FS: os.DirFS(basepath + "/eds/")}) // registration is a must
	if err != nil {
		panic(err)
	}

	return &EDSStore{
		basepath: basepath,
		dgstr:    dagstore.New(dagstore.Config{...}),
		topIdx:   topIdx,
		carIdx:   carIdx,
		mounts:   mounts,
	}
}
```

NOTE: EDSStore should have lifecycle methods (Start/Stop).
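
A minimal sketch of what those lifecycle methods could look like, assuming they simply delegate to the underlying DAGStore (which exposes Start and Close):

```go
// Start launches the underlying DAGStore (and any recovery/indexing routines it manages).
func (s *EDSStore) Start(ctx context.Context) error {
	return s.dgstr.Start(ctx)
}

// Stop gracefully shuts the DAGStore down, flushing indices to disk.
func (s *EDSStore) Stop() error {
	return s.dgstr.Close()
}
```
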

#### share.EDSStore.Put

To write an entire EDS, the Put method is introduced. Internally, it:

  • Opens a file under Store.Path/DataRoot.Hash path
  • Serializes the EDS into the file via share.WriteEDS
  • Wraps it with DAGStore's FileMount
  • Converts DataRoot's hash into the shard.Key
  • Registers the Mount as a Shard on the DAGStore

NOTE: Registering on the DAGStore populates the top-level index with shares/proofs accessible from stored EDS, which is out of the scope of the document.

```go
// Put stores the given data square with the DataRoot's hash as the key.
//
// The square is verified on the Exchange level, and Put only stores the square, trusting it.
// The resulting file stores all the shares and NMT Merkle Proofs of the EDS.
// Additionally, the file gets indexed s.t. Store.Blockstore can access them.
func (s *Store) Put(context.Context, DataRoot, *rsmt2d.ExtendedDataSquare) error
```
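
For illustration, a hedged sketch of Put following the steps above. The dagstore calls (RegisterShard, FileMount, shard.KeyFromString) mirror the upstream dagstore API, while the on-disk path layout and the DataRoot's string form are assumptions:

```go
func (s *EDSStore) Put(ctx context.Context, root DataRoot, eds *rsmt2d.ExtendedDataSquare) error {
	path := filepath.Join(s.basepath, "eds", root.String())
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Serialize all shares and NMT proofs of the EDS into the CARv1 file.
	if err := WriteEDS(ctx, eds, f); err != nil {
		return err
	}

	// Register the file as a Shard; registration populates the top-level index as a side effect.
	res := make(chan dagstore.ShardResult, 1)
	err = s.dgstr.RegisterShard(ctx, shard.KeyFromString(root.String()), &mount.FileMount{Path: path}, res, dagstore.RegisterOpts{})
	if err != nil {
		return err
	}
	return (<-res).Error
}
```
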
#### share.EDSStore.GetCAR

To read an EDS as a byte stream, the GetCAR method is introduced. Internally, it:

  • Converts DataRoot's hash into the shard.Key
  • Acquires ShardAccessor and returns io.ReadCloser from it

NOTES:

  • DAGStore's ShardAccessor has to be extended to return an io.ReadCloser. Currently, it only returns a Blockstore of the CAR
  • The returned io.Reader represents the actual EDS exchanged over the wire and should be limited to return only the first quadrant

```go
// GetCAR takes a DataRoot and returns a buffered reader to the respective EDS serialized as a CARv1 file.
//
// The Reader strictly reads the first quadrant (1/4) of the EDS, omitting all the NMT Merkle proofs.
// The integrity of the stored data is not verified.
//
// The caller must Close the returned reader after reading.
func (s *Store) GetCAR(context.Context, DataRoot) (io.ReadCloser, error)
```
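
A short usage sketch of the intended happy-path serving: stream the stored CAR directly to a requesting peer. `stream` stands in for any io.Writer, e.g. a libp2p network stream, and is an assumption:

```go
// Serve a block-sync request by streaming the first quadrant of the EDS as CARv1 bytes.
r, err := store.GetCAR(ctx, root)
if err != nil {
	return err
}
defer r.Close()

// The receiving side repairs the full EDS from this quadrant via share.ReadEDS.
_, err = io.Copy(stream, r)
return err
```
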
#### share.EDSStore.Blockstore

The Blockstore method returns a Blockstore interface implementation instance, providing random access to every share and NMT Merkle proof in every stored EDS. It is required for FNs/BNs to serve DAS requests over Bitswap and for reading data by namespace.

There is a frozen/un-merged implementation of Blockstore over DAGStore and CARv2 indexes.

NOTES:

  • We can either use it (and finish it if something is missing for our case) or implement a custom one optimized for our needs.
  • The Blockstore does not store whole Celestia Blocks, but IPFS blocks. We represent Merkle proofs and shares in IPFS blocks.

```go
// Blockstore returns an IPFS Blockstore providing access to individual shares/nodes of all EDS
// registered on the Store. NOTE: The Blockstore does not store whole Celestia Blocks but IPFS blocks.
// We represent `shares` and NMT Merkle proofs as IPFS blocks and IPLD nodes so Bitswap can access those.
func (s *Store) Blockstore() blockstore.Blockstore
```
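
A hedged wiring sketch showing the intended consumer: the node's existing Bitswap stack reads IPFS blocks (shares/proofs) through this Blockstore. `bitswapExchange` stands for the node's current Bitswap exchange instance, and blockservice.New is the go-blockservice constructor:

```go
// Serve DAS and namespace requests over Bitswap from the EDS CAR files instead of the global KVStore.
bstore := store.Blockstore()
bsrv := blockservice.New(bstore, bitswapExchange)
_ = bsrv // plugged into the existing share/ipld getters
```
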
#### share.EDSStore.Get

To read an entire EDS, the Get method is introduced. Internally, it:

  • Gets a serialized EDS io.Reader via Store.GetCAR
  • Deserializes the EDS and validates it via share.ReadEDS

NOTE: It's unnecessary, but an API ergonomics/symmetry nice-to-have.

```go
// Get reads an EDS out of the Store by the given DataRoot.
//
// It reads only one quadrant (1/4) of the EDS and verifies the integrity of the stored data by recomputing the rest.
func (s *Store) Get(context.Context, DataRoot) (*rsmt2d.ExtendedDataSquare, error)
```
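
A minimal sketch of Get, which is just the composition of the two primitives introduced above:

```go
func (s *EDSStore) Get(ctx context.Context, root DataRoot) (*rsmt2d.ExtendedDataSquare, error) {
	r, err := s.GetCAR(ctx, root)
	if err != nil {
		return nil, err
	}
	defer r.Close()

	// ReadEDS repairs the remaining quadrants and verifies the result against root.
	return ReadEDS(ctx, r, root)
}
```
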
#### share.EDSStore.Has

To check if the EDSStore keeps a given EDS, the Has method is introduced.

NOTE: It's unnecessary, but an API ergonomics/symmetry nice-to-have.

```go
// Has checks if EDS exists by the given DataRoot.
func (s *Store) Has(context.Context, DataRoot) (bool, error)
```
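
The document does not prescribe Has's internals; a hedged sketch, assuming the lookup goes through DAGStore's shard metadata (GetShardInfo and ErrShardUnknown are taken from the upstream dagstore package):

```go
func (s *EDSStore) Has(ctx context.Context, root DataRoot) (bool, error) {
	_, err := s.dgstr.GetShardInfo(shard.KeyFromString(root.String()))
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, dagstore.ErrShardUnknown):
		return false, nil
	default:
		return false, err
	}
}
```
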
#### share.EDSStore.Remove

To remove a stored EDS, the Remove method is introduced. Internally, it:

  • Converts DataRoot's hash into the shard.Key
  • Destroys Shard via DAGStore
    • Internally removes its Mount as well
  • Removes CARv1 file from disk under Store.Path/DataRoot.Hash path

NOTES:

  • It's unnecessary, but an API ergonomics/symmetry nice-to-have
  • GC logic on the DAGStore has to be investigated so that Removing is correctly implemented

```go
// Remove removes EDS from Store by the given DataRoot and cleans up all the indexing.
func (s *Store) Remove(context.Context, DataRoot) error
```
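
A hedged sketch of Remove following the steps above. DestroyShard is the upstream dagstore call; how it interacts with the DAGStore GC still needs the investigation mentioned in the note, and the file-path layout is an assumption:

```go
func (s *EDSStore) Remove(ctx context.Context, root DataRoot) error {
	res := make(chan dagstore.ShardResult, 1)
	err := s.dgstr.DestroyShard(ctx, shard.KeyFromString(root.String()), res, dagstore.DestroyOpts{})
	if err != nil {
		return err
	}
	if r := <-res; r.Error != nil {
		return r.Error
	}

	// Drop the underlying CARv1 file from disk as well.
	return os.Remove(filepath.Join(s.basepath, "eds", root.String()))
}
```
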

### Reading Data By Namespace

Generally stays unchanged with minor edits:

  • share/ipld.GetByNamespace is kept to load data from disk only and not from the network anymore
  • share/ipld.GetByNamespace is extended to return NMT Merkle proofs
    • Similar to share/ipld.GetProofsForShares
    • Ensure Merkle proofs are not duplicated!

As an extension, share/ipld.GetByNamespace can be modified to share.CARByNamespace, returning CARv1 Reader with encoded shares and NMT Merkle Proofs.
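
A possible shape for that extension; the exact signature is an assumption (namespace.ID comes from the NMT package):

```go
// CARByNamespace streams all shares of the given namespace from the EDS identified by root,
// together with the NMT Merkle proofs required to verify them, encoded as a CARv1 file.
func CARByNamespace(ctx context.Context, store *EDSStore, root DataRoot, nID namespace.ID) (io.ReadCloser, error)
```
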

### EDS Deduplication

Addressing an EDS by DataRoot allows us to deduplicate equal EDSes. EDS equality is very unlikely to happen in practice, besides the empty Block case, which always produces the same EDS.

### Empty Block/EDS

The empty block is a valid and small EDS, which can appear in the early stages of the network. Its body is constant, and to avoid transferring it over the wire, the EDSStore should be pre-initialized with the empty EDS value.
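
A minimal sketch of that pre-initialization on node start-up; EmptyDataRoot and EmptyExtendedDataSquare are hypothetical helpers producing the constant empty-block values:

```go
// ensureEmptyEDS seeds the store with the constant empty EDS so it never needs to be synced.
func ensureEmptyEDS(ctx context.Context, store *EDSStore) error {
	has, err := store.Has(ctx, EmptyDataRoot)
	if err != nil || has {
		return err
	}
	return store.Put(ctx, EmptyDataRoot, EmptyExtendedDataSquare())
}
```
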

### EDSStore Directory Path

On construction, the EDSStore expects a directory to store CAR files and indices in. The path should be derived from node.Store.Path.

## Alternative Approaches

  • Extended blocks as a set of share blobs and Merkle proofs in global Store (current approach with KVStore)
  • Extended block as a single blob only (computing Merkle proofs)
  • Extended block as a single blob and Merkle proofs
  • Extended block as a set of DAG/CAR blobs
  • Extended block as a single DAG/CAR blob

## Considerations

  • EDS to/from CAR conversion performance. The current sync design assumes two conversions: from CAR to EDS on the protocol layer when receiving, and back to CAR when storing the EDS. Rsmt2d allocates on most operations over individual shares, and for larger blocks these allocations put significant pressure on the GC during sync. One way to substantially alleviate this is to integrate a bytes buffer pool into rsmt2d.

  • Disk usage increases from the top-level index. This is a temporary solution; the index will eventually have to be removed. LNs know which block they sample and can provide the DataRoot's hash together with the sample request over Bitswap, removing the need for the hash-to-EDS-file mapping. This requires us to either facilitate the implementation of Bitswap's auth extension or propose a custom Bitswap message extension. Subsequently, the Blockstore implementation provided via EDSStore would have to be changed to expect the DataRoot's hash to be passed through the context.Context.