From 40c297e5bae03234d0d8ada4dbcf672b1e05e05d Mon Sep 17 00:00:00 2001
From: David Dias
Date: Sat, 2 Jan 2016 19:59:29 +0100
Subject: [PATCH 1/3] initial commit for data importing spec

---
 data-importing/README.md           |  80 +++++++++++++++++++++++++++++
 data-importing/graphs/arch.monopic | Bin 0 -> 1598 bytes
 data-importing/graphs/arch.txt     |   8 +++
 3 files changed, 88 insertions(+)
 create mode 100644 data-importing/README.md
 create mode 100644 data-importing/graphs/arch.monopic
 create mode 100644 data-importing/graphs/arch.txt

diff --git a/data-importing/README.md b/data-importing/README.md
new file mode 100644
index 00000000..65315c8c
--- /dev/null
+++ b/data-importing/README.md
@@ -0,0 +1,80 @@
+RFC - IPFS Data Importing
+=========================
+
+Authors:
+
+Reviewers:
+
+
+> tl;dr; This document presents how data is chunked and represented inside the IPFS network.
+
+* * *
+
+# Abstract
+
+IPFS Data Importing spec describes the several importing mechanisms used by IPFS that can also be reused by other systems. An importing mechanism is composed of one or more chunkers and data format layouts.
+
+# Status of this spec
+
+> **This spec is a Work In Progress (WIP).**
+
+# Organization of this document
+
+This RFC is organized by chapters described on the *Table of contents* section.
+
+# Table of contents
+
+- [%N%. Introduction]()
+- [%N%. Requirements]()
+- [%N%. Architecture]()
+- [%N%. Interfaces]()
+- [%N%. Implementations]()
+- [%N%. References]()
+
+# Introduction
+
+### Goals
+
+- Have a set of primitives to digest, chunk and parse files, so that different chunkers can be replaced/added without any trouble.
+
+# Requirements
+
+# Architecture
+
+```bash
+              ┌───────────┐        ┌──────────┐
+┌──────┐      │           │        │          │        ┌────────────────┐
+│ DATA │━━━━━▶│  chunker  │━━━━━━━▶│  layout  │━━━━━━━▶│ DATA formatted │
+└──────┘      │           │        │          │        └────────────────┘
+              └───────────┘        └──────────┘
+             ▲                                 ▲
+             └─────────────────────────────────┘
+                          Importer
+```
+
+- `chunkers or splitters`: algorithms that read a stream and produce a series of chunks. For our purposes, they should be deterministic on the stream. Divided into:
+  - `universal chunkers`, which work on any stream given to them (e.g. size, rabin, etc.). They should work roughly equally well across inputs.
+  - `specific chunkers`, which work on specific types of files (tar splitter, mp4 splitter, etc.). Special purpose, but super useful for big files and special types of data.
+- `layouts or topologies`: graph topologies (e.g. balanced vs trickledag vs ext4, etc.)
+- `importer`: a process that reads in some data (single file, set of files, archive, db, etc.) and outputs a DAG. It may use many chunkers and many layouts.
+
+# Interfaces
+
+#### chunker (splitters)
+
+#### layout (topologies)
+
+#### importer
+
+# Implementations
+
+#### chunker
+
+- go-chunk https://github.com/jbenet/go-chunk
+
+#### layout
+
+#### importer
+
+# References
+
diff --git a/data-importing/graphs/arch.monopic b/data-importing/graphs/arch.monopic
new file mode 100644
index 0000000000000000000000000000000000000000..f4185c9637730330eeca0cc5f2b85b141e2e827a
GIT binary patch
literal 1598
zcmV-E2EqCNO;1iwP)S1pABzY8000000t4k*OLL<}5dJG$oN*QN;^A8=xuhzmRBnlj
zLSiHlBX9ukuH({w&%7D|0P<-ACF>fAQfBR4a%qRvtdU)8%L
zEx&1snq1N}$$FQFr7)+r>&om}3FeYw`-%k0J{D=*
zJUs}$u~0#?6W?ri?ZVC^UP$qdYDu=vttURk+Bg$mW3N?Ax!KfQZs@t51LD=$zhV<7
zbR~cF<4VjGD+Sv^ra$l6we2>B1F*_~W=}1ZQA^*_=F38CK9pT^v@WYcTA*WlwX_EP
zhhl13W!p?>(z#*o4WjMhm-t!L9VZ*;&4`)F{pzfFQV#G?ieXk3ppK*Y)qui)!N7+>
z;M}#Fr%Am#ZujgwC<8sJcPZY7p@#CF+T`+9?bPrqJJI-~C
z5~YuQ`IYOK^R#hdo@PzQ;$4ERDI!>2f8$&Wrv3!YDJ!0a*T9_)Mr%?#c~uZ=()Iz>A3T
zq8(4txG0-M>ywP-l^6V==38l2@pbyrE+diO3BG;u(AkkltB>atr3kGSG7(g#SsGaE
zu@_3NP7>cRp&QRZ-q{t!`yW;mgsxms-mZ=y;<7g>B)@Bo=Q^;?t0c(pRubagEU6>?
zJaa}NQC3$23P_ITEcHvuD^buIIZBRx$RR6O_M;WEid
z-!xVd{;nLGxYC)adNcL*)jLq{P`xAZzB?G+h-b}Mm@E?e7xCYp#Pf&xJQWoyVTJ-9
z1;h4Sg{koGBo#v+s|{Y-QlSYQ_TB7`JaCophg=^QA<1GUwI%a@-~f9;Pw&wLOG0=p
zwpm8egjAHykP3}SMd%_=$YtRd&y&o`CJ-9&e7Qyd;OJtF`=D{U$kR`RIQp^%>NPp;sNUub+o5F*4-F@Qd<8+_+9d)0CcDqkZU%CCnkmv{`Is$cs
zK+-1UbF`m?U9tUSQ=m!J;6gCKME$tGxB(r?-suP#BThXsxglK1~tK4
zL_#BBun9m@fPM}Xc>M_&7E&kjWPWUOxPhYmsyVEYt+^(lfT<}zC&kyK9=#{H11$Bf
zfd^$D2Y3W7SqeOE&}$(N8g-EeeQo4HA-pKW7lpv15NS009LR%tS41A+hr2+P=OJ|N
z1u6#r!oI}|)WI%u&M|Ok{{j^Uvo284@LSqHBd-+o!`$G#(Y{di#rW_0{^|2e14qIyjR

Date: Mon, 4 Jan 2016 15:57:02 +0100
Subject: [PATCH 2/3] add intro and requirements

---
 data-importing/README.md | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/data-importing/README.md b/data-importing/README.md
index 65315c8c..d04dc6f9 100644
--- a/data-importing/README.md
+++ b/data-importing/README.md
@@ -33,12 +33,42 @@ This RFC is organized by chapters described on the *Table of contents* section.
 
 # Introduction
 
+Importing data into IPFS can be done in a variety of ways. These are use-case specific, produce different data structures, produce different graph topologies, and so on. These are not strictly needed in an IPFS implementation, but definitely make it more useful.
+
+These data importing primitives are really just tools on top of IPLD, meaning that these can be generic and separate from IPFS itself.
+
+Essentially, data importing is divided into two parts:
+
+- Layouts - The graph topologies in which data is structured and represented. These can include:
+  - balanced graphs, simpler to implement
+  - trickledag, a custom graph optimized for seeking
+  - live stream
+  - database indices
+  - and so on
+- Splitters - The chunking algorithms applied to each file. These can be:
+  - fixed size chunking (also known as dumb chunking)
+  - rabin fingerprinting
+  - dedicated format chunking, which requires knowledge of the format and typically only works with certain types of files (e.g. video, audio, images, etc.)
+  - special data structure chunking; formats like tar, pdf, doc, and container and/or VM images fall into this category
+
 ### Goals
 
 - Have a set of primitives to digest, chunk and parse files, so that different chunkers can be replaced/added without any trouble.
 
 # Requirements
 
+This is a set of requirements (or guidelines) that a layout or a splitter is expected to fulfill:
+
+- a layout should expose an encoder/decoder-like API, that is, one able to convert data to its format and convert it back to the original format
+- a layout should contain a clear, unambiguous representation of the data that gets converted to its format
+- a layout can leverage one or more splitting strategies, applying the best strategy depending on the data format (dedicated format chunking)
+- a splitter can be:
+  - agnostic - chunks any data format in the same way
+  - dedicated - only able to chunk specific data formats
+- a splitter should also expose an encoder/decoder-like API
+- a splitter, once fed with data, should yield chunks to be added to a layout or another layout of itself
+- an importer is an aggregate of layouts and splitters
+
 # Architecture
 
 ```bash
@@ -60,9 +90,9 @@ This RFC is organized by chapters described on the *Table of contents* section.
 
 # Interfaces
 
-#### chunker (splitters)
+#### splitters
 
-#### layout (topologies)
+#### layout
 
 #### importer
 

From b210bcabf46866fcc5f56822e677129e1646ac14 Mon Sep 17 00:00:00 2001
From: David Dias
Date: Mon, 13 Feb 2017 08:25:24 -0800
Subject: [PATCH 3/3] rename to DEX (for now) and point to all of the discussions

---
 {data-importing => dex}/README.md           |  12 +++++++++++-
 {data-importing => dex}/graphs/arch.monopic | Bin
 {data-importing => dex}/graphs/arch.txt     |   0
 3 files changed, 11 insertions(+), 1 deletion(-)
 rename {data-importing => dex}/README.md (93%)
 rename {data-importing => dex}/graphs/arch.monopic (100%)
 rename {data-importing => dex}/graphs/arch.txt (100%)

diff --git a/data-importing/README.md b/dex/README.md
similarity index 93%
rename from data-importing/README.md
rename to dex/README.md
index d04dc6f9..88b03c76 100644
--- a/data-importing/README.md
+++ b/dex/README.md
@@ -1,8 +1,11 @@
-RFC - IPFS Data Importing
+RFC - DEX (name still under consideration)
 =========================
 
 Authors:
 
+- David Dias
+- Juan Benet
+
 Reviewers:
 
 
@@ -18,6 +21,13 @@ IPFS Data Importing spec describes the several importing mechanisms used by IPFS
 
 > **This spec is a Work In Progress (WIP).**
 
+Lots of discussions around this topic, some of them here:
+
+- https://github.com/ipfs/notes/issues/204
+- https://github.com/ipfs/notes/issues/216
+- https://github.com/ipfs/notes/issues/205
+- https://github.com/ipfs/notes/issues/144
+
 # Organization of this document
 
 This RFC is organized by chapters described on the *Table of contents* section.
diff --git a/data-importing/graphs/arch.monopic b/dex/graphs/arch.monopic
similarity index 100%
rename from data-importing/graphs/arch.monopic
rename to dex/graphs/arch.monopic
diff --git a/data-importing/graphs/arch.txt b/dex/graphs/arch.txt
similarity index 100%
rename from data-importing/graphs/arch.txt
rename to dex/graphs/arch.txt
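
Review note: the Interfaces sections in this series are still empty, so here is a rough sketch of how the chunker, layout and importer pieces described in the README could fit together. Everything in it is an assumption made for illustration: the names `fixed_size_splitter`, `Node`, `balanced_layout`, `decode` and `importer`, the `size` and `fanout` parameters, and the SHA-256 based `digest` are hypothetical and are not taken from go-chunk, IPLD, or any existing IPFS implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


def fixed_size_splitter(stream: Iterable[bytes], size: int = 4) -> Iterator[bytes]:
    """Hypothetical agnostic splitter: re-cuts any byte stream into `size`-byte chunks."""
    if size < 1:
        raise ValueError("size must be >= 1")
    buf = b""
    for piece in stream:
        buf += piece
        while len(buf) >= size:
            yield buf[:size]
            buf = buf[size:]
    if buf:
        # The last chunk may be shorter than `size`.
        yield buf


@dataclass
class Node:
    """Hypothetical DAG node: a leaf carries data, an inner node carries links."""
    data: bytes = b""
    links: List["Node"] = field(default_factory=list)

    def digest(self) -> str:
        # Illustrative content address: hash of own data plus child digests.
        h = hashlib.sha256(self.data)
        for child in self.links:
            h.update(bytes.fromhex(child.digest()))
        return h.hexdigest()


def balanced_layout(chunks: Iterable[bytes], fanout: int = 2) -> Node:
    """Hypothetical balanced layout: group nodes `fanout` at a time until one root remains."""
    if fanout < 2:
        raise ValueError("fanout must be >= 2")
    level = [Node(data=c) for c in chunks]
    if not level:
        return Node()
    while len(level) > 1:
        level = [Node(links=level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


def decode(node: Node) -> bytes:
    """Decoder side of the layout: walk the DAG left to right and rebuild the bytes."""
    if not node.links:
        return node.data
    return b"".join(decode(child) for child in node.links)


def importer(stream: Iterable[bytes], size: int = 4, fanout: int = 2) -> Node:
    """Importer = one splitter feeding one layout, as in the architecture diagram."""
    return balanced_layout(fixed_size_splitter(stream, size), fanout)


root = importer([b"hello ", b"wor", b"ld!"], size=4, fanout=2)
print(decode(root))
```

The sketch covers two of the listed requirements: the layout is encoder/decoder-like (`balanced_layout` and `decode` round-trip the data), and the splitter is deterministic on the stream, so the same bytes give the same root digest regardless of how the input was buffered. A real design would presumably swap in rabin or format-specific splitters behind the same generator shape.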