Spec Proposal #2
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
# Draft IPLD Unixfs Spec | ||
|
||
## Basic Structure | ||
|
||
A Unixfs is either a file or a directory. | ||
The top level IPLD object is a CBOR map with at least two fields, `type` and `data`, | ||
and possibly a few others such as a version string or a set of flags. | ||
The `type` field is either `file` or `dir`. | ||
I say better to define a CBOR tag for files and a tag for directories, and define a file as a tagged array and a dir as a tagged map. That makes it clear from the first atomic in the CBOR that you are parsing UnixFS.
@ehmry we can do that, but do we then need to register the tags?
Not really, I was thinking just picking two random uint64 tags.
||
|
||
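A minimal sketch of the basic structure check, assuming the CBOR has already been decoded into plain Python values (map to `dict`, array to `list`). The function name `check_top_level` and the `DATA_SHAPE` table are illustrative and not part of the draft; the expected shape of `data` is taken from the `file` and `dir` sections of this draft.

```python
from typing import Any

# Container expected in `data` for each top-level `type`
# (array for `file`, map for `dir`, as described in the draft).
DATA_SHAPE = {"file": list, "dir": dict}


def check_top_level(node: Any) -> str:
    """Return the node's `type` if it has the basic Unixfs structure."""
    if not isinstance(node, dict):
        raise ValueError("top-level object must be a map")
    for field in ("type", "data"):
        if field not in node:
            raise ValueError(f"missing required field: {field}")
    kind = node["type"]
    if not isinstance(kind, str) or kind not in DATA_SHAPE:
        raise ValueError(f"unknown type: {kind!r}")
    if not isinstance(node["data"], DATA_SHAPE[kind]):
        raise ValueError(f"`data` has the wrong shape for type {kind!r}")
    # Extra fields (a version string, flags, ...) are allowed and ignored here.
    return kind
```

Extra top-level fields are deliberately tolerated, since the draft allows for a version string or a set of flags without naming them.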
## IPLD `file` | ||
|
||
If an IPLD file is a leaf, its CID type is `raw` (0x55) and it has no structure. | ||
Otherwise its CID type is `dag-cbor` (0x71). | ||
The `type` field is set to `file` and the `data` field is a CBOR array. | ||
With the current Unless we bake in a way to shard this we'll be limited to files that are smaller than ~5GB.
Also, I'm a lot less concerned about files under 5GB than I am concerned with not being able to develop a smart chunker for fear that if I don't use the max blocks size the node will be too large. I'm starting to think about developing a chunker for javascript bundles that uses the sourcemap from the bundler to chunk it into blocks built from each file. This should greatly reduce the new blocks that need to be pushed when new bundles are created, but the number of chunks will easily be greater than 2,500.
The plan is to use a CHAMP instead of an array of links, like we do today for sharded directories. I have an implementation of a CHAMP (HAMT) in ipld here: https://github.com/ipfs/go-hamt-ipld
The JS implementation for unixfs sharding can be found at https://github.com/ipfs/js-ipfs-unixfs-engine/tree/master/src/hamt. It was built by @pgte a while ago.
Maybe I'm missing something, but I've only seen hash map trees used for named key values, not for an ordered array. If we're using this as a replacement for
This isn't sufficient for at least 3 use cases I can think of. It's fine if we just want to say that these use cases are out of scope for
@mikeal look at how the importers work and structure the graph of file parts. The answer is still 'don't include that many file parts in a single node'. Ipfs chunks and structures things into a recursive tree, not just a single level with a flat array of links.
Someone in Berlin mentioned that it was effectively an "array of arrays." One thing to consider with this design, range requests don't work without loading every part of the file from the beginning to the start of the range. There's no information about the size of the individual parts so the only way to know how to seek is to load them all in serial. Really not ideal, especially for media use cases because it makes seeking quite slow.
The current design has range information, seeking is efficient and only has to load the required nodes for that graph traversal. I assume we would do exactly the same in V2
I'd like to see our fix to this involves moving from
Small typo, 'a
||
Each element of the array is a CBOR map with the following fields: | ||
|
||
- `data`: link | ||
Any reason why this field isn't called
||
- `size`: cumulative size of `data` | ||
- `fsize`: (file size) cumulative size of the payload of `data` | ||
@kevina this seems backwards. The
To be clear
@kevina precisely. The logical bytes ( The node overhead is only useful for estimating how much local storage would be necessary to grab the entire DAG locally, but given the overhead is never more than ~10 bytes per node, it becomes ( from my PoV at least ) needless cruft. I've done a lot of tests over the last year with DAGs specifically excluding mention of what you refer to as
Note: This is not a CBOR array but a map the structure is I don't have a strong opinion on which sizes to include. @whyrusleeping thoughts?
I think
||
|
||
The `fsize` field is omitted if the link is `raw`, as it has the same value as `size`. | ||
|
||
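To illustrate how the per-entry sizes could support seeking, here is a rough sketch that finds which element of a `file` node's `data` array holds a given byte offset. It assumes that `fsize` is the total payload size of everything behind the link, and that entries for `raw` leaves omit `fsize` and fall back to `size`. The helper names and the string placeholders used for links are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Entry = Dict[str, Any]


def payload_size(entry: Entry) -> int:
    """Payload bytes behind an entry; `fsize` is omitted for raw leaves."""
    return entry.get("fsize", entry["size"])


def locate(entries: List[Entry], offset: int) -> Tuple[int, int]:
    """Return (index of the entry holding `offset`, offset inside that entry)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    start = 0  # payload offset at which the current entry begins
    for index, entry in enumerate(entries):
        length = payload_size(entry)
        if offset < start + length:
            return index, offset - start
        start += length
    raise ValueError("offset is past the end of the file")


# Hypothetical file node: two raw leaves around one dag-cbor subtree.
entries = [
    {"data": "link-a", "size": 100},                 # raw leaf
    {"data": "link-b", "size": 2100, "fsize": 2048}, # subtree with overhead
    {"data": "link-c", "size": 50},                  # raw leaf
]
index, inner = locate(entries, 150)  # falls inside the second entry
```

Only the entry that contains the offset needs to be loaded; if it is itself a `file` node, the same lookup would be repeated one level down with the inner offset.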
## IPLD `dir` | ||
|
||
An IPLD `dir` represents a directory. | ||
Its CID type is `dag-cbor` (0x71). | ||
The `type` field is set to `dir` and the `data` field is a CBOR map. | ||
Small typo, 'a
||
Each key of the map is a filename, stored as a CBOR text string encoded in UTF-8. | ||
Each value of the map is another CBOR map with the following standard fields: | ||
|
||
- `type` | ||
- `exe`: CBOR boolean: executable bit | ||
I think instead of just having executable here, we should do a full rwxrwxrwx unix permissions set (a uint32)
The problem is the full unix set of permissions does not have a lot of meaning on other operating systems. Even within unix systems it has limited meaning when stored in an archive. Others may have stronger opinions on this than me. In particular see #1 (comment) by @ehmry.
@whyrusleeping the full Of course this makes direct-query of type a bit harder, but then again every libc provides
I think there is no harm of having this additional data stored. Systems that have no meaning of those bits will skip them, systems that have will use them and allow for preservation. It is very similar with uid and gid. They have no meaning on some systems, they may have no meaning on different machine with same system (different uid/gid mappings) but they are crucial if I wanted to, for example in future, use IPFS for
I think that we should support the full range, but maybe not change the default behaviour? thinking about it a bit more, the 'readable' flag really doesn't make a lot of sense in this context. I can read anything that's in ipfs...
Storing the full One possible complication is how to handle the writable bit in For this version of the standard I feel rather strongly we should stick to just the executable bit as it was stated in the requirements (#1), or nothing at all (as @lgierth suggested we don't add additional metadata). The full
||
- `data`: normally a CBOR link, but can be other types depending on the value of the `type` field | ||
- `size`: cumulative size of `data` | ||
- `fsize`: (file size) cumulative size of the payload of `data` | ||
- `fname`: CBOR byte string: original filename if it differs from the key | ||
This seems unnecessary (though I admit I've missed a lot of the conversation from over in the other issue). Why would this differ from the map key?
Yes the discussion on #3 is rather long. It may differ because unix filenames are not required to be UTF-8.
@whyrusleeping two things are at play / in conflict:
Hrm, I don't have a lot of strong opinions here. I will defer to @lgierth @diasdavid and @Stebalien
Is this field going to include which character set should be used to interpret it somehow?
No. Just the raw byte string if it differs from the key (which is the filename) that is in utf-8. For display the key should be used.
||
|
||
And at least the following optional fields: | ||
|
||
- `ro`: CBOR boolean: read only | ||
- `mtime`: Modification time | ||
- `attr`: CBOR Map: Extended attributes | ||
|
||
Additional fields may be defined. All implementation specific or user | ||
defined fields should be stored under the `attr` field. | ||
|
||
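As a rough illustration of the entry layout above, the sketch below builds directory entries for regular files as plain Python dicts standing in for CBOR maps. The helper `make_file_entry`, the string placeholders for links, and the sample `attr` content are assumptions for illustration only; the omission rules (`type` omitted for regular files, `exe` present only when true, `fsize` omitted for `raw` leaves) follow the notes in this draft.

```python
from typing import Any, Dict, Optional


def make_file_entry(
    link: str,
    size: int,
    fsize: Optional[int] = None,
    exe: bool = False,
    fname: Optional[bytes] = None,
    attr: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a directory entry for a regular file (`type` is omitted)."""
    entry: Dict[str, Any] = {"data": link, "size": size}
    if fsize is not None:   # omitted when the link is a raw leaf
        entry["fsize"] = fsize
    if exe:                 # only present when true
        entry["exe"] = True
    if fname is not None:   # original byte-string name, if it differs from the key
        entry["fname"] = fname
    if attr:                # implementation or user defined fields live here
        entry["attr"] = attr
    return entry


# A hypothetical `dir` node with two regular files.
directory = {
    "type": "dir",
    "data": {
        "run.sh": make_file_entry("link-1", size=120, exe=True),
        "video.bin": make_file_entry(
            "link-2", size=5300, fsize=5120, attr={"uid": 1000}
        ),
    },
}
```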
### Directory Types | ||
|
||
The type field is limited to a set of well defined values: | ||
|
||
* _omitted_: regular file | ||
* `dir`: directory entry | ||
* `special`: special file type (fifo, device, etc). | ||
The `data` field is a CBOR Map with at least one field to describe the type. | ||
I don't think we should overload 'data', especially avoid making it have different types based on the value of a key in the parent level. That sort of parsing is hard to do efficiently
I like the simplicity of having the contents of the file entry always be in the same field, for most types it is an IPLD link, for symbolic links it is the target, for special file types in a CBOR map with the details of the special file. My thinking was the type would just be an interface and then cast the correct type once it is known. I can instead have the following fields:
I rather not provide special fields to describe the content of all the different types of Thoughts?
we're going to have to inspect fields to determine what to do with things anyways. Overloading things doesn't really save us much in my opinion.
@whyrusleeping I am having a hard time interpreting that comment, are you okay with my proposal (
More to the point, I don't want to enumerate the required fields in the first version of the spec. I like the
||
* `symlink`: symbolic link. The `data` field is the contents of the link. | ||
* `other`: link to other IPLD object, links followed for GC and related operations | ||
* `unknown`: link to unknown objects, links not followed | ||
|
||
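One way to read the type list above is as a dispatch on the entry's `type` field. The sketch below returns the links that a garbage collector would follow for a single directory entry. It assumes that regular files and `dir` entries carry a link in `data` that is followed, which the draft implies but does not state outright; `other` and `unknown` follow the wording above. The function name and the string placeholders for links are hypothetical.

```python
from typing import Any, Dict, List


def links_to_follow(entry: Dict[str, Any]) -> List[Any]:
    """Return the links in `data` that GC and related operations would follow."""
    kind = entry.get("type")  # omitted means regular file
    if kind in (None, "dir", "other"):
        # `data` is a link to a file, a directory, or another IPLD object.
        return [entry["data"]]
    if kind == "unknown":
        # `data` is a link, but links to unknown objects are not followed.
        return []
    if kind in ("symlink", "special"):
        # `data` holds the link contents or a descriptive map, not an IPLD link.
        return []
    raise ValueError(f"type is not one of the well defined values: {kind!r}")
```

Rejecting any other value reflects the statement that the `type` field is limited to a set of well defined values.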
### Extended Attributes | ||
|
||
The extended attributes set is not well defined and can be used for vendor extensions and POSIX attributes that don't make sense on non-unix systems. | ||
Stripping this field MUST NOT change the meaning of the directory entry. | ||
These attributes SHOULD be passed along but do not have to be understood. | ||
|
||
Possible entries: | ||
Would "Extended Attributes" be a good place to optionally store explicit media type for problematic data types, as noted in #11 ?
||
|
||
* `user`: unix user name | ||
* `uid`: unix numeric uid | ||
* `group`: unix group name | ||
* `gid`: unix numeric gid | ||
* `perm`: full unix permissions | ||
* extended posix attributes | ||
* windows specific attributes | ||
|
||
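The rule that stripping `attr` must not change the meaning of an entry can be sketched as follows, with a plain dict standing in for the CBOR map. The function name `strip_attr` and the sample values are illustrative only.

```python
from typing import Any, Dict


def strip_attr(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a directory entry without its extended attributes."""
    return {key: value for key, value in entry.items() if key != "attr"}


# Hypothetical entry carrying some of the possible attributes listed above.
entry = {
    "data": "link-1",
    "size": 120,
    "exe": True,
    "attr": {"user": "alice", "uid": 1000, "group": "staff", "gid": 50},
}
stripped = strip_attr(entry)
```

Everything needed to interpret the entry (`data`, `size`, `exe`) survives; a tool that does not understand the attributes can still pass the original entry along unchanged.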
### Notes | ||
|
||
* Not all standard fields need to be defined for all file types. | ||
|
||
* The `type` field is omitted for regular files. | ||
* The `exe` field is only present when true and only makes sense for regular files. | ||
* The `size` and `fsize` fields are only required when the type is a regular file and possibly a `dir`. | ||
For other types they may be defined if they have a meaningful value. | ||
* The `fsize` field is omitted for files that are leaves (i.e. `raw`) as it is the same value as `size`. | ||
|
||
* IPLD filenames must be valid UTF-8 strings with the following additional constraints: | ||
  (1) cannot contain the null (0x00) or `/` characters | ||
  (2) cannot be the strings: `.` or `..` | ||
  Other restrictions may be put in place. | ||
  If the original filename does not meet these requirements then an implementation MAY transform the filename | ||
  so that it is a valid IPLD filename, and store the original filename in the `fname` field. | ||
  When extracting a directory to the filesystem an implementation | ||
  MAY make use of `fname` to restore the original name. | ||
  Implementations SHOULD reject files with invalid names by default | ||
  and only translate filenames when a special flag is given. | ||
  When extracting, implementations SHOULD use the IPLD name and not `fname` unless a special flag is given. | ||
|
||
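The filename rules can be sketched as a validity check plus an import step that rejects bad names unless translation is requested. The draft does not specify how a name is transformed, so the replacement scheme below (undecodable bytes become U+FFFD, NUL and `/` become `_`, and `.` or `..` get a `_` prefix) is purely an assumption, as are the function names and the `translate` flag.

```python
from typing import Optional, Tuple


def is_valid_name(name: str) -> bool:
    """Check the constraints the draft places on IPLD filenames."""
    if name in (".", ".."):
        return False
    if "\x00" in name or "/" in name:
        return False
    try:
        name.encode("utf-8")  # rejects lone surrogates
    except UnicodeEncodeError:
        return False
    return True


def import_name(raw: bytes, translate: bool = False) -> Tuple[str, Optional[bytes]]:
    """Map an on-disk name to (map key, `fname` or None)."""
    try:
        name: Optional[str] = raw.decode("utf-8")
    except UnicodeDecodeError:
        name = None
    if name is not None and is_valid_name(name):
        return name, None  # key equals the original name, no `fname` needed
    if not translate:
        # Default behaviour: reject files with invalid names.
        raise ValueError(f"invalid filename: {raw!r}")
    # Hypothetical transformation; the draft leaves the scheme open.
    key = raw.decode("utf-8", errors="replace")
    key = key.replace("\x00", "_").replace("/", "_")
    if key in (".", ".."):
        key = "_" + key
    return key, raw  # keep the original bytes in `fname`
```

On extraction, the key would be used as the filename by default, with `fname` consulted only when the caller opts in.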
* To save space, fields of a directory may be assigned integer values. | ||
  Integers have the added benefit of conveying additional meaning based on their values; | ||
Small typo: 'there' vs 'their'.
||
for example, to distinguish between standard and optional fields. | ||
|
||
* The `type` field may also be assigned integer values. |
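The draft does not assign any integer values, so the following is only a sketch of how such a scheme might look. The numbering is entirely hypothetical: standard fields get codes below 16 and optional fields get codes from 16 upward, so a reader can tell the two apart from the key alone.

```python
from typing import Any, Dict

# Hypothetical numbering; the draft does not define these values.
OPTIONAL_BASE = 16
FIELD_CODES = {
    "type": 0, "exe": 1, "data": 2, "size": 3, "fsize": 4, "fname": 5,  # standard
    "ro": 16, "mtime": 17, "attr": 18,                                   # optional
}
FIELD_NAMES = {code: name for name, code in FIELD_CODES.items()}


def compact(entry: Dict[str, Any]) -> Dict[int, Any]:
    """Replace field names with their integer codes."""
    return {FIELD_CODES[name]: value for name, value in entry.items()}


def expand(entry: Dict[int, Any]) -> Dict[str, Any]:
    """Restore field names from integer codes."""
    return {FIELD_NAMES[code]: value for code, value in entry.items()}


def is_optional(code: int) -> bool:
    """In this sketch the value of the key itself marks a field as optional."""
    return code >= OPTIONAL_BASE
```

A round trip through `compact` and `expand` returns the original entry, so the integer form is only an encoding choice and carries the same information.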
Let's stop calling this IPLD Unixfs, the current Unixfs is already IPLD. This proposal is to:
One of the design goals of the new Unixfs is that it should be 100% interoperable with the old (a directory of Unixfs2 should be able to have a file of Unixfs1)
The name "IPLD Unixfs" was @whyrusleeping's idea and I just went along with it. We could call it "Unixfs V2", although I am not sure how much we want to stick with the unix filesystem structure as a model (I personally think we should move away from it and focus on the compartments that are important to a generic archive structure).
I'm fine not calling it ipld unixfs. @diasdavid is right
It's probably worth mentioning this in the spec.
Also, what about Unixfs1 directories?