Clarify whether merkledag links can be binary #1172

Closed
tv42 opened this issue Apr 30, 2015 · 20 comments

Comments

@tv42
Contributor

tv42 commented Apr 30, 2015

https://github.com/ipfs/go-ipfs/blob/cc5f6bb306430d146d3e5a3f7956073c10d0112b/merkledag/node.go#L45-L46

type Link struct {
    // utf string name. should be unique per object
    Name string // utf8

If they're UTF-8, they can't contain binary keys.

If binary is allowed, UI has to be careful to not print raw binary to terminal, web interface, etc, which it currently doesn't seem to be:

https://github.com/ipfs/go-ipfs/blob/cc5f6bb306430d146d3e5a3f7956073c10d0112b/core/commands/refs.go#L330

        s = strings.Replace(s, "<linkname>", linkname, -1)
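
For context: Go's `string` type holds arbitrary bytes, so the `// utf8` comment on `Name` is a convention rather than something the type enforces. A minimal sketch of checking a name before substituting it into output, using only the standard library (the `describeName` helper is hypothetical and not part of go-ipfs):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describeName is a hypothetical helper: it reports whether a link name
// is valid UTF-8 and returns a form that is safe to print either way.
func describeName(name string) (valid bool, printable string) {
	if utf8.ValidString(name) {
		return true, name
	}
	// %q escapes invalid bytes (for example "\xff") instead of emitting them raw.
	return false, fmt.Sprintf("%q", name)
}

func main() {
	for _, name := range []string{"readme.txt", "\xff\xfe"} {
		valid, shown := describeName(name)
		fmt.Println(valid, shown)
	}
}
```

This only covers encoding validity; a valid UTF-8 name can still carry control characters.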
@wking
Contributor

wking commented Apr 30, 2015 via email

@tv42
Contributor Author

tv42 commented Apr 30, 2015

For the record, I have no bias either way (both seem valuable, I can always shove my binary things in Data).

@wking
Contributor

wking commented Apr 30, 2015

On Thu, Apr 30, 2015 at 12:45:15PM -0700, Tv wrote:

For the record, I have no bias either way (both seem valuable, I can
always shove my binary things in Data).

Knowing that the keys are UTF-8 makes it easy to convert filenames,
etc. to the local terminal and filesystem encodings. I'm not sure
where binary keys would be useful, but you can always base-* encode
them. That works both ways though. We can just say that keys will be
byte-sorted (for #915) but leave the encoding ambiguous, and then have
directory nodes declare their link-key encoding in their Data. Or we
can require that directory nodes use UTF-8 keys, but leave other nodes
the freedom to do something else. Then things like:

ipfs cat QmSomeList/NH

will by default reference the base-58 encoded key NH (decimal 1234,
hex 0x4d2), but accessing a UTF-8 encoded key for a non-directory node
would be awkward (you'd need to use the base-58 version of the
UTF-8-encoded key). So in the absence of a clear case for binary
keys, I'd recommend we stick to text we can type on the command line
;).
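
The arithmetic behind that example can be checked with a short sketch. It assumes the Bitcoin base-58 alphabet (the thread does not name one), and `encodeIndex` is a hypothetical helper, not an existing IPFS function. Under that assumption, decimal 1234 is 21*58 + 16, which encodes to `NH`.

```go
package main

import "fmt"

// Assumption: the Bitcoin base-58 alphabet, which omits 0, O, I and l.
const alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// encodeIndex is a hypothetical helper that renders a non-negative
// integer as a base-58 string, most significant digit first.
func encodeIndex(n uint64) string {
	if n == 0 {
		return string(alphabet[0])
	}
	var out []byte
	for n > 0 {
		// Prepend the digit for the current remainder.
		out = append([]byte{alphabet[n%58]}, out...)
		n /= 58
	}
	return string(out)
}

func main() {
	fmt.Println(encodeIndex(1234)) // prints "NH"
}
```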

Incidentally, how do we handle path-separators in keys? Do we
distinguish between QmSomeList/ab/cd as “the ‘cd’ key of
‘QmSomeList/ab’” and “the ‘ab/cd’ key of ‘QmSomeList’”? In my
UTF-8-key world, I'd ban forward slashes in link keys (and maybe
backslashes too, if you want to be nice to the Windows folks).
Supporting keys like that doesn't seem to be worth the cost of working
out an escape syntax for IPFS commands (although supporting them via
FUSE would be fine).
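
To make the ambiguity concrete, here is a minimal sketch of a resolver step that treats every forward slash as a separator. `splitPath` is a hypothetical name, not the go-ipfs path code. With this approach a single link named `ab/cd` can never be addressed, because the path is always read as `ab` followed by `cd`.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPath is a hypothetical resolver step: it breaks a path into the
// root hash and the link names to follow, treating every "/" as a separator.
func splitPath(p string) (root string, names []string) {
	parts := strings.Split(p, "/")
	return parts[0], parts[1:]
}

func main() {
	root, names := splitPath("QmSomeList/ab/cd")
	fmt.Println(root, len(names), names)
}
```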

@jbenet
Member

jbenet commented Apr 30, 2015

I think we should use utf-8 for link names.

also I think that we would like to be able to extend the link protobuf in sub data structures. Any good way of doing it? Casting the protobuf (decoding with an extended schema) based on the type of the object?

@tv42
Contributor Author

tv42 commented Apr 30, 2015

@wking There's more kinds of IPFS objects than directories. E.g. large files have Links with Name=="", try feeding that to your cat.

@jbenet The canonical way to extend proto3, decentralized, is Any. It uses URL-like strings as identifiers: https://developers.google.com/protocol-buffers/docs/proto3#any

@wking
Contributor

wking commented Apr 30, 2015

On Thu, Apr 30, 2015 at 02:45:33PM -0700, Tv wrote:

@wking There's more kinds of IPFS objects than
directories. E.g. large files have Links with Name=="", try feeding
that to your cat.

I'd number those (in base-58) instead of having non-unique link names.

@tv42
Contributor Author

tv42 commented Apr 30, 2015

There's a spec somewhere about those identifiers of Any. They're supposed to be usable to fetch the protobuf description (binary), or 404 (but definitely not give human-friendly pages). They should match the names of messages. I don't have a link for that handy.

@tv42
Contributor Author

tv42 commented Apr 30, 2015

@wking Too late and base58 is so slow it's silly. Besides, they have an index position in Links already.

@wking
Contributor

wking commented Apr 30, 2015

On Thu, Apr 30, 2015 at 02:52:48PM -0700, Tv wrote:

@wking Too late…

With enough motivation we can always migrate ;).

… and base58 is so slow it's silly.

True. But you'll only care about encoding to base-58 when you add a
link, and you'll only care about decoding from base-58 when the user
asks for a chunk directly. Those seem rare enough to me to not be
worth the confusion of handling unknown encodings in directory
listings and other activity that displays keys that are intended for
human consumption.

Besides, they have an index position in Links already.

I'm fine having two object types for link maps (lookup by key) and
link arrays (lookup by index), since you can always have logic like:

  1. User asks for child ND of QmSomeHash.
  2. Does QmSomeHash have named links?
    a. Yes, return link keyed by “ND”
    b. No, return the NDth entry in the link array.

but that doesn't work if you can mix and match keyed and indexed
entries in one object's link field.
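
The lookup logic in that list could be sketched roughly as follows. `Link` and `Node` are simplified stand-ins for the merkledag types, `resolveChild` is a hypothetical name, and array indices are parsed as decimal here purely to keep the sketch short (the proposal above would use base-58).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Link and Node are simplified stand-ins for the merkledag types.
type Link struct {
	Name string
	Hash string
}

type Node struct {
	Links []Link
}

// hasNamedLinks reports whether any link carries a non-empty name.
func hasNamedLinks(n *Node) bool {
	for _, l := range n.Links {
		if l.Name != "" {
			return true
		}
	}
	return false
}

// resolveChild looks the key up by name for link maps and by position
// for link arrays. Assumption: indices are decimal in this sketch.
func resolveChild(n *Node, key string) (*Link, error) {
	if hasNamedLinks(n) {
		for i := range n.Links {
			if n.Links[i].Name == key {
				return &n.Links[i], nil
			}
		}
		return nil, errors.New("no link named " + strconv.Quote(key))
	}
	idx, err := strconv.Atoi(key)
	if err != nil || idx < 0 || idx >= len(n.Links) {
		return nil, errors.New("no link at index " + strconv.Quote(key))
	}
	return &n.Links[idx], nil
}

func main() {
	dir := &Node{Links: []Link{{Name: "a", Hash: "QmA"}, {Name: "b", Hash: "QmB"}}}
	file := &Node{Links: []Link{{Hash: "Qm0"}, {Hash: "Qm1"}}}
	l1, _ := resolveChild(dir, "b")
	l2, _ := resolveChild(file, "1")
	fmt.Println(l1.Hash, l2.Hash)
}
```

As the comment notes, the dispatch in `resolveChild` is only well defined when an object is entirely keyed or entirely indexed.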

@tv42
Contributor Author

tv42 commented Apr 30, 2015

@wking Please make that a separate issue, if you care strongly enough. In this one, I just want the language around Name clarified one way or the other.

Also, even if it's UTF-8, that doesn't stop it from having sequences to e.g. reprogram your VT-100. UIs should probably look for non-printables and quote, or something.
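
One way a UI could "look for non-printables and quote" is sketched below. `displayName` is a hypothetical helper, and Go-style quoting via `strconv.Quote` is just one possible escaping choice, not something the thread settled on.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode"
	"unicode/utf8"
)

// displayName is a hypothetical helper: it returns the name unchanged
// when it is valid UTF-8 and every rune is printable, and a Go-quoted,
// escaped form otherwise.
func displayName(name string) string {
	if !utf8.ValidString(name) {
		return strconv.Quote(name)
	}
	for _, r := range name {
		// IsPrint rejects control characters such as ESC (0x1b).
		if !unicode.IsPrint(r) {
			return strconv.Quote(name)
		}
	}
	return name
}

func main() {
	fmt.Println(displayName("notes.txt"))
	fmt.Println(displayName("\x1b[2Jgotcha")) // the escape sequence is shown, not executed
}
```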

@wking
Contributor

wking commented Apr 30, 2015

On Thu, Apr 30, 2015 at 03:44:39PM -0700, Tv wrote:

Also, even if it's UTF-8, that doesn't stop it from having sequences
to e.g. reprogram your VT-100. UIs should probably look for
non-printables and quote, or something.

I think the issue is “is it a big restriction to require path-like
names for link keys”. If you think it is, you just want binary keys,
and you want to shift all of the encoding information and printability
checks somewhere else. If you think it's not, you want UTF-8 keys,
restrictions around control characters, forward slashes, etc…. I
still haven't heard of a clear “use case $x would die horribly if we
had those restrictions” story, but that doesn't mean that it doesn't
exist (or that we won't find it further down the road). But I don't
think we want to split the restriction handling between the format
(only UTF-8 keys) and the UI (escape any control characters). Either
the format should have all our restrictions, or they should all get
pushed to the UI.

The safer route seems to be to have arbitrary binary keys in the base
object, and then a standard "text-key" object type that adds a
standard Data entry with the key encoding and character
whitelists/blacklists. Then directories etc. can base themselves on
the text-key object and get that handling, while whatever needs
non-textual keys can use the root object without those restrictions.

@wking
Contributor

wking commented Apr 30, 2015

On Thu, Apr 30, 2015 at 03:57:26PM -0700, W. Trevor King wrote:

The safer route seems to be to have arbitrary binary keys in the base
object, and then a standard "text-key" object type that adds a
standard Data entry with the key encoding and character
whitelists/blacklists. Then directories etc. can base themselves on
the text-key object and get that handling, while whatever needs
non-textual keys can use the root object without those restrictions.

Just to be clear, the typing information would be set in Data (or a
new object-type field) and be used to trigger the UI-side checks and
filtering. This approach would be transparent to all the internal
processing, which can just treat everything as having binary keys.

@tv42
Contributor Author

tv42 commented May 1, 2015

Control characters are plenty valid in path-like names, if you trust UNIX. So is non-UTF-8. The only illegal path segments, by definition of path as per UNIX, are ones with bytes 0x00 or '/'.
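
As a sketch of that rule, the check below rejects only the two bytes named above. `validSegment` is a hypothetical name; nothing like it is claimed to exist in go-ipfs.

```go
package main

import (
	"fmt"
	"strings"
)

// validSegment is a hypothetical check that applies only the two
// restrictions named above: no NUL byte and no forward slash.
func validSegment(name string) bool {
	return strings.IndexByte(name, 0x00) < 0 && strings.IndexByte(name, '/') < 0
}

func main() {
	// Control characters and non-UTF-8 bytes pass; NUL and "/" do not.
	for _, name := range []string{"\x1b[2J", "\xff", "ab/cd", "a\x00b"} {
		fmt.Printf("%q %v\n", name, validSegment(name))
	}
}
```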

If you want unixfs etc to enforce some other rules, please file an issue about that. I see no such code enforcing such a thing. (This may be an issue if you naively use Link.Name in JSON. Fun all around.)
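
The JSON concern can be illustrated with Go's standard `encoding/json`, which coerces string values to valid UTF-8 and replaces invalid bytes with the Unicode replacement rune. A minimal sketch (the `roundTrip` helper is hypothetical) showing that a non-UTF-8 name does not survive a round trip:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip is a hypothetical helper that marshals a name to JSON and
// decodes it again, returning what a JSON consumer would see.
func roundTrip(name string) (string, error) {
	b, err := json.Marshal(name)
	if err != nil {
		return "", err
	}
	var out string
	if err := json.Unmarshal(b, &out); err != nil {
		return "", err
	}
	return out, nil
}

func main() {
	got, err := roundTrip("\xff")
	// The original byte 0xff is gone; U+FFFD takes its place.
	fmt.Printf("%q %v %v\n", got, got == "\xff", err)
}
```

Two distinct non-UTF-8 names can therefore collapse to the same JSON string.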

@tv42
Contributor Author

tv42 commented May 1, 2015

The idea of "standard Data entry" falls on its face, too. Either there is more than unixfs, or there isn't. If there is, the core layer cannot dictate what goes in Data.

@whyrusleeping
Member

just to clarify, unixfs is just one of many formats that will be built upon the merkledag structure

@wking
Contributor

wking commented May 1, 2015

On Thu, Apr 30, 2015 at 06:14:03PM -0700, Jeromy Johnson wrote:

just to clarify, unixfs is just one of many formats that will be
built upon the merkledag structure

Sure, there are also plans for some version-control objects (commits,
tags, …). But they'd certainly work fine if you required UTF-8 keys
that don't have nulls or forward slashes. Can anyone give an
example of an object type that needs non-UTF-8 keys? I'm happy to
admit that they exist, but I'd feel better with an explicit example.

And I do think we need a field besides Links and Data for a type
identifier, but that looks like ipfs/ipfs#36.

@tv42
Contributor Author

tv42 commented May 1, 2015

@wking Here's a non-UTF-8 Name for you: mount ipns and run touch $(printf '\xff')

@wking
Contributor

wking commented May 1, 2015

On Thu, Apr 30, 2015 at 09:04:02PM -0700, Tv wrote:

@wking Here's a non-UTF-8 Name for you: mount ipns and run touch
$(printf '\xff')

I'm fine not supporting that use case. We're not looking for full
POSIX compliance anyway (or are we?) I've certainly never seen
someone use a path name like that in the wild.

@whyrusleeping
Member

@jbenet

also I think that we would like to be able to extend the link protobuf in sub data structures. Any good way of doing it? Casting the protobuf (decoding with an extended schema) based on the type of the object?

protobuf extensions were made for precisely this: https://developers.google.com/protocol-buffers/docs/proto#extensions

@jbenet
Member

jbenet commented May 1, 2015

@whyrusleeping

protobuf extensions were made for precisely this: https://developers.google.com/protocol-buffers/docs/proto#extensions

protobuf extensions were removed (almost completely) in proto3. replaced by the Any type.


@tv42

@jbenet The canonical way to extend proto3, decentralized, is Any. It uses URL-like strings as identifiers: https://developers.google.com/protocol-buffers/docs/proto3#any

Yeah, the Any type is sort-of linked-data-ish. It's a bit clunky with the wrapping, but AFAIK, it's possible to just decode a protobuf "twice" (not really twice) with different schemas:

message M1 {
  bytes foo = 1;
  // start all extensions at 127
}

message M2 {
  bytes foo = 1;
  bytes bar = 128;
}

which we could do with links.

It's a bit annoying with standard protobuf parsing, which always decodes everything (instead of wrapping a buffer with an object with accessor methods to only decode what you need).
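
A rough sketch of the "decode with different schemas" idea, using only the Go standard library and a hand-rolled reader for the protobuf wire format rather than a generated parser. The helper names are hypothetical, only length-delimited (`bytes`) fields are handled, and the field numbers 1 and 128 are taken from the `M1`/`M2` messages above. A reader that only knows field 1 skips field 128 without interpreting it.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendBytesField appends one length-delimited field (wire type 2)
// to buf: a varint tag, a varint length, then the raw bytes.
func appendBytesField(buf []byte, field uint64, val []byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], field<<3|2)
	buf = append(buf, tmp[:n]...)
	n = binary.PutUvarint(tmp[:], uint64(len(val)))
	buf = append(buf, tmp[:n]...)
	return append(buf, val...)
}

// readBytesFields scans msg and returns only the fields listed in want,
// skipping every other field without interpreting its payload.
// Simplification: only wire type 2 is handled in this sketch.
func readBytesFields(msg []byte, want ...uint64) (map[uint64][]byte, error) {
	wanted := make(map[uint64]bool, len(want))
	for _, f := range want {
		wanted[f] = true
	}
	out := make(map[uint64][]byte)
	for len(msg) > 0 {
		tag, n := binary.Uvarint(msg)
		if n <= 0 {
			return nil, errors.New("malformed tag")
		}
		msg = msg[n:]
		if tag&7 != 2 {
			return nil, errors.New("sketch only handles length-delimited fields")
		}
		size, n := binary.Uvarint(msg)
		if n <= 0 || size > uint64(len(msg)-n) {
			return nil, errors.New("malformed length")
		}
		msg = msg[n:]
		if field := tag >> 3; wanted[field] {
			out[field] = msg[:size]
		}
		msg = msg[size:]
	}
	return out, nil
}

func main() {
	// Encode a message shaped like M2: foo = 1, bar = 128.
	var wire []byte
	wire = appendBytesField(wire, 1, []byte("foo-bytes"))
	wire = appendBytesField(wire, 128, []byte("bar-bytes"))

	m1, _ := readBytesFields(wire, 1)      // M1 view: field 128 is skipped
	m2, _ := readBytesFields(wire, 1, 128) // M2 view: both fields
	fmt.Println(len(m1), len(m2))
}
```

The returned slices alias the input buffer, which is roughly the "wrap a buffer and decode only what you need" behaviour described above.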


@wking @tv42

On the original point of this issue:

Clarify whether merkledag links can be binary

we should stick to UTF-8 names. names are meant to be printed out for users in browsers and terminals. and there shouldn't be unescaped slashes in names, otherwise path resolution gets hard to reason about.

perhaps we should open an issue about enforcing this.


4 participants