Spec refining: specify valid merklepath segments and encoding #67
Comments
👏 So, if we agree on NFC, the next step is to figure out the list of bad byte sequences for keys? And then to agree on something like "parsing an ipld object with bad keys should error", right?
Note: While I agree that UTF8-NFC should be the canonical encoding, some formats, languages, and operating systems may force us to use different encodings. We just have to make sure to use UTF8-NFC when comparing.
In the unixfsv2 discussions, I actually proposed our implementation should probably take a |
IIRC, the consensus is to just not allow them in paths. Just like |
Mission
It's important to specify precisely what is a valid merklepath segment in IPLD.
The spec currently contains a "TODO: list path resolving restrictions" and this could be improved :)
Why
First, a quick clarification: "merklepath segments" are a distinct concept from "IPLD Selectors". Merklepaths are a specific and limited implementation of IPLD Selectors; they can only specify a traversal to a single object; and importantly, we want them to be serializable in a way that's easy for humans to operate. To quote the current spec for motivations:
(Perhaps "ASCII" is a little over-constrained there. The spec also says "IPLD paths MUST be universal and avoid oppressing non-english societies (e.g. use UTF-8, not ASCII)" -- we might want to refine those two lines after we tackle the rest of this issue.)
Second of all, just a list of other issues that are fairly closely related to a need for clarity on this subject:
.
#37 -- "Spec refining: make sure that an attribute cannot be named."
As this list makes evident... we really need to get this nailed down.
Mission, refined
Okay, motivations and intro done. What do we need to do?
(1) Update the spec to be consistently clear about IPLD keys versus valid merklepath segments. This distinction seems to exist already, but it's tricky, so we should hammer it.
(2) Define normal character encoding. (I think it's now well established that this is necessary -- merklepath segments are absolutely for direct human use, so we're certainly speaking of chars rather than bytes; and also Unicode is complex and ignoring normalization is not viable.)
(3) Define a blacklist of any further byte sequences which are valid normalized characters but which we nonetheless don't want to see in merklepath segments.
(4) Ensure we're clear about what happens when an IPLD key is valid as a key but not as a merklepath segment (e.g. it's unpathable).
(And one more quick note: a lot of this has been in discussion already as part of sussing out the unixfsv2 spec. In unixfsv2, we've come to the conclusion that some of our path handling rules are quantum-entangled with the IPLD spec for merklepaths. Unixfsv2 may blacklist more problematic byte sequences than IPLD merklepath segments do, so we don't have to worry about everything up here; but we do want to spec this first, so we can make sure the Unixfsv2 behavior normalizers are a nice subset of the IPLD merklepath rules.)
Progress
Regarding (1): "just a small matter of writing" once we nail the rest...
Regarding (2): We have an answer and the answer is "NFC". (At least, I think we have an answer with reasonable consensus.) We had a long thread about this in the context of unixfsv2, but it's entirely applicable here in general. Everyone seems to agree that UTF8 is a sensible encoding and that NFC is a sensible, already-well-specified normalization form to use. And importantly, in practice, NFC is the form seen in practically all documents everywhere, so choosing NFC means we accept the majority of strings unchanged. Whew. dusts off hands
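To make the NFC point concrete, here is a minimal sketch in Python of what "normalize to NFC before comparing" means for two segments that render identically. The helper names `nfc` and `same_segment` are made up for illustration and are not part of any IPLD implementation; only `unicodedata.normalize` is a real standard-library call.

```python
import unicodedata


def nfc(segment: str) -> str:
    """Return the NFC-normalized form of a path segment (hypothetical helper)."""
    return unicodedata.normalize("NFC", segment)


def same_segment(a: str, b: str) -> bool:
    """Compare two segments after NFC normalization (hypothetical helper)."""
    return nfc(a) == nfc(b)


composed = "caf\u00e9"     # "é" as one precomposed code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

print(composed == decomposed)              # False: the code points differ
print(same_segment(composed, decomposed))  # True: identical after NFC
print(nfc(decomposed).encode("utf-8"))     # b'caf\xc3\xa9'
```

The two inputs look the same to a human but are different strings until normalized, which is why comparing raw bytes is not enough for something meant for direct human use.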
Regarding (3): Lots of example choices in ipfs/kubo#1710. We need to reify that into a list in the spec.
Regarding (4): Open field?
(I'll update this "progress" section as discussion... progresses.)
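For (3) and (4), a rough sketch of how a "valid key, but unpathable" check could look. Everything specific here is an assumption for the sake of discussion: the contents of `FORBIDDEN_SEGMENTS` and `FORBIDDEN_CHARS` are placeholders (the real list still has to come out of ipfs/kubo#1710), the name `is_pathable` is invented, and rejecting non-NFC keys outright rather than normalizing them is only one of the possible answers.

```python
import unicodedata

# HYPOTHETICAL placeholders: the actual blacklist is not decided in this issue.
FORBIDDEN_SEGMENTS = {"", ".", ".."}
FORBIDDEN_CHARS = {"/", "\x00"}


def is_pathable(key: str) -> bool:
    """Return True if an IPLD key could also serve as a merklepath segment.

    Assumption: a key that is not already in NFC is treated as unpathable.
    """
    if unicodedata.normalize("NFC", key) != key:
        return False
    if key in FORBIDDEN_SEGMENTS:
        return False
    return not any(ch in FORBIDDEN_CHARS for ch in key)


for key in ["foo", "a/b", "..", "cafe\u0301", "caf\u00e9"]:
    print(repr(key), is_pathable(key))
```

Under these assumptions a key like "a/b" stays a perfectly valid map key; it just can't be reached through a merklepath, which is the case (4) asks us to spell out.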