This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

[WIP] Flexible Byte Layout #211

Merged
merged 25 commits into from
Jun 19, 2020

Conversation

mikeal
Contributor

@mikeal mikeal commented Oct 19, 2019

I got the new js-unixfsv2 working using ipld-schema-gen. It’s awesome, but now my brain is mush and I need a rest.

I wanted to get this posted before I crash though. The structure I landed on for the layout of the byte data is very interesting and general purpose.

I figured out a way, using recursive unions, to create large DAGs without encoding the layout algorithm into the data structure. This means that we can create a general read implementation for any layout we decide to implement in the future.

All you do is implement read() methods for 3 advanced layouts and those readers essentially call each other all the way through the DAG. So you can mix and match all the basic layout components on the write() side to your heart’s content. Since it’s a union, we also wouldn’t have trouble adding another one in the future, but the problem of “representing a linear list of bytes” isn’t so large that I anticipate needing another one.
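
The recursive read() composition described above can be sketched in JavaScript. This is a hypothetical illustration, not the actual js-unixfsv2 code: `readFBL` and `load` (an async link-to-decoded-node loader) are assumed names, and the node shapes follow the [length, link] tuple layout discussed later in this thread.

```javascript
// Hypothetical sketch of the generic read() dispatch described above.
// A node is either plain bytes (e.g. a raw block) or a list of
// [declaredLength, link] tuples; each reader defers to the next by recursing
// through links, so any mix of write-side layouts reads back uniformly.

const isBytes = (node) => node instanceof Uint8Array

// Yields the bytes of `node` starting at `offset`, up to `length` bytes.
async function * readFBL (node, load, offset = 0, length = Infinity) {
  if (isBytes(node)) {
    // Base case: a plain bytes node (e.g. a raw block)
    yield node.subarray(offset, Math.min(node.length, offset + length))
    return
  }
  // Recursive case: a nested byte list of [declaredLength, link] tuples
  for (const [partLength, link] of node) {
    if (offset >= partLength) { offset -= partLength; continue } // skip whole part
    yield * readFBL(await load(link), load, offset, length)
    length -= partLength - offset
    offset = 0
    if (length <= 0) return
  }
}
```

Because each reader only understands its own node shape and hands off through the union, the write side is free to mix and match layouts without the read side needing to know which algorithm produced them.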

@mikeal mikeal changed the title wip: Data [WIP] Data (Flexible Advanced Layout for Byte Data) Oct 19, 2019
@mikeal
Contributor Author

mikeal commented Oct 19, 2019

I was so burned out last night I forgot to reference the current implementation https://github.com/ipld/js-unixfsv2/blob/master/src/data.js

This is using a slightly different schema because rootType is not yet implemented in the JS schema parser, so my schema-gen library has a hack that links an advanced layout to a schema by using the same name. This will all be updated and brought in line with this schema once rootType is fully implemented.

Member

@vmx vmx left a comment


I only have higher-level nitpicks.

I'm not sure if "Data" is a good name. What about "Binary" or "Binary Data"?

Also please lowercase the filename to be consistent with the repo :)

@mikeal
Contributor Author

mikeal commented Oct 21, 2019

names of things

I struggled a lot with the naming, not just the union key names but also pluralization. I must have changed some of these names 2 or 3 times throughout the course of development.

As a result, I’m going to be a real pushover on this: I’m basically willing to change the naming convention to whatever the first person suggests, because I’m not happy with my existing choices here and am happy to defer to the first person I hear ;)

@mikeal
Contributor Author

mikeal commented Dec 5, 2019

We need to solidify this and get it finished.

@warpfork @rvagg any pressing concerns or final changes you’d like to see before it gets merged?

@warpfork
Contributor

warpfork commented Dec 5, 2019

I have some mild observations at most, but overall assume you've thought about this and it represents progress... therefore, go ahead :)

Observations:

  • The asymmetry between reading and writing -- the latter needing More Choosepower in order to actually be defined -- is... interesting. I continue to be vaguely wary of this. (I think eventually we're going to end up with a nasty word to describe "ADLs that require additional programmatic intervention to make writes be defined".) But I won't go quite as far as to say we shouldn't explore it! And I appreciate that it's clearly described.

  • Similarly I'm just vaguely surprised this is as complex as it is. There's... three...?... different advanced layouts described in the same file? Which of these would I expect to start reading at, if I was going to do something with this, or implement something to match it? Which of them would I choose when writing data as an application that has all three available as a library, and what heuristics would I use to make that choice as an application author? The docs at the bottom of the file are good, but I think more granular docs on some of these types would help enormously.

  • +1 to @rvagg 's comments about some of the field serial naming seeming to be different flavors of arbitrary.

  • I'm not going to get too fussy about file paths at the moment, because I don't want to open the whole can of worms in a blocking way on this PR... but longer run: I suspect we'll want to make some sort of a directory where we can make something almost like a brief registry of various Advanced Layouts that are known to us. And the ones authored by us as core people might eventually end up sharing that directory with at least a handful of especially reputable other contributed ones... so ideally, we should position things to look like just well-known library parts rather than blessed things sooner than later. But again, not a blocking comment, we don't have to do the lifting on sorting that out right now.

@mikeal
Contributor Author

mikeal commented Dec 5, 2019

Similarly I'm just vaguely surprised this is as complex as it is. There's... three...?... different advanced layouts described in the same file? Which of these would I expect to start reading at, if I was going to do something with this, or implement something to match it? Which of them would I choose when writing data as an application that has all three available as a library, and what heuristics would I use to make that choice as an application author? The docs at the bottom of the file are good, but I think more granular docs on some of these types would help enormously.

Theoretically, this could just be one Advanced Layout. The reason I made it 3 is so that the read() methods would be attached directly to relevant data, rather than just implemented as branching logic in one large Advanced Layout at the root.

It also makes the underlying data schemas and advanced layouts more usable independently. If I knew all my data had a low max chunking I could write a schema that just directly uses the ByteList. I couldn’t do that easily if all the read code was in a single place in the root Advanced Layout.

@mikeal
Contributor Author

mikeal commented Dec 9, 2019

Ok, pushed a bunch of changes that are now ready for review.

I’m also going to need to go back and work on the implementation a bit to make sure none of these changes cause any problems. The current type generation code doesn’t have tuple representation either, so I’m going to need to implement that now ;)

@mikeal
Contributor Author

mikeal commented Dec 9, 2019

Wow, ok, way simpler now. One thing led to another and I was able to remove a bunch of things.

@Stebalien
Contributor

If someone writes incorrect lengths, what do we expect the behavior to be in an implementation? Understanding what the failure mode is would help a bit here.

I'd like to declare exactly how/where we should return an error in the spec. If the part of a NestedByte doesn't match the given length when we resolve it, we should return an error.

@mikeal
Contributor Author

mikeal commented Apr 1, 2020

I'd like to declare exactly how/where we should return an error in the spec. If the part of a NestedByte doesn't match the given length when we resolve it, we should return an error.

Agree, but “when we resolve the part” is a bit too ambiguous. We “resolve” the entire graph when we replicate it, but replicators work on generic IPLD graphs, so they won’t be able to produce errors during replication: they won’t understand this data structure or when to apply this schema.

I think “when we read the part” is less ambiguous and doable, but as @ribasushi has mentioned before, this still leaves an odd situation: reading a slice of the data that occurs after a misalignment of the size will not cause an error, and will return the data slice based on the encoded size information rather than the raw part length. This is probably fine, but if we’re going to document it as part of the spec it should be this detailed, because we’re effectively saying that “when reading a slice of the data, the encoded size information is relied upon, and misalignments will only cause errors if the misaligned part is being read.”
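
One way this “error when we read the part” semantic could be sketched (all names here, `resolvePart` and `load`, are illustrative assumptions, not spec text): the declared length is checked only at the moment a part is actually resolved, so parts that are never read are never validated.

```javascript
// Hypothetical sketch: validate a part's declared length at resolve time.
// `load` is an assumed async link-to-decoded-node loader.

async function resolvePart (declaredLength, link, load) {
  const child = await load(link)
  const actual = child instanceof Uint8Array
    ? child.length                                 // bytes leaf: real byte count
    : child.reduce((sum, [len]) => sum + len, 0)   // list: sum of declared sizes
  if (actual !== declaredLength) {
    throw new Error(`misaligned part: declared ${declaredLength}, found ${actual}`)
  }
  return child
}
```

Note that this only compares declared sizes one level down; a deeper misalignment still goes undetected until the reader descends into the offending subtree, which is exactly the caveat being discussed.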

@Stebalien
Contributor

Fair enough. With respect to reading parts that are correctly aligned when some aren't, my mental model here is that the "bad" sections are like corrupt disk sectors. You don't get an error until you actually try to read them.

@ribasushi
Contributor

ribasushi commented Apr 9, 2020

mental model here is that the "bad" sections are like corrupt disk sectors. You don't get an error until you actually try to read them.

This is not an accurate model. A misalignment somewhere in the DAG creates a "fork". If one seeks directly to the past-misalignment part, in the current model there is no way to determine the DAG is iffy - everything comes back correctly.

Only by happening to read through a "DAG-arm" that contains the disagreeing numbers can you detect the problem. In other words - if you want to be sure, the current constructs require you to read and validate each individually sized unit from 0 to your desired offset.

See my next comment below

@Stebalien
Contributor

This is not an accurate model. A misalignment somewhere in the DAG creates a "fork".

By "fork", I assume you mean that reading sequentially yields different results than seeking? That's a bug in the current implementation, not something that has to be true. We can validate offsets as we read and return errors when we encounter misaligned sections.

@ribasushi
Contributor

While trying to explain/illustrate my concern I think I may have arrived at a better formalization and something approaching a solution to the issue I keep bringing up. I need to let this "bake" over the weekend, and will update the thread. For the time being assume I didn't make the above comment 🤔

@chafey
Contributor

chafey commented Jun 17, 2020

Is this design finalized? I have a need for something like this and am hoping to use this if it can be finalized.

@mikeal
Contributor Author

mikeal commented Jun 17, 2020

We’ve spent far too much time bike shedding on “naming things.”

I’ve pushed a few changes to naming and am closing the loop on bike shedding the names. I’ve also fixed the issue @riba found (reference name wasn’t migrated when we changed the name of the data structure).

I’m going to merge this in 24h unless there’s a strong objection. I’m also going to get the JS implementation in-line with the latest schema today.

@warpfork
Contributor

One tiny ask: can we add a line of prose that states clearly that FlexibleByteLayout is "the type we expect to encounter first" at the start of one of these documents? I think it is strongly implied that this is the case, but it would make a reader more confident to see it stated, I think.

Otherwise my green checkmark from ages ago still applies; this all looks reasonable to me.

@ribasushi
Contributor

This comment is made against this version of the proposed spec: https://github.com/ipld/specs/blob/1345e737ecdeff3ee182a7458c4ce0667a43770b/data-structures/data.md

We’ve spent far too much time bike shedding on “naming things.”

I posit that the "bikeshedding on naming" is a symptom of an underlying lack of shared context. There are two parts to that:

  1. Unlike most ADLs that are designed to be upgradeable/swappable, this particular ADL is very special as it encodes streams - the most fundamental building block of any other ADL. If we conjecture that one of the goals of IPLD is to foster "emergent convergence" within the merkle-world, this layout becomes part of the "uncoordinated convergence toolset", namely:
    1. near-universally-preferred byte-chunking set of parameters (in progress)
    2. near-universally-preferred intermediate "link" node representation (this spec)
    3. near-universally-preferred node-tree arrangement (in progress together with i., likely a two stage shrub+trickle arrangement)

If the answer to 1. above is "no not really, this is definitely not the final version for streams, this is something we are playing with" - then this is fine, and my only request is that the spec reflects that ( prescriptive/draft ? ). Otherwise we are sending the wrong signal to folks like @chafey, who clearly want to start using this as the blessed way to write data.

If the answer however is "yeah, we need to nail this down and try not to change this anymore", we get to:

  1. While the ADL spec in isolation seems straightforward, it is not clear how it references already-existing raw blocks, or whether it can at all (which in turn is mandated by the answer to 1 being "yeah"). This was raised in [WIP] Flexible Byte Layout #211 (comment) but it is not clear where it left things. Specifically, if this is a "finalish version" we need to answer (here or in a subsequent PR) the following:
    1. Is FlexibleByteLayout the actual entrypoint for a network-block (@warpfork's question [WIP] Flexible Byte Layout #211 (comment))? If yes - we may want to explicitly state that and/or move that definition first for clarity
    2. Is linking to raw ( 0x55 ) CIDs allowed at part &FlexibleByteLayout? I.e. does a raw node satisfy the FlexibleByteLayout.Bytes union, or does one always have to wrap a cbor around even the smallest set of bytes?

I think this covers everything that bothers me :)

@rvagg
Member

rvagg commented Jun 18, 2020

Is FlexibleByteLayout the actual entrypoint for a network-block (@warpfork's question #211 (comment))? If yes - we may want to explicitly state that and/or move that definition first for clarity

Yes, and this is something we've not got good answers to with schemas yet; we either need some kind of signal indicating top-level elements, or a convention. For now I suggest we adopt the convention of putting top-level elements at the top of a schema doc and ordering things in reverse dependency order (the opposite of what you'd do in a C program).

Is linking to raw ( 0x55 ) CIDs allowed at part &FlexibleByteLayout? I.e. does a raw node satisfy the FlexibleByteLayout.Bytes union, or does one always have to wrap a cbor around even the smallest set of bytes?

This is supported by the current iteration just fine. There's no tie between Schemas and CBOR; it's just a matter of shape and type. raw blocks would be matched by &Bytes, &Any or the kinded union that appears in FlexibleByteLayout. Having bytes as one of the kinds means a raw block is handled whether you hit it as the top-level element (i.e. a FlexibleByteLayout decoder is handed a CID<raw>) or via &FlexibleByteLayout. The bytes kind would also be matched if it hits a CBOR or DAG-CBOR block that contains only a byte string element, or a DAG-JSON block that contains only a {"/": { "bytes": String } }. And you could mix and match all of those types (and more that we may add in the future!) with this.

So to do my exercise again to look at the kinds of shapes this Schema might allow:

type FlexibleByteLayout union {
  | Bytes bytes
  | NestedByteList list
} representation kinded

type NestedByteList [ NestedByte ]

type NestedByte struct {
  length Int
  part &FlexibleByteLayout
} representation tuple

  1. single raw or other format containing only one top-level bytes
  2. [ [ length, CID<x> ], ... ] where x is either 1 or 2

And that's it I think.

Possible drawbacks:

  • It allows very deep nesting if you keep on linking to lists within lists, e.g.: [[ l, &[[ l, &[[ l, &[[ l, &[[ l, &[[ l, &[[ l, &[[ l, &[[ l, &[[ bytes ]] ]] ]] ]] ]] ]] ]] ]] ]] ]] (where & is a separate, linked block). I'm not sure there's anything we can do about that, nor should we probably try, since it allows arbitrary concatenation of byte arrays.
  • length is advisory only, but this continues to be a problem across our multi-byte data structures. I'd prefer the name to indicate it but will be satisfied with documentation about its advisory status.
  • The only way to fully-inline is to do it as a top-level bytes (raw or a single top-level bytes type). But that's probably not a big deal, I can't think of a case where you'd want to fully-inline but also need to retain explicit splitting and supporting that is going to complicate implementation. Perhaps if you wanted to do this, you could embed FlexibleByteLayout under a top-level container that also includes metadata about the splits, but that would be something for whoever needs this functionality to do.
  • You can't do partial inline, e.g. [ [ length, <bytes> ], [ length, CID ] ]. This might be useful if you wanted to form a stream while also minimising the number of blocks. I can imagine use-cases for this, but it's also something that could be added later if we decide it's a big deal (introducing a maybe-link between NestedByte and FlexibleByteLayout).
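
The deep-nesting drawback can at least be bounded on the write side: since writers are free to choose any layout, a builder can group entries with a fixed fanout so depth grows logarithmically with the number of chunks. A hypothetical sketch (`put`, an assumed node-to-CID block store, and `buildFBL` are illustrative names, not part of the spec):

```javascript
// Write-side sketch: build a FlexibleByteLayout tree from pre-chunked bytes
// with a bounded fanout, keeping nesting depth O(log n) instead of the
// worst-case linear chain described above.

async function buildFBL (chunks, put, fanout = 4) {
  // Leaves: [declaredLength, CID] pairs for each raw chunk
  let layer = []
  for (const chunk of chunks) layer.push([chunk.length, await put(chunk)])
  // Repeatedly group `fanout` entries into a nested byte list until one remains
  while (layer.length > 1) {
    const next = []
    for (let i = 0; i < layer.length; i += fanout) {
      const group = layer.slice(i, i + fanout)
      const total = group.reduce((sum, [len]) => sum + len, 0)
      next.push([total, await put(group)])
    }
    layer = next
  }
  return layer[0][1] // CID of the root node (a list, or a single raw leaf)
}
```

Note that each intermediate node's declared length is the sum of its children's declared lengths, which is exactly the advisory-only property discussed here: it can't be fully validated until all nested blocks are available.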

Member

@rvagg rvagg left a comment


I'm fine with this but would prefer two minor changes:

  1. Flip the ordering of the types, putting the top-level type at the top and the dependents in order below it.
  2. Add a note about length being advisory. Something like: length is a cumulative value across potentially nested multi-block chunks; therefore it should be considered advisory-only by applications using FlexibleByteLayout as it cannot be fully validated until all nested blocks are available.

@mikeal
Contributor Author

mikeal commented Jun 18, 2020

One tiny ask: can we add a line of prose that states clearly that FlexibleByteLayout is "the type we expect to encounter first" at the start of one of these documents? I think it is strongly implied that this is the case, but it would make a reader more confident to see it stated, I think.

How would we go about enforcing this?

The schema can be applied to any node. Whether or not the top level type is a discrete block is really a decision that libraries are going to have to make when they decide how they want to support it. It’s part of the implementation logic, which isn’t defined in the spec or the schema.

I don’t see a clear problem with applying this schema to a node that is within block rather than the root of the block, but maybe I’m missing something.

@mikeal
Contributor Author

mikeal commented Jun 18, 2020

@rvagg pushed fixes for your comments.

Contributor

@ribasushi ribasushi left a comment


Present state looks good, we can fix nits as they arise once more alignment takes place 🎉

@mikeal
Contributor Author

mikeal commented Jun 18, 2020

While the ADL spec in isolation seems straightforward, it is not clear how it references already-existing raw blocks, or whether it can at all

You can use existing raw blocks. In fact, what you can’t do is embed binary values in the fully nested structure: you actually MUST use links. This isn’t all that clear because of how recursion works in the schema.

NestedByte.part is a link to the FlexibleByteLayout union. Since one of the union kinds is bytes that means NestedByte.part can be a link to a raw block.

The only opportunity you have to use a binary value without linking to a raw block is if the entire FlexibleByteLayout is a single binary value or block.
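
That rule can be illustrated with a small, hypothetical shape check. Everything here is an assumption for illustration: `validateShape` is not a real API, and CIDs are modeled as strings with a `bafy` prefix as a stand-in for a proper CID type.

```javascript
// Illustrative check of the linking rule described above: bytes are only
// valid as the whole FlexibleByteLayout (the top level), while every part
// inside a nested byte list must be a link, never inline bytes.

// Stand-in link test; real code would use an actual CID type.
const isLink = (v) => typeof v === 'string' && v.startsWith('bafy')

function validateShape (node, topLevel = true) {
  if (node instanceof Uint8Array) {
    // bytes satisfy the union only when they are the entire value
    return topLevel
  }
  return Array.isArray(node) && node.every(([len, part]) =>
    Number.isInteger(len) && isLink(part))
}
```

So a single raw block is a valid FlexibleByteLayout on its own, but the moment you nest, every part becomes a link to another FlexibleByteLayout.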

Member

@rvagg rvagg left a comment


🚢

@rvagg
Member

rvagg commented Jun 19, 2020

How would we go about enforcing this?

Right, and I agree that it's not enforcement that we need here. It's more about clarity about the expected entry point for the primary purpose of this schema. I expect we're going to find codegen and code interface reasons why we need more than just a convention (like the advanced keyword), but for now I think making sure our documentation is clear about how these pieces fit together, and their anticipated hierarchy, is enough.

As it is, this document has the same title as the top-level type, so it's now clear in that way as well as in the presented order; that's good enough for me for now. We can work on this challenge over time.

@mikeal mikeal merged commit f542d54 into master Jun 19, 2020
@warpfork
Contributor

Thrilled that this is merged.

Quick recap on the discussion about "entry point types" in the schemas and how I feel about that: yeah, I think y'all kinda covered all facets of it already.... 👍

  • Docs for something like an ADL probably should make a habit of making it clear what the entry point type is... because that's de facto definitely something that exists for a use case like ADLs...
  • but they should do it in prose.
  • I'll double down on this: I'm fairly convinced that codifying it would be wrong, and we can put that thought to bed. Why? Because "entry point type" is a concept that does not exist for all use cases (as Mikeal also said in this comment; I totally agree) -- there are lots of times when I write one schema and want to use it for, say, a ton of different API endpoints in a web service; then there's lots of "entry point" types and we wouldn't want to write a new schema for each API endpoint!
  • I'm also thumbs up on making it a conventional habit to do "inverse of C order" and put the "entry point" types towards the top. That's how I usually write things, for whatever it's worth.

🚢

@chafey
Contributor

chafey commented Jun 25, 2020

If the entrypoint matters it should be in prose. It wasn't mentioned in the spec so I wasn't sure; I left thinking "it didn't matter, I can choose any". It would be nice to have some examples of FBL implementations in different languages, and of course a reader, perhaps as part of an ADL utility library. If anyone has any ideas on what would be a good example use of this, please let me know, as I'll probably work on this today.

A few other cleanup items:

  1. We should add a link to this spec from the main page
  2. We should rename the page from data.md to flexible-byte-layout.md

I'll submit a PR for these along with some prose on the entrypoint.

@warpfork
Contributor

warpfork commented Jun 25, 2020

(I think we're probably oscillating around the same understanding here, but just in case a little more discussion and alternative examples help...)

I think "I can choose any" is also still technically true, just de facto not so much in this case... because the "choose" is at the scale of "any ADL implementation that wants to be interoperable", so it's kinda just one big single instance of choosing in practice.

Example in another domain that's an interesting contrast: in the schema for Selectors, I introduced a SelectorEnvelope type in addition to the Selector type, and recommended that one might want to use that when one is handling serializing and deserializing messages that don't have established existing context that can make it clear that the following data is a Selector.

But that sentence is in prose in the docs and is heavy in conditionals on purpose: I simultaneously consider it totally reasonable that if someone else is designing a larger schema that uses Selectors somewhere inside their messages, they might just use the Selector type rather than the SelectorEnvelope type. Doing so would make perfect sense if they have additional surrounding structure like type MyCoolProtocolMessage struct { otherInfo CoolTypes; selectorForBees Selector; selectorForBalloons Selector }, for example, because there's already plenty of structural entropy there.

@rvagg rvagg deleted the data-advanced-layout branch July 21, 2020 03:48
prataprc pushed a commit to iprs-dev/ipld-specs that referenced this pull request Oct 13, 2020
* wip: Data

* fix: typo

* fix: rename to lowercase file

* fix: various changes for code review

* fix: factor out double offset indexing

* fix: clearer naming

* fix: chop down to only nested byte lists

* feat: use kinded union

* fix: remove reference to old advanced layout

* fix: remove extra advanced layout

* fix: remove old type references

* fix: remove usage from schema, renames

* fix: remove ADL definition, cleanup description

* fix: remove unnecessary link in union

* fix: latest name change

* fix: moving to new directory

* fix: bring union to top level

* fix: remove option algorithm field

* fix: reduce to single list tuple

* fix: name lengthHint since the property is insecure

* fix: going back to length

* fix: move to new naming

* fix: @rvagg’s review comments

* fix: add document status, thanks @ribasushi

7 participants