
implement a faster dag structure for unixfs #687

Closed
wants to merge 1 commit

Conversation

whyrusleeping
Member

Note: This code isn't alpha priority, and while it works and I've tested it pretty well, it isn't on our TODO list. So other code should take priority.
This PR adds an alternate method in importer to build a DAG with. This method ensures that every node has data in it, making streaming much faster, since you'll have some data to display while you're fetching the next blocks. I achieve a mostly balanced tree by performing a sort of 'rotate' every time the required depth has filled up. This rotate moves all but the first child node under the first child node to make room for more nodes to be added at the root level. Doing this rotate ensures that the pre-order traversal of the tree remains the same throughout the tree's creation.
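A minimal sketch of that rotate, assuming a hypothetical node type and fan-out (an illustration, not the actual importer code):

type node struct {
	data     []byte
	children []*node
}

const maxLinks = 4 // assumed fan-out, for illustration only

// rotate frees link slots at the root once it fills up: every child
// except the first is moved underneath the first child. The pre-order
// traversal (root, first child, its descendants, then the former
// siblings) is the same before and after, so file bytes stay in order.
func rotate(root *node) {
	if len(root.children) < maxLinks {
		return // root still has room
	}
	first := root.children[0]
	first.children = append(first.children, root.children[1:]...)
	root.children = []*node{first}
}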

@whyrusleeping whyrusleeping added the status/in-progress In progress label Jan 29, 2015
@jbenet
Member

jbenet commented Jan 29, 2015

@whyrusleeping please duplicate the data in this case. we've discussed this a ton of times... it must be possible to take only the leaves and regenerate the entire file.

Without this property we will not be able to easily share entire subsections of files with others, because the links will taint that data.

@whyrusleeping
Member Author

You can still share subsections of the file with others; the tree structure allows for that just fine. I'm curious what scenario you're worried about.

@jbenet
Member

jbenet commented Jan 29, 2015

@whyrusleeping maybe I don't understand your description. Can you draw it?

@whyrusleeping
Member Author

Well, any concerns you would have had about storing data in the intermediate links would still apply. In the case that I want just a subset of a file, say from offset 4000 to offset 10000, I would just give out the subtree starting at the node whose data contains offset 4000.

I tried doing the data-duplication strategy, but it seemed very wasteful of bandwidth and also very difficult to implement properly. To get it right, adding the duplicate 'cache' data has to be done entirely as a post-process, which is expensive for larger trees and increases the number of disk writes (already our bottleneck). I'm not saying we should entirely replace our current DAG builder, but adding this as an option should be considered.

@whyrusleeping
Member Author

I actually have another idea that should be a good compromise. It will keep all data in the leaf nodes and provide a better layout for streaming. I'll work on that later when I have some extra time. It's very similar to the ext4 block layout model.

[image: block layout diagram]
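For reference, a rough sketch of that ext4-style shape over the node type from the earlier sketch (hypothetical helpers, not a real implementation): the root links a few leaves directly, then links to balanced subtrees of increasing depth, analogous to ext4's direct, single-, double-, and triple-indirect block pointers.

// buildBalanced builds a balanced subtree of the given depth over the
// leaves, returning the subtree and the leaves left over.
func buildBalanced(leaves []*node, depth, fanout int) (*node, []*node) {
	if depth == 0 {
		return leaves[0], leaves[1:] // a depth-0 "subtree" is one leaf
	}
	n := &node{}
	for i := 0; i < fanout && len(leaves) > 0; i++ {
		var sub *node
		sub, leaves = buildBalanced(leaves, depth-1, fanout)
		n.children = append(n.children, sub)
	}
	return n, leaves
}

func buildExt4Style(leaves []*node, direct, fanout int) *node {
	root := &node{}
	k := direct
	if k > len(leaves) {
		k = len(leaves)
	}
	root.children = append(root.children, leaves[:k]...) // direct blocks
	leaves = leaves[k:]
	// Then increasingly deep balanced subtrees until the leaves run out.
	for depth := 1; len(leaves) > 0; depth++ {
		var sub *node
		sub, leaves = buildBalanced(leaves, depth, fanout)
		root.children = append(root.children, sub)
	}
	return root
}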

@whyrusleeping
Member Author

Also, just noticed this:

var roughLinkSize = 258 + 8 + 5  // sha256 multihash + size + no name + protobuf framing

We are assuming that a sha256 multihash is 258 bytes, when it's actually 34 (32 bytes for the sha256 digest + a 2-byte multihash prefix). The 258 presumably came from counting the 256-bit digest as 256 bytes.
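Assuming the other two terms in that comment are right, the corrected estimate would be:

var roughLinkSize = 34 + 8 + 5 // sha256 multihash (2-byte prefix + 32-byte digest) + size + no name + protobuf framing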

@jbenet
Member

jbenet commented Jan 30, 2015

@whyrusleeping oh wow. We (I) should be clearer about whether we're using bit or byte sizes.

(Though I don't think we should necessarily increase the number of links per indirect block; seeking is pretty fast right now thanks in part to that.)

@whyrusleeping
Member Author

Agreed. Having ~20 or 30 links per block is honestly plenty.

@jbenet
Member

jbenet commented Jan 30, 2015

I very much like the ext4 idea.

@whyrusleeping
Member Author

Okay, and the more I think about it, the more I like it over what's implemented here.

@jbenet
Member

jbenet commented Jan 30, 2015

@whyrusleeping yeah, it would help with opening files over the network :)

@whyrusleeping
Member Author

The other structure I thought up is what I'm calling a "List of Lists":

[image: "List of Lists" layout diagram]

The advantage is that with every request after the first you receive data, and the RTTs required to get the next data block do not increase with the size of the file; they remain constant. It's basically a linked list of arrays of nodes. I also believe that it has fewer intermediary nodes than any other option discussed, which is better for the network overall (fewer values need to be provided). The only downside I can really think of is that, as far as trees go, it's kind of an ugly tree (0/10, would not decorate with Christmas ornaments).
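A sketch of that shape, again over the hypothetical node type from above: each segment holds a fixed batch of leaf links plus a single trailing link to the next segment, so every fetch after the first returns file data at a constant RTT cost.

func buildListOfLists(leaves []*node, fanout int) *node {
	head := &node{}
	cur := head
	for len(leaves) > 0 {
		k := fanout
		if k > len(leaves) {
			k = len(leaves)
		}
		cur.children = append(cur.children, leaves[:k]...)
		leaves = leaves[k:]
		if len(leaves) > 0 {
			next := &node{}
			cur.children = append(cur.children, next) // link to the next segment
			cur = next
		}
	}
	return head
}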

@whyrusleeping
Member Author

Alright, so I've come up with a new tree structure optimized for both streaming AND seeking through a given file. This improves upon both the ext4 structure (which is mainly aimed at on-disk filesystems) and the "List of Lists" idea I previously commented about.

The downside of the ext4-style tree layout was that, as you got farther into the file, the number of requests you needed to make in order to get data increased. I noticed this problem and came up with the "List of Lists" layout, which would work fantastically for a sequential stream. The issue, though, comes when you try to seek through it: the top-level node is very poorly weighted to one side, so it's 'narrow' from the data's perspective, and seeking requires O(n) requests to find the desired location in the file, where ext4 was roughly O(log(n)).

The Trickle{Tree,Dag} addresses both of these concerns: each request after the first can return actual file data, and the cost of seeking remains near O(log(n)) since it has a recursive tree structure. A visualization of it would look like the ext4 tree, but instead of having iteratively deeper 'balanced' trees, it has iteratively deeper versions of itself.

An example layout is here: http://gateway.ipfs.io/ipfs/QmT3mc4wtmyk2Fu1RFMVqvoVgYbDJeoTVnxLM28E4prVvj
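A sketch of that recursive shape over the same hypothetical node type (the real builder also repeats each depth several times, a detail this simplification omits): it is the ext4 sketch from above with the balanced subtrees replaced by trickle-shaped ones.

// fillTrickle builds a trickle subtree of the given depth: direct leaves
// first, then one subtree per smaller depth, each trickle-shaped itself.
func fillTrickle(leaves []*node, depth, direct int) (*node, []*node) {
	n := &node{}
	k := direct
	if k > len(leaves) {
		k = len(leaves)
	}
	n.children = append(n.children, leaves[:k]...)
	leaves = leaves[k:]
	for d := 1; d < depth && len(leaves) > 0; d++ {
		var sub *node
		sub, leaves = fillTrickle(leaves, d, direct)
		n.children = append(n.children, sub)
	}
	return n, leaves
}

func buildTrickle(leaves []*node, direct int) *node {
	root := &node{}
	k := direct
	if k > len(leaves) {
		k = len(leaves)
	}
	root.children = append(root.children, leaves[:k]...)
	leaves = leaves[k:]
	// Subtrees of depth 1, 2, 3, ... hang off the root until the leaves
	// run out: capacity roughly doubles with each new subtree, so seeking
	// stays near O(log(n)) while every fetched node carries file data.
	for depth := 1; len(leaves) > 0; depth++ {
		var sub *node
		sub, leaves = fillTrickle(leaves, depth, direct)
		root.children = append(root.children, sub)
	}
	return root
}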

@whyrusleeping
Member Author

closing in favor of #713

@whyrusleeping whyrusleeping removed the status/in-progress In progress label Feb 1, 2015
@Kubuxu Kubuxu deleted the fast-dag branch February 27, 2017 20:36
ariescodescream pushed a commit to ariescodescream/go-ipfs that referenced this pull request Oct 23, 2021
Hardening Improvements: RT diversity and decreased RT churn