-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506
Conversation
134e1d1
to
affa7ed
Compare
also cc @ribasushi the original gip dream is realised in here, see your removed comment |
affa7ed
to
61fe3b1
Compare
dag := gocar.Dag{Root: c, Selector: selectorparse.CommonSelector_ExploreAllRecursively} | ||
// TraverseLinksOnlyOnce is safe for an exhaustive selector but won't be when we allow | ||
// arbitrary selectors here | ||
car := gocar.NewSelectiveCar(req.Context, store, []gocar.Dag{dag}, gocar.TraverseLinksOnlyOnce()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rvagg is this going to be really slow for network operations? My understanding is that go-merkledag has things like concurrent DAG traversal (and prefetching, although that might just be for UnixFS data) whereas go-ipld-prime traversal does not currently have that.
My guess is that this means gocar's traversal is also going to be slow and linear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, well I'm going to claim ignorance on many of those counts, but maybe @warpfork can help out here
it's true that the vast majority of data going through here will be unixfs, so optimisations for that are going to be beneficial if they make a difference, but maybe the case is to build some of that in to go-ipld-prime instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me how much we should worry about "prefetching" or concurrent traversal.
I acknowledge that those features exist in some code, but that does not necessarily imply that they need to exist in the world we're building towards. (Maybe it does; maybe it doesn't. If there's not a benchmark proving it, and a documented user story stating why it's required in context, maybe it's time to review the perceived need.)
ISTM the whole point of CAR files (and a lot of the new systems we're building) are to give people new APIs that explicitly avoid having one query turn into an unbounded number of surprise subqueries of unknown latency. (This in turn is a goal that makes sense because once those horses are allowed to bolt the barn, all possible work is damage control: one can do the Sisyphean work to try to minimize the costs, but one can never make them predictable again, nor as minimal as purely local operation would've been.)
So when I think through what I'd want as an end user, probably I'd want the CAR APIs to explicitly error, or return me a partial CAR, if data would need to be fetched that isn't already local.
Which would mean prefetching is radically less relevant. (It'd be widdling around disk access latency rather than worrying about internet network latency, or heaven forbid DHT lookup latency. At that point, it probably drops into "nice to have, as an optimization, that we can add... someday" territory... especially because simply having a predictable local-only mode would still probably be noticably faster than something that hits even one DHT lookup.)
I don't know if that's the only mode of use to care about; at some point that becomes an "ipfs" decision, and I'll leave that decision for someone else. But if I walk through this, starting from visualizing myself as an end user, I'd be looking for APIs that are predictable and fast, and that means I'm looking to not use the network at all, and that really minimizes the relevance of prefetching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ISTM users also pretty repeatedly tell us that this is what they're interested in. For example, QRI's bulk transporter, which builds whole graphs of content, and ships them all at once, point to point, is evidence of this: that system does not do network requests for missing pieces, but just moves what they have: and this has worked well for them, to the point that they ditched basically any of our other transport mechanisms that were less predictable.
My understanding is that the holistic goal of CAR-based APIs is to unlock that kind of victory in an even better-supported way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unclear to me how much we should worry about "prefetching" or concurrent traversal. ... need to exist in the world we're building towards.
@warpfork if you are not building towards reasonable-to-massive parallelism at every step and layer of your designs, you are building for an already obsolete world. Latencies continue to increase in every piece of interconnect, regardless of its position in the stack. The only salvation is "do more things", especially in the world of quickly-branching DAGs we find ourselves in.
On modern systems with locally-attached NVMe even something as trivial as
echo /same-fs/file-{1,2,3,4,5,6,7,8} | xargs cat >/dev/null
is being blown out of the water at a near linear factor by
echo /same-fs/file-{1,2,3,4,5,6,7,8} | xargs -n1 -P8 cat >/dev/null
( I need to preload VFS caches quite often, so know this behavior painfully well )
One of the main point of the batteries-included go-ipfs
application is to create (an illusion of) a locality-independent-accessible content. I strongly believe that at the current place of its lifecycle, go-ipfs
does not get the luxury of redefining its implicit API contracts. Nor does any greenfield project ( like e.g. Estuary
) gets to avoid the need for parallel traversal either: if anything, removing much of the architectural debt induced roadblocks, makes the inability to saturate any reasonable hardware even more apparent.
Please reconsider your position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a position.
I'm just saying if somebody finds this important, they'd better figure out how important, and figure out who's going to work on it.
I'd recommend starting with a document about what the user story is that we care about, and a benchmark that gives us a needle to focus on moving. The only thing I'm being cautious about here is letting the thread start asking for work to be done with neither of those in hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So when I think through what I'd want as an end user, probably I'd want the CAR APIs to explicitly error, or return me a partial CAR, if data would need to be fetched that isn't already local.
This seems like a reasonable use of the --offline flag (generally already supported across go-ipfs commands and here as well). However, for those who use the online mode being able to download things with reasonable speed is important.
If there's not a benchmark proving it
Is this a request for a benchmark showing that a linear DAG walk with a latency per retrieval of 1s is slower than one with concurrent traversal and prefetching? Seems unnecessary, but making a basic one seems pretty straightforward if necessary.
and a documented user story stating why it's required in context, maybe it's time to review the perceived need
One is using car export to create verifiable gateways (i.e. trusting the gateway only with availability rather than also data integrity).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rvagg I poked around here a bit. The current version of go-car uses slow linear traversal as well so I'm fine with this PR (after we do a go mod tidy
so tests pass).
However, it seems likely we're going to have to revisit this topic as linear DAG traversal is going to be a problem for folks interested in verifiable gateways or generally folks interested in pulling CAR files out of the network (or off of high latency disks which is true for a number of our collabs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@warpfork prefetching is a a very high priority and I feel like I've been saying this for a while. It's already limiting graphsync performance to disk latency, and basically means we can't use selectors anywhere in go-ipfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can tell you that the rate-limit will be blockSize/rtt
which limits us to a few MiB/s assuming 256KiB blocks. Assuming smaller data, it gets much worse. If you want a benchmark, you can pretty easily test ipfs dag export
(no prefetching) versus ipfs pin add
(parallel download) versus ipfs get
(some prefetching that kind of sucks).
go-car@v0.3.2 tagged, updated here and A few of us also chatted today about optimisation possibilities in go-ipld-prime's traversals that could help address some of the concerns mentioned in the thread above. We think that there's some relatively straightforward possibilities for easy cases, we just need to be able to prioritise that work (including testing and verification that it's actually worth it!). |
What's the next step here? If I'm understanding the thread correctly:
Anyways, is there more that needs to be discussed, or can this get merged? |
I think this can be merged. As mentioned, I'd like to build on it with custom selectors, but I'm currently focusing on playing with that further building on Will's work in go-car (ipld/go-car#260). There's a lot of complexity and performance concerns when you get to arbitrary selectors so I figure it's best to be playing outside of go-ipfs first. But this PR is the first step in getting there. |
8fe19d2
to
dc6c208
Compare
@rvagg : sounds good. I assume you'll look into getting the tests to pass. Assuming we merge by mid this week, it'll be included in the go-ipfs 0.11 RC we plan to cut by end of week. |
I've tried, but I don't understand why it doesn't like this:
Works locally for me. There's a go.sum in go-ipfs right here: https://github.com/ipld/go-car/blob/v0.3.2/go.sum, and it's in the go-ipfs go.sum right here in my PR: https://github.com/ipfs/go-ipfs/pull/8506/files#diff-3295df7234525439d778f1b282d146a4f1ff6b415248aaac074e8042d9f42d63 |
… export This is "safe" now because we can limit duplicate block loads like go-merkledag does and won't get trapped taking a long time for complex DAGs. We can do this while we're using an exhaustive selector (like ExploreAll here) but will need an alternative strategy when we go for arbitrary selectors.
dc6c208
to
3305059
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rvagg
This is "safe" now because we can limit duplicate block loads like
go-merkledag does and won't get trapped taking a long time for complex
DAGs. We can do this while we're using an exhaustive selector (like
ExploreAll here) but will need an alternative strategy when we go for
arbitrary selectors.
Depends on getting a go-car tag, which I think we should just do now (ref ipld/go-car#254) for v0.3.2.
I'm hoping this helps provide an alternative path to #8183 although I'm not sure and haven't tested for that specific use-case. It'd just be nice to not use more go-merkledag_ but less, and also provide a clear path to implementing arbitrary selectors (that was also part of my changes here which I've backed out, I didn't finish as I got distracted fixing something closely related in Lotus!). If we bake go-merkledag in further here as #8503 does then we cut ourselves off from that, or at least make it much more difficult.