chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506

rvagg · 2021-10-12T02:59:21Z

This is "safe" now because we can limit duplicate block loads like
go-merkledag does and won't get trapped taking a long time for complex
DAGs. We can do this while we're using an exhaustive selector (like
ExploreAll here) but will need an alternative strategy when we go for
arbitrary selectors.

Depends on getting a go-car tag, which I think we should just do now (ref ipld/go-car#254) for v0.3.2.

I'm hoping this helps provide an alternative path to #8183 although I'm not sure and haven't tested for that specific use-case. It'd just be nice to not use more go-merkledag_ but less, and also provide a clear path to implementing arbitrary selectors (that was also part of my changes here which I've backed out, I didn't finish as I got distracted fixing something closely related in Lotus!). If we bake go-merkledag in further here as #8503 does then we cut ourselves off from that, or at least make it much more difficult.

rvagg · 2021-10-12T03:01:27Z

also cc @ribasushi the original gip dream is realised in here, see your removed comment

aschmahmann · 2021-10-12T03:04:33Z

core/commands/dag/export.go

+		dag := gocar.Dag{Root: c, Selector: selectorparse.CommonSelector_ExploreAllRecursively}
+		// TraverseLinksOnlyOnce is safe for an exhaustive selector but won't be when we allow
+		// arbitrary selectors here
+		car := gocar.NewSelectiveCar(req.Context, store, []gocar.Dag{dag}, gocar.TraverseLinksOnlyOnce())


@rvagg is this going to be really slow for network operations? My understanding is that go-merkledag has things like concurrent DAG traversal (and prefetching, although that might just be for UnixFS data) whereas go-ipld-prime traversal does not currently have that.

My guess is that this means gocar's traversal is also going to be slow and linear.

oh, well I'm going to claim ignorance on many of those counts, but maybe @warpfork can help out here

it's true that the vast majority of data going through here will be unixfs, so optimisations for that are going to be beneficial if they make a difference, but maybe the case is to build some of that in to go-ipld-prime instead

It's unclear to me how much we should worry about "prefetching" or concurrent traversal.

I acknowledge that those features exist in some code, but that does not necessarily imply that they need to exist in the world we're building towards. (Maybe it does; maybe it doesn't. If there's not a benchmark proving it, and a documented user story stating why it's required in context, maybe it's time to review the perceived need.)

ISTM the whole point of CAR files (and a lot of the new systems we're building) are to give people new APIs that explicitly avoid having one query turn into an unbounded number of surprise subqueries of unknown latency. (This in turn is a goal that makes sense because once those horses are allowed to bolt the barn, all possible work is damage control: one can do the Sisyphean work to try to minimize the costs, but one can never make them predictable again, nor as minimal as purely local operation would've been.)

So when I think through what I'd want as an end user, probably I'd want the CAR APIs to explicitly error, or return me a partial CAR, if data would need to be fetched that isn't already local.

Which would mean prefetching is radically less relevant. (It'd be widdling around disk access latency rather than worrying about internet network latency, or heaven forbid DHT lookup latency. At that point, it probably drops into "nice to have, as an optimization, that we can add... someday" territory... especially because simply having a predictable local-only mode would still probably be noticably faster than something that hits even one DHT lookup.)

I don't know if that's the only mode of use to care about; at some point that becomes an "ipfs" decision, and I'll leave that decision for someone else. But if I walk through this, starting from visualizing myself as an end user, I'd be looking for APIs that are predictable and fast, and that means I'm looking to not use the network at all, and that really minimizes the relevance of prefetching.

ISTM users also pretty repeatedly tell us that this is what they're interested in. For example, QRI's bulk transporter, which builds whole graphs of content, and ships them all at once, point to point, is evidence of this: that system does not do network requests for missing pieces, but just moves what they have: and this has worked well for them, to the point that they ditched basically any of our other transport mechanisms that were less predictable.

My understanding is that the holistic goal of CAR-based APIs is to unlock that kind of victory in an even better-supported way.

unclear to me how much we should worry about "prefetching" or concurrent traversal. ... need to exist in the world we're building towards.

@warpfork if you are not building towards reasonable-to-massive parallelism at every step and layer of your designs, you are building for an already obsolete world. Latencies continue to increase in every piece of interconnect, regardless of its position in the stack. The only salvation is "do more things", especially in the world of quickly-branching DAGs we find ourselves in.

On modern systems with locally-attached NVMe even something as trivial as

echo /same-fs/file-{1,2,3,4,5,6,7,8} | xargs cat >/dev/null

is being blown out of the water at a near linear factor by

echo /same-fs/file-{1,2,3,4,5,6,7,8} | xargs -n1 -P8 cat >/dev/null

( I need to preload VFS caches quite often, so know this behavior painfully well )

One of the main point of the batteries-included go-ipfs application is to create (an illusion of) a locality-independent-accessible content. I strongly believe that at the current place of its lifecycle, go-ipfs does not get the luxury of redefining its implicit API contracts. Nor does any greenfield project ( like e.g. Estuary ) gets to avoid the need for parallel traversal either: if anything, removing much of the architectural debt induced roadblocks, makes the inability to saturate any reasonable hardware even more apparent.

Please reconsider your position.

I don't have a position.

I'm just saying if somebody finds this important, they'd better figure out how important, and figure out who's going to work on it.

I'd recommend starting with a document about what the user story is that we care about, and a benchmark that gives us a needle to focus on moving. The only thing I'm being cautious about here is letting the thread start asking for work to be done with neither of those in hand.

So when I think through what I'd want as an end user, probably I'd want the CAR APIs to explicitly error, or return me a partial CAR, if data would need to be fetched that isn't already local.

This seems like a reasonable use of the --offline flag (generally already supported across go-ipfs commands and here as well). However, for those who use the online mode being able to download things with reasonable speed is important.

If there's not a benchmark proving it

Is this a request for a benchmark showing that a linear DAG walk with a latency per retrieval of 1s is slower than one with concurrent traversal and prefetching? Seems unnecessary, but making a basic one seems pretty straightforward if necessary.

and a documented user story stating why it's required in context, maybe it's time to review the perceived need

One is using car export to create verifiable gateways (i.e. trusting the gateway only with availability rather than also data integrity).

@rvagg I poked around here a bit. The current version of go-car uses slow linear traversal as well so I'm fine with this PR (after we do a go mod tidy so tests pass).

However, it seems likely we're going to have to revisit this topic as linear DAG traversal is going to be a problem for folks interested in verifiable gateways or generally folks interested in pulling CAR files out of the network (or off of high latency disks which is true for a number of our collabs).

@warpfork prefetching is a a very high priority and I feel like I've been saying this for a while. It's already limiting graphsync performance to disk latency, and basically means we can't use selectors anywhere in go-ipfs.

I can tell you that the rate-limit will be blockSize/rtt which limits us to a few MiB/s assuming 256KiB blocks. Assuming smaller data, it gets much worse. If you want a benchmark, you can pretty easily test ipfs dag export (no prefetching) versus ipfs pin add (parallel download) versus ipfs get (some prefetching that kind of sucks).

go.mod

rvagg · 2021-10-13T02:29:19Z

go-car@v0.3.2 tagged, updated here and go mod tidy done, this should be gtg if we get greens

A few of us also chatted today about optimisation possibilities in go-ipld-prime's traversals that could help address some of the concerns mentioned in the thread above. We think that there's some relatively straightforward possibilities for easy cases, we just need to be able to prioritise that work (including testing and verification that it's actually worth it!).

BigLep · 2021-10-24T05:40:44Z

What's the next step here? If I'm understanding the thread correctly:

This code isn't making the current situation worse, right?
Some usecases were identified where prefetching is needed/expected (e.g., "car export for verifiable gateways") and we have existing experience of what this looks like with and without prefetching (per @Stebalien's comment above) to have a sense of the performance impact.

Anyways, is there more that needs to be discussed, or can this get merged?

rvagg · 2021-10-25T03:16:03Z

I think this can be merged. As mentioned, I'd like to build on it with custom selectors, but I'm currently focusing on playing with that further building on Will's work in go-car (ipld/go-car#260). There's a lot of complexity and performance concerns when you get to arbitrary selectors so I figure it's best to be playing outside of go-ipfs first. But this PR is the first step in getting there.

BigLep · 2021-10-25T04:59:50Z

@rvagg : sounds good. I assume you'll look into getting the tests to pass. Assuming we merge by mid this week, it'll be included in the go-ipfs 0.11 RC we plan to cut by end of week.

rvagg · 2021-10-25T06:00:10Z

I've tried, but I don't understand why it doesn't like this:

#!/bin/bash -eo pipefail
go test -v ./...
go: github.com/ipfs/go-ipfs@v0.9.1 requires
	github.com/ipld/go-car@v0.3.2: missing go.sum entry; to add it:
	go mod download github.com/ipld/go-car

Exited with code exit status 1
CircleCI received exit code 1

Works locally for me. There's a go.sum in go-ipfs right here: https://github.com/ipld/go-car/blob/v0.3.2/go.sum, and it's in the go-ipfs go.sum right here in my PR: https://github.com/ipfs/go-ipfs/pull/8506/files#diff-3295df7234525439d778f1b282d146a4f1ff6b415248aaac074e8042d9f42d63
And why is it reporting github.com/ipfs/go-ipfs@v0.9.1?
This one's beyond me, needs a Go person. @aschmahmann?

aschmahmann · 2021-10-26T15:57:04Z

Ya, @rvagg this is related to the examples and is resolved by #8516. I'll rebase on master and see how we're doing

… export This is "safe" now because we can limit duplicate block loads like go-merkledag does and won't get trapped taking a long time for complex DAGs. We can do this while we're using an exhaustive selector (like ExploreAll here) but will need an alternative strategy when we go for arbitrary selectors.

…for dag export

aschmahmann

Thanks @rvagg

rvagg requested a review from coryschwartz October 12, 2021 02:59

rvagg force-pushed the rvagg/export-select branch from 134e1d1 to affa7ed Compare October 12, 2021 03:00

rvagg force-pushed the rvagg/export-select branch from affa7ed to 61fe3b1 Compare October 12, 2021 03:02

rvagg mentioned this pull request Oct 12, 2021

Feat/car iter #8503

Closed

aschmahmann reviewed Oct 12, 2021

View reviewed changes

go.mod Outdated Show resolved Hide resolved

rvagg mentioned this pull request Oct 21, 2021

feat: fix get-dag and add version=1 option ipld/go-car#260

Merged

rvagg force-pushed the rvagg/export-select branch from 8fe19d2 to dc6c208 Compare October 25, 2021 03:37

BigLep added this to the go-ipfs 0.11 milestone Oct 25, 2021

rvagg added 2 commits October 26, 2021 12:08

fixup! chore: replace go-merkledag walk with go-ipld-prime traversal …

3305059

…for dag export

aschmahmann force-pushed the rvagg/export-select branch from dc6c208 to 3305059 Compare October 26, 2021 16:08

aschmahmann approved these changes Oct 26, 2021

View reviewed changes

aschmahmann merged commit 9e8b6e5 into master Oct 26, 2021

aschmahmann deleted the rvagg/export-select branch October 26, 2021 18:07

rvagg mentioned this pull request Nov 17, 2021

TODO: DAG walk for Block API multiformats/js-multiformats#118

Closed

guseggert mentioned this pull request Nov 23, 2021

Release v0.11 #8343

Closed

80 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506

chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506

rvagg commented Oct 12, 2021

rvagg commented Oct 12, 2021

aschmahmann Oct 12, 2021

rvagg Oct 12, 2021

warpfork Oct 12, 2021 •

edited

Loading

warpfork Oct 12, 2021

ribasushi Oct 12, 2021

warpfork Oct 12, 2021 •

edited

Loading

aschmahmann Oct 12, 2021

aschmahmann Oct 12, 2021

Stebalien Oct 12, 2021

Stebalien Oct 12, 2021

rvagg commented Oct 13, 2021

BigLep commented Oct 24, 2021

rvagg commented Oct 25, 2021

BigLep commented Oct 25, 2021

rvagg commented Oct 25, 2021

aschmahmann commented Oct 26, 2021

aschmahmann left a comment

chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506

chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506

Conversation

rvagg commented Oct 12, 2021

rvagg commented Oct 12, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warpfork Oct 12, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

warpfork Oct 12, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rvagg commented Oct 13, 2021

BigLep commented Oct 24, 2021

rvagg commented Oct 25, 2021

BigLep commented Oct 25, 2021

rvagg commented Oct 25, 2021

aschmahmann commented Oct 26, 2021

aschmahmann left a comment

Choose a reason for hiding this comment

warpfork Oct 12, 2021 •

edited

Loading

warpfork Oct 12, 2021 •

edited

Loading