traversal: track seen links, and revisit only if configured to do so. #252

warpfork · 2021-09-15T12:35:22Z

Like it says on the tin.

Normally, it's possible to visit the same blocks of data repeatedly during a traversal -- which, if you look at the world with sufficiently strict definitions, "makes sense", because traversals are really visiting (path, node) tuples, not just the nodes themselves.

But often, we find people really do only care about visiting nodes and don't particularly care to have the traversal inform them if it reaches that same data again by a different path.

With this patch, we track where we've been, at the granularity of links. If we've encountered a link before, we won't load it again.

There's a configuration option for this, because in rare occasions, I could imagine not wanting this dedup; also, if one knows one doesn't need it, it may be valuable to turn it off because it does of course consume a small amount of memory to track where we've been.

ribasushi

👍

hannahhoward

Do we have any concerns about memory implications of maintaining this map? I don't think we do but I have a small amount of anxiety.

Also just to confirm: since a map is inherently a pointer, it's shared among all created instances of traversal.Progress in single selector traversal?

Should we add a test to verify this functionality (i.e. LinkRevisit = false, check that we don't traverse a link twice?)

Basically support this PR 100% just want to make sure we're doing a thorough check of the implications for behavior.

Also, among other things, note that two versions of graphsync with and without this patch will be incompatable as the traversal is different. We had probably better make a PR that upgrades the protocol version and sets LinkTraversal = true when talking to a node with the old protocol version.

Stebalien · 2021-09-15T14:45:58Z

This logic isn't right. We need to track (Link, Selector) tuples, not just links. That is, we need to avoid visiting the same node with the same selector, but still need to revisit previously seen nodes with different selectors.

For example, we might traverse through a node to visit a single child via one selector, then traverse through that same node to visit all children.

Stebalien · 2021-09-15T14:47:46Z

Then we need to avoid sending blocks we've seen multiple times, but that's something we should implement in graphsync.

hannahhoward · 2021-09-15T14:48:12Z

Also, fundamentally traversals are visiting (selector, path, node) tuples. This means turning LinkRevisit may result not only in a fewer nodes visited, but some child nodes not visted at all.

consider the following:
block 1 is a map with fields "A" & "B". Field "A" is a link to block 2, Field "B" is a link to block 3. Block 2 is a map with a single field "C" that links to block 4, Block3 is a map with a single field D that links to block 4. Block 4 contains a map that links to a large sub DAG.

Now consider the following selector applied to block 1:

ExploreFields{
   "A": ExploreFields{
       "C": Matcher{}
   },
   "B": ExploreFields{
      "D": ExploreRecursive{
         ExploreAll{
             ExploreRecursiveEdge{}
         }
      }
   }
}

With LinkRevisit = false, the traversal encounteres block 4 first through the "A"/"C" path through block 2, and stops. When it encounters Block 4 a second time through the "B"/"D" path and block 4, it ignores it, meaning the rest of the sub dag below block 4 is never traversed.

With LinkRevisit = true, the traversal loads block 4 among both paths, and on the "B"/"D"/Block 3 path it follows the whole sub DAG, including it in the result set.

This is relevant since our most common use of selectors is to "get me all the blocks that get traversed here" as is used in go-graphsync.

hannahhoward · 2021-09-15T14:49:28Z

Follow-up: my comment is simply a restatement of @Stebalien's concern, which I think is serious.

hannahhoward · 2021-09-15T14:51:09Z

BTW, I am not sure whether selectors can be easily compare in a map as is done for only links.

Stebalien · 2021-09-15T15:17:43Z

Hm. Does the path matter? That is, at every node, selectors yield a set of sub-selectors. If two paths yield the same (Link, Sub-Selector) tuple, I think we can just skip that one. If it's path dependent, that's a problem we should try to solve.

warpfork · 2021-09-15T16:05:39Z

This logic isn't right. We need to track (Link, Selector) tuples, not just links. That is, we need to avoid visiting the same node with the same selector, but still need to revisit previously seen nodes with different selectors.

Shoot. Yeah, this is true in the general case.

There are also a bunch of cases where that's not really a problem (and those seem to predominate usage that I've seen and heard of so far -- ex1: it's just a blind recursive walk; ex2: protocols tend to have the same structures at the tops of blocks and so usually any logic that's sensitive to how we got here doesn't need to look above block boundaries in practice / has the same selector clauses landing on the same block even when it's been re-reached), in which case this overall has desirable behaviour. But we probably can't say that this is 100% of cases.

@hannahhoward's also right that it's not easy to put selectors in a map key right now. They're not hashable in the way that golang maps would demand. Not sure what we want to do about this.

Stebalien · 2021-09-15T16:19:02Z

There are also a bunch of cases where that's not really a problem (and those seem to predominate usage that I've seen and heard of so far -- ex1: it's just a blind recursive walk; ex2: protocols tend to have the same structures at the tops of blocks and so usually any logic that's sensitive to how we got here doesn't need to look above block boundaries in practice / has the same selector clauses landing on the same block even when it's been re-reached), in which case this overall has desirable behaviour. But we probably can't say that this is 100% of cases.

I wouldn't implement this unless we do it correctly.

They're not hashable in the way that golang maps would demand. Not sure what we want to do about this.

They're serializable to IPLD, right? Is there an efficient way to "hash" them that way?

Alternatively, we can probably just have a map[link][]selector and check for equality. The selector set should usually be 0 or 1.

hannahhoward · 2021-09-15T19:08:33Z

@Stebalien there are some missing features to serializing selectors to IPLD. Yes, in general selectors are always expressed as IPLD. however, we don't currently have code once we've parsed a selector to its executable form to put it back in IPLD. Moreover, I'm not totally sure if selectors at a specific point in traversal serialize easily. (one would imagine they do, but I remember the implementation being pretty messy for recursive selectors)

Anyway, all over-come-able, but more work.

Sidebar: important thing to consider -- is it possible in all cases to say whether one selector completely contains another selector (or to devise a hueristic for most cases). This is a good question to consider as it has implications both for this problem AND for future efforts at Graphsync request splitting.

Stebalien · 2021-09-15T19:22:16Z

Note: these questions are also very relevant for pinning & garbage collection.

I looked into re-serializing selectors at one point, and it didn't look too horrible. But I didn't look closely.

warpfork · 2021-09-16T08:27:30Z

The gnarliest bit is (indeed, and perhaps unsurprisingly) around the recursion clauses. The recursion system causes new selector objects to be yielded which don't have a direct correspondence to the original selector spec document that the traversal took as an original parameter. There is also some parts of the recursion selector state that would need to be explicitly excluded if we wanted to get a serialization that's useful for this map key (namely, depth limit decrementors).

So, I think if we made such a reserialization, it's unclear whether that would actually need to be in the same format as specifies a selector. And in any approach where we're trying to get a map key of that complexity, a lot of test coverage would be required to make sure we're doing it right.

rvagg · 2021-09-25T06:12:37Z

I think if we made such a reserialization, it's unclear whether that would actually need to be in the same format as specifies a selector.

We have the option of keeping this an internal concern so it doesn't have to be pretty, just unique and stable for map keys. What kind of options do we have for such a serialization?

I'd be happy to help with filling out a suite of tests of edges to make sure this works properly.

traversal/fns.go

rvagg · 2021-09-29T06:33:47Z

I've cherry-picked this into #251 and used the tests I wrote there for SkipMe for this as well.

So I'd like to have this option available in the toolkit for cases where we know it's safe, but we need to be clear about the fact that it's not a hammer to be used on all nails (hence the doc suggestion here which I've incorporated into the cherry-pick). This is fine for an explore-all selector, but not for more complex selectors where revisits with changed selector state is valid.

My only outstanding concern here is that this is an opt-out, you have to set a false to make it not skip. So any existing uses of selectors will get a true unless they opt-out, and that could be a surprise break for those; and it might be a surprise default for people who don't quite understand the nuances here. Could we switch it around, make it SkipLinkRevisit, or NoRevisitLinks?

Normally, it's possible to visit the same blocks of data repeatedly during a traversal -- which, if you look at the world with sufficiently strict definitions, "makes sense", because traversals are really visiting `(path, node)` tuples, not just the nodes themselves. But often, we find people really do only care about visiting nodes and don't particularly care to have the traversal inform them if it reaches that same data again by a different path. With this patch, we track where we've been, at the granularity of links. If we've encountered a link before, we won't load it again. There's a configuration option for this, because in rare occasions, I could imagine not wanting this dedup; also, if one knows one doesn't need it, it may be valuable to turn it off because it does of course consume a small amount of memory to track where we've been.

Co-authored-by: Rod Vagg <rod@vagg.org>

…ather than defaulting to on. This makes it a much less breaking change.

warpfork · 2021-09-29T13:36:15Z

@rvagg's additional docs added.

Good point about the change of default. I flipped it in the latest commit. The default is now as it was; the new behavior is opt-in. (Correspondingly, did rename the field. Naming bikeshed welcome.)

I think I'd be comfortable merging this now, since it has cautionary docs and is a non-default behavior.

(We also don't close the door to adding much smarter/more complicated memoization behavior in the future which works on (link+selector) tuple keys. I just don't know if I see that getting prioritized highly enough for the amount of work it would take, and suspect there might even be a research-tier problem there, so don't want to block for it either.)

rvagg · 2021-09-30T06:49:38Z

I'm fine with this, though I've pulled the latest in to #251 which includes tests for it. So I'd like that PR considered which gets us coverage here and fixes SkipMe which is still broken on master.

rvagg · 2021-09-30T10:50:28Z

included in #251 which is now merged

warpfork requested review from hannahhoward and ribasushi September 15, 2021 12:35

ribasushi approved these changes Sep 15, 2021

View reviewed changes

hannahhoward reviewed Sep 15, 2021

View reviewed changes

rvagg reviewed Sep 29, 2021

View reviewed changes

traversal/fns.go Outdated Show resolved Hide resolved

rvagg mentioned this pull request Sep 29, 2021

Don't abort block processing when encountering SkipMe #251

Merged

warpfork and others added 2 commits September 29, 2021 14:58

traversal: additional cautionary docs regarding LinkRevisit toggle

71088ab

Co-authored-by: Rod Vagg <rod@vagg.org>

warpfork force-pushed the traversal-track-visited-links branch from 3f96c4b to 71088ab Compare September 29, 2021 12:59

traversal: the link-visit-only-once behavior should require opt-in, r…

10cb7bc

…ather than defaulting to on. This makes it a much less breaking change.

rvagg closed this Sep 30, 2021

rvagg deleted the traversal-track-visited-links branch September 30, 2021 10:50

rvagg mentioned this pull request Oct 1, 2021

Expose selector traversal options for SelectiveCar ipld/go-car#251

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

traversal: track seen links, and revisit only if configured to do so. #252

traversal: track seen links, and revisit only if configured to do so. #252

warpfork commented Sep 15, 2021

ribasushi left a comment

hannahhoward left a comment

Stebalien commented Sep 15, 2021

Stebalien commented Sep 15, 2021

hannahhoward commented Sep 15, 2021 •

edited

Loading

hannahhoward commented Sep 15, 2021

hannahhoward commented Sep 15, 2021

Stebalien commented Sep 15, 2021

warpfork commented Sep 15, 2021

Stebalien commented Sep 15, 2021

hannahhoward commented Sep 15, 2021

Stebalien commented Sep 15, 2021

warpfork commented Sep 16, 2021 •

edited

Loading

rvagg commented Sep 25, 2021

rvagg commented Sep 29, 2021

warpfork commented Sep 29, 2021

rvagg commented Sep 30, 2021

rvagg commented Sep 30, 2021

traversal: track seen links, and revisit only if configured to do so. #252

traversal: track seen links, and revisit only if configured to do so. #252

Conversation

warpfork commented Sep 15, 2021

ribasushi left a comment

Choose a reason for hiding this comment

hannahhoward left a comment

Choose a reason for hiding this comment

Stebalien commented Sep 15, 2021

Stebalien commented Sep 15, 2021

hannahhoward commented Sep 15, 2021 • edited Loading

hannahhoward commented Sep 15, 2021

hannahhoward commented Sep 15, 2021

Stebalien commented Sep 15, 2021

warpfork commented Sep 15, 2021

Stebalien commented Sep 15, 2021

hannahhoward commented Sep 15, 2021

Stebalien commented Sep 15, 2021

warpfork commented Sep 16, 2021 • edited Loading

rvagg commented Sep 25, 2021

rvagg commented Sep 29, 2021

warpfork commented Sep 29, 2021

rvagg commented Sep 30, 2021

rvagg commented Sep 30, 2021

hannahhoward commented Sep 15, 2021 •

edited

Loading

warpfork commented Sep 16, 2021 •

edited

Loading