Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

traversal: track seen links, and revisit only if configured to do so. #252

Closed
wants to merge 3 commits into from

Conversation

warpfork
Copy link
Collaborator

Like it says on the tin.

Normally, it's possible to visit the same blocks of data repeatedly during a traversal -- which, if you look at the world with sufficiently strict definitions, "makes sense", because traversals are really visiting (path, node) tuples, not just the nodes themselves.

But often, we find people really do only care about visiting nodes and don't particularly care to have the traversal inform them if it reaches that same data again by a different path.

With this patch, we track where we've been, at the granularity of links. If we've encountered a link before, we won't load it again.

There's a configuration option for this, because in rare occasions, I could imagine not wanting this dedup; also, if one knows one doesn't need it, it may be valuable to turn it off because it does of course consume a small amount of memory to track where we've been.

Copy link
Contributor

@ribasushi ribasushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
:shipit:

Copy link
Collaborator

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any concerns about memory implications of maintaining this map? I don't think we do but I have a small amount of anxiety.

Also just to confirm: since a map is inherently a pointer, it's shared among all created instances of traversal.Progress in single selector traversal?

Should we add a test to verify this functionality (i.e. LinkRevisit = false, check that we don't traverse a link twice?)

Basically support this PR 100% just want to make sure we're doing a thorough check of the implications for behavior.

Also, among other things, note that two versions of graphsync with and without this patch will be incompatable as the traversal is different. We had probably better make a PR that upgrades the protocol version and sets LinkTraversal = true when talking to a node with the old protocol version.

@Stebalien
Copy link

This logic isn't right. We need to track (Link, Selector) tuples, not just links. That is, we need to avoid visiting the same node with the same selector, but still need to revisit previously seen nodes with different selectors.

For example, we might traverse through a node to visit a single child via one selector, then traverse through that same node to visit all children.

@Stebalien
Copy link

Then we need to avoid sending blocks we've seen multiple times, but that's something we should implement in graphsync.

@hannahhoward
Copy link
Collaborator

hannahhoward commented Sep 15, 2021

Also, fundamentally traversals are visiting (selector, path, node) tuples. This means turning LinkRevisit may result not only in a fewer nodes visited, but some child nodes not visted at all.

consider the following:
block 1 is a map with fields "A" & "B". Field "A" is a link to block 2, Field "B" is a link to block 3. Block 2 is a map with a single field "C" that links to block 4, Block3 is a map with a single field D that links to block 4. Block 4 contains a map that links to a large sub DAG.

Now consider the following selector applied to block 1:

ExploreFields{
   "A": ExploreFields{
       "C": Matcher{}
   },
   "B": ExploreFields{
      "D": ExploreRecursive{
         ExploreAll{
             ExploreRecursiveEdge{}
         }
      }
   }
}

With LinkRevisit = false, the traversal encounteres block 4 first through the "A"/"C" path through block 2, and stops. When it encounters Block 4 a second time through the "B"/"D" path and block 4, it ignores it, meaning the rest of the sub dag below block 4 is never traversed.

With LinkRevisit = true, the traversal loads block 4 among both paths, and on the "B"/"D"/Block 3 path it follows the whole sub DAG, including it in the result set.

This is relevant since our most common use of selectors is to "get me all the blocks that get traversed here" as is used in go-graphsync.

@hannahhoward
Copy link
Collaborator

Follow-up: my comment is simply a restatement of @Stebalien's concern, which I think is serious.

@hannahhoward
Copy link
Collaborator

BTW, I am not sure whether selectors can be easily compare in a map as is done for only links.

@Stebalien
Copy link

Hm. Does the path matter? That is, at every node, selectors yield a set of sub-selectors. If two paths yield the same (Link, Sub-Selector) tuple, I think we can just skip that one. If it's path dependent, that's a problem we should try to solve.

@warpfork
Copy link
Collaborator Author

This logic isn't right. We need to track (Link, Selector) tuples, not just links. That is, we need to avoid visiting the same node with the same selector, but still need to revisit previously seen nodes with different selectors.

Shoot. Yeah, this is true in the general case.

There are also a bunch of cases where that's not really a problem (and those seem to predominate usage that I've seen and heard of so far -- ex1: it's just a blind recursive walk; ex2: protocols tend to have the same structures at the tops of blocks and so usually any logic that's sensitive to how we got here doesn't need to look above block boundaries in practice / has the same selector clauses landing on the same block even when it's been re-reached), in which case this overall has desirable behaviour. But we probably can't say that this is 100% of cases.

@hannahhoward's also right that it's not easy to put selectors in a map key right now. They're not hashable in the way that golang maps would demand. Not sure what we want to do about this.

@Stebalien
Copy link

There are also a bunch of cases where that's not really a problem (and those seem to predominate usage that I've seen and heard of so far -- ex1: it's just a blind recursive walk; ex2: protocols tend to have the same structures at the tops of blocks and so usually any logic that's sensitive to how we got here doesn't need to look above block boundaries in practice / has the same selector clauses landing on the same block even when it's been re-reached), in which case this overall has desirable behaviour. But we probably can't say that this is 100% of cases.

I wouldn't implement this unless we do it correctly.

They're not hashable in the way that golang maps would demand. Not sure what we want to do about this.

They're serializable to IPLD, right? Is there an efficient way to "hash" them that way?

Alternatively, we can probably just have a map[link][]selector and check for equality. The selector set should usually be 0 or 1.

@hannahhoward
Copy link
Collaborator

@Stebalien there are some missing features to serializing selectors to IPLD. Yes, in general selectors are always expressed as IPLD. however, we don't currently have code once we've parsed a selector to its executable form to put it back in IPLD. Moreover, I'm not totally sure if selectors at a specific point in traversal serialize easily. (one would imagine they do, but I remember the implementation being pretty messy for recursive selectors)

Anyway, all over-come-able, but more work.

Sidebar: important thing to consider -- is it possible in all cases to say whether one selector completely contains another selector (or to devise a hueristic for most cases). This is a good question to consider as it has implications both for this problem AND for future efforts at Graphsync request splitting.

@Stebalien
Copy link

Note: these questions are also very relevant for pinning & garbage collection.

I looked into re-serializing selectors at one point, and it didn't look too horrible. But I didn't look closely.

@warpfork
Copy link
Collaborator Author

warpfork commented Sep 16, 2021

The gnarliest bit is (indeed, and perhaps unsurprisingly) around the recursion clauses. The recursion system causes new selector objects to be yielded which don't have a direct correspondence to the original selector spec document that the traversal took as an original parameter. There is also some parts of the recursion selector state that would need to be explicitly excluded if we wanted to get a serialization that's useful for this map key (namely, depth limit decrementors).

So, I think if we made such a reserialization, it's unclear whether that would actually need to be in the same format as specifies a selector. And in any approach where we're trying to get a map key of that complexity, a lot of test coverage would be required to make sure we're doing it right.

@rvagg
Copy link
Member

rvagg commented Sep 25, 2021

I think if we made such a reserialization, it's unclear whether that would actually need to be in the same format as specifies a selector.

We have the option of keeping this an internal concern so it doesn't have to be pretty, just unique and stable for map keys. What kind of options do we have for such a serialization?

I'd be happy to help with filling out a suite of tests of edges to make sure this works properly.

traversal/fns.go Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Sep 29, 2021

I've cherry-picked this into #251 and used the tests I wrote there for SkipMe for this as well.

So I'd like to have this option available in the toolkit for cases where we know it's safe, but we need to be clear about the fact that it's not a hammer to be used on all nails (hence the doc suggestion here which I've incorporated into the cherry-pick). This is fine for an explore-all selector, but not for more complex selectors where revisits with changed selector state is valid.

My only outstanding concern here is that this is an opt-out, you have to set a false to make it not skip. So any existing uses of selectors will get a true unless they opt-out, and that could be a surprise break for those; and it might be a surprise default for people who don't quite understand the nuances here. Could we switch it around, make it SkipLinkRevisit, or NoRevisitLinks?

warpfork and others added 2 commits September 29, 2021 14:58
Normally, it's possible to visit the same blocks of data repeatedly during a traversal -- which, if you look at the world with sufficiently strict definitions, "makes sense", because traversals are really visiting `(path, node)` tuples, not just the nodes themselves.

But often, we find people really do only care about visiting nodes and don't particularly care to have the traversal inform them if it reaches that same data again by a different path.

With this patch, we track where we've been, at the granularity of links.  If we've encountered a link before, we won't load it again.

There's a configuration option for this, because in rare occasions, I could imagine not wanting this dedup; also, if one knows one doesn't need it, it may be valuable to turn it off because it does of course consume a small amount of memory to track where we've been.
…ather than defaulting to on.

This makes it a much less breaking change.
@warpfork
Copy link
Collaborator Author

@rvagg's additional docs added.

Good point about the change of default. I flipped it in the latest commit. The default is now as it was; the new behavior is opt-in. (Correspondingly, did rename the field. Naming bikeshed welcome.)

I think I'd be comfortable merging this now, since it has cautionary docs and is a non-default behavior.

(We also don't close the door to adding much smarter/more complicated memoization behavior in the future which works on (link+selector) tuple keys. I just don't know if I see that getting prioritized highly enough for the amount of work it would take, and suspect there might even be a research-tier problem there, so don't want to block for it either.)

@rvagg
Copy link
Member

rvagg commented Sep 30, 2021

I'm fine with this, though I've pulled the latest in to #251 which includes tests for it. So I'd like that PR considered which gets us coverage here and fixes SkipMe which is still broken on master.

@rvagg
Copy link
Member

rvagg commented Sep 30, 2021

included in #251 which is now merged

@rvagg rvagg closed this Sep 30, 2021
@rvagg rvagg deleted the traversal-track-visited-links branch September 30, 2021 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants