-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
traversal: track seen links, and revisit only if configured to do so. #252
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any concerns about memory implications of maintaining this map? I don't think we do but I have a small amount of anxiety.
Also just to confirm: since a map is inherently a pointer, it's shared among all created instances of traversal.Progress in single selector traversal?
Should we add a test to verify this functionality (i.e. LinkRevisit = false, check that we don't traverse a link twice?)
Basically support this PR 100% just want to make sure we're doing a thorough check of the implications for behavior.
Also, among other things, note that two versions of graphsync with and without this patch will be incompatable as the traversal is different. We had probably better make a PR that upgrades the protocol version and sets LinkTraversal = true when talking to a node with the old protocol version.
This logic isn't right. We need to track For example, we might traverse through a node to visit a single child via one selector, then traverse through that same node to visit all children. |
Then we need to avoid sending blocks we've seen multiple times, but that's something we should implement in graphsync. |
Also, fundamentally traversals are visiting (selector, path, node) tuples. This means turning LinkRevisit may result not only in a fewer nodes visited, but some child nodes not visted at all. consider the following: Now consider the following selector applied to block 1: ExploreFields{
"A": ExploreFields{
"C": Matcher{}
},
"B": ExploreFields{
"D": ExploreRecursive{
ExploreAll{
ExploreRecursiveEdge{}
}
}
}
} With LinkRevisit = false, the traversal encounteres block 4 first through the "A"/"C" path through block 2, and stops. When it encounters Block 4 a second time through the "B"/"D" path and block 4, it ignores it, meaning the rest of the sub dag below block 4 is never traversed. With LinkRevisit = true, the traversal loads block 4 among both paths, and on the "B"/"D"/Block 3 path it follows the whole sub DAG, including it in the result set. This is relevant since our most common use of selectors is to "get me all the blocks that get traversed here" as is used in go-graphsync. |
Follow-up: my comment is simply a restatement of @Stebalien's concern, which I think is serious. |
BTW, I am not sure whether selectors can be easily compare in a map as is done for only links. |
Hm. Does the path matter? That is, at every node, selectors yield a set of sub-selectors. If two paths yield the same |
Shoot. Yeah, this is true in the general case. There are also a bunch of cases where that's not really a problem (and those seem to predominate usage that I've seen and heard of so far -- ex1: it's just a blind recursive walk; ex2: protocols tend to have the same structures at the tops of blocks and so usually any logic that's sensitive to how we got here doesn't need to look above block boundaries in practice / has the same selector clauses landing on the same block even when it's been re-reached), in which case this overall has desirable behaviour. But we probably can't say that this is 100% of cases. @hannahhoward's also right that it's not easy to put selectors in a map key right now. They're not hashable in the way that golang maps would demand. Not sure what we want to do about this. |
I wouldn't implement this unless we do it correctly.
They're serializable to IPLD, right? Is there an efficient way to "hash" them that way? Alternatively, we can probably just have a |
@Stebalien there are some missing features to serializing selectors to IPLD. Yes, in general selectors are always expressed as IPLD. however, we don't currently have code once we've parsed a selector to its executable form to put it back in IPLD. Moreover, I'm not totally sure if selectors at a specific point in traversal serialize easily. (one would imagine they do, but I remember the implementation being pretty messy for recursive selectors) Anyway, all over-come-able, but more work. Sidebar: important thing to consider -- is it possible in all cases to say whether one selector completely contains another selector (or to devise a hueristic for most cases). This is a good question to consider as it has implications both for this problem AND for future efforts at Graphsync request splitting. |
Note: these questions are also very relevant for pinning & garbage collection. I looked into re-serializing selectors at one point, and it didn't look too horrible. But I didn't look closely. |
The gnarliest bit is (indeed, and perhaps unsurprisingly) around the recursion clauses. The recursion system causes new selector objects to be yielded which don't have a direct correspondence to the original selector spec document that the traversal took as an original parameter. There is also some parts of the recursion selector state that would need to be explicitly excluded if we wanted to get a serialization that's useful for this map key (namely, depth limit decrementors). So, I think if we made such a reserialization, it's unclear whether that would actually need to be in the same format as specifies a selector. And in any approach where we're trying to get a map key of that complexity, a lot of test coverage would be required to make sure we're doing it right. |
We have the option of keeping this an internal concern so it doesn't have to be pretty, just unique and stable for map keys. What kind of options do we have for such a serialization? I'd be happy to help with filling out a suite of tests of edges to make sure this works properly. |
I've cherry-picked this into #251 and used the tests I wrote there for So I'd like to have this option available in the toolkit for cases where we know it's safe, but we need to be clear about the fact that it's not a hammer to be used on all nails (hence the doc suggestion here which I've incorporated into the cherry-pick). This is fine for an explore-all selector, but not for more complex selectors where revisits with changed selector state is valid. My only outstanding concern here is that this is an opt-out, you have to set a |
Normally, it's possible to visit the same blocks of data repeatedly during a traversal -- which, if you look at the world with sufficiently strict definitions, "makes sense", because traversals are really visiting `(path, node)` tuples, not just the nodes themselves. But often, we find people really do only care about visiting nodes and don't particularly care to have the traversal inform them if it reaches that same data again by a different path. With this patch, we track where we've been, at the granularity of links. If we've encountered a link before, we won't load it again. There's a configuration option for this, because in rare occasions, I could imagine not wanting this dedup; also, if one knows one doesn't need it, it may be valuable to turn it off because it does of course consume a small amount of memory to track where we've been.
Co-authored-by: Rod Vagg <rod@vagg.org>
3f96c4b
to
71088ab
Compare
…ather than defaulting to on. This makes it a much less breaking change.
@rvagg's additional docs added. Good point about the change of default. I flipped it in the latest commit. The default is now as it was; the new behavior is opt-in. (Correspondingly, did rename the field. Naming bikeshed welcome.) I think I'd be comfortable merging this now, since it has cautionary docs and is a non-default behavior. (We also don't close the door to adding much smarter/more complicated memoization behavior in the future which works on |
I'm fine with this, though I've pulled the latest in to #251 which includes tests for it. So I'd like that PR considered which gets us coverage here and fixes |
included in #251 which is now merged |
Like it says on the tin.
Normally, it's possible to visit the same blocks of data repeatedly during a traversal -- which, if you look at the world with sufficiently strict definitions, "makes sense", because traversals are really visiting
(path, node)
tuples, not just the nodes themselves.But often, we find people really do only care about visiting nodes and don't particularly care to have the traversal inform them if it reaches that same data again by a different path.
With this patch, we track where we've been, at the granularity of links. If we've encountered a link before, we won't load it again.
There's a configuration option for this, because in rare occasions, I could imagine not wanting this dedup; also, if one knows one doesn't need it, it may be valuable to turn it off because it does of course consume a small amount of memory to track where we've been.