Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: depth limited refs -r #5337

Merged
merged 2 commits into from
Aug 27, 2018
Merged

Feat: depth limited refs -r #5337

merged 2 commits into from
Aug 27, 2018

Conversation

hsanjuan
Copy link
Contributor

@hsanjuan hsanjuan commented Aug 3, 2018

This adds --max-depth to the "refs" commands and allows limiting
the fetching of refs per depth. Other than that, it works as before.

Note that clever branch pruning is only made when the --unique flag
is passed. Otherwise, we re-explore branches to the given depth.


First minimal approach. I wonder if we could utilize the EnumerateChildren functions here instead, @Stebalien ? Doing the printing inside the custom visit function. I'm not sure if there are reasons why the DAG traversal logic was re-implemented separately for this command.

Also, now or later, I would like to do an --async version of this, so it would be very easy if we re-use EnumerateChildrenAsync().

@hsanjuan hsanjuan self-assigned this Aug 3, 2018
@hsanjuan hsanjuan requested a review from Stebalien August 3, 2018 10:31
@hsanjuan hsanjuan requested a review from Kubuxu as a code owner August 3, 2018 10:31
@ghost ghost added the status/in-progress In progress label Aug 3, 2018
@hsanjuan
Copy link
Contributor Author

hsanjuan commented Aug 3, 2018

Note: i am missing some sharness tests still. Will do that when we have a final approach.

// true otherwise. The second return argument indicates whether the Cid was seen
// before.
func (rw *RefWriter) visit(c *cid.Cid, depth int) (bool, bool) {
if rw.seen == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe disabling unique by default is a memory optimization. If it's disabled, we should avoid using memory linear in the number of keys. We should probably have check for the Unique flag up-top and, in that case, only check the depth (don't store anything).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may also be simpler to have one boolean indicate that we should continue traversing and the other indicate that we should return the CID to the user. That may simplify some of the other logic. It'll also allow us to say "return this but don't traverse it which, unless I'm mistaken, should save us a bit of work (possibly saving us from fetching a node we don't need to traverse).

@Stebalien
Copy link
Member

I'm not sure if there are reasons why the DAG traversal logic was re-implemented separately for this command.

I believe this command came before the EnumerateChildren functions. We could probably re-use that logic (although we should probably use the non-async one unless the user passes some --in-order=false flag as users may be relying on traversal order...).

@Stebalien Stebalien added need/author-input Needs input from the original author need_tests labels Aug 16, 2018
@hsanjuan
Copy link
Contributor Author

@Stebalien:

  1. I think I have addressed both of your comments. Traversal logic is slightly simpler now.

  2. I tried to use merkledag.EnumerateChildren but doesn't work maintaining the custom --format options (needs parent and linkname, apart from the curren Cid), so I left things as they were. Do you think we should provide an async/--in-oder=false method where we disregard the custom --format flag? I'd rather not duplicate the EnumerateChildrenAsync logic. (this would be different PR).

return 0, nil
// visit returns two values:
// - first indicates if we should keep traversing the DAG.
// - second indicates if the given Cid should be printed to the user.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"indicates" is ambiguous. In the first case, it means "is true" and, in the second case, "is false".

Personally, I'd rather:

  1. Say "is true" (explicitly).
  2. Invert the second case (i.e., true means print).

nc := n.Cid()

var count int
for i, ng := range ipld.GetDAG(rw.Ctx, rw.DAG, n) {
lc := n.Links()[i].Cid
if rw.skip(lc) {
continue
unexplored, written := rw.visit(lc, depth+1) // The children are at depth+1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"written" isn't quite correct. Really, it means "don't write" (we may not have already written it, unless I'm mis-reading the code).

Note: if we invert this case, this'll obviously become "shouldWrite" or something like that. I'm just dropping a comment here so we don't miss it.

// We do not track a set of visited nodes in this case.
// We do not print anything too deep though.
if !rw.Unique {
return !overMaxDepth, overMaxDepth
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to not explore nodes at max depth as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overMaxDepth is > MaxDepth, so we explore at maxDepth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think I misread the double negation. Right now I think it works as it should i.e.

--max-depth 1 prints just the direct children of the given CID, without fetching them (as they're links from the parent and we know we can't go deeper). Thus, items at maxdepth are never dag.Get(), but they are visited and potentially printed.

// Never explore over max-depth. Never print nodes over
// max depth.
if overMaxDepth {
return false, false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about exploring nodes at max depth.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, shouldn't this check be higher up? No need to do anything if we're over the max depth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, correct.

}

if !recursive {
maxDepth = 1 // write only direct refs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 0? If I'm not mistaken, this patch will currently explore two levels when recursive is disabled. If that's the case, can we also write a sharness test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will write extra sharness tests. But refs, as it is, prints the references from the root, that is, it prints things with maxDepth = 1. Root is 0, its children are 1. Refs doesn't print the root CID. That's why the existing tests pass, behaviour hasn't changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This looks correct.

@Stebalien
Copy link
Member

2

We can implement new features later (e.g., when we need them).

rw.seen = cid.NewSet()
// Never explore over max-depth. Never print nodes over
// max depth.
if overMaxDepth {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put this just below if !rw.Unique {..}

// - We saw it higher (smaller depth) in the DAG (means we must have
// explored deep enough before)
if ok && (rw.MaxDepth < 0 || oldDepth <= depth) {
return false, true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be return false, false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, docs are not clear. true means that the CID was printed before, which is the case. I will take @Stebalien proposal though and invert.

@hsanjuan
Copy link
Contributor Author

hsanjuan commented Aug 20, 2018

I have clarified the doc the comments and inverted the second return value (and moved up a check). As said, it does print items at MaxDepth, but not "over max depth". Right now, it won't dag.Get items who are at maxdepth, because their children will be over it and thus won't need printing (edit: it does get them, it does not get items over max depth).

If we're are good with the code so far, I'll proceed to create some sharness tests (already did a fair amount of manual testing).

@hsanjuan
Copy link
Contributor Author

2

We can implement new features later (e.g., when we need them).

Cluster would benefit from async, faster refs -r, as we do that to fetch content before pinning.

@hsanjuan
Copy link
Contributor Author

Sorry, it's late. It does dag.Get() items at MaxDepth. But I'm thinking this is what we want, because we want to use refs -r not only to print but to fetch blocks. So if it's printed it should be fetched (?)

@Stebalien
Copy link
Member

So if it's printed it should be fetched

I was under the impression that we didn't get all nodes (e.g., we don't need to fetch raw leaves as we know they won't have links). However, it turns out this isn't the case.

Given how we tend to use this, I think it's reasonable to actually fetch all the nodes. In the future, we can add an option that avoids this.

@Stebalien
Copy link
Member

Sorry, it's late. It does dag.Get() items at MaxDepth. But I'm thinking this is what we want, because we want to use refs -r not only to print but to fetch blocks. So if it's printed it should be fetched (?)

Actually, I'm pretty sure it'll fetch the children as well. FetchGraph will actually initiate a fetch and nd.Get will simply wait for the fetch to complete. We should probably just:

  1. Avoid going deeper once we've hit max depth (not wait until we go over the limit).
  2. Move the unexplored check after the nd.Get(...) (possibly adding some short-circuit to avoid fetching the block if we're nether printing it nor exploring it).

@hsanjuan hsanjuan force-pushed the feat/depth-limited-refs branch 3 times, most recently from 6e0f86d to ac5bea2 Compare August 23, 2018 15:27
@hsanjuan
Copy link
Contributor Author

hsanjuan commented Aug 23, 2018

@Stebalien ok, another round. I realized I was ignoring the optimization that justified using promises: we do not need to do Get when pruning on an already seen cid/branch. So now:

  1. ipld.GetDAG() returns promises for all the children of n, which we loop.
  2. We visit(): returns if we goDeeper and shouldWrite. goDeeper is false if the Cid is at MaxDepth (and not over it as before).
  3. New: We can skip doing Get on the promise if we printed the Cid before (meaning we did a Get() before) and must not go deeper. In that case we move on to next child. This justifies using promises in the first place. Otherwise we could just loop Links() and DAG.Get() every node.
  4. We do nd.Get() at this point (either the node is new/not-printed, or we are going deeper). When !Unique we must get/print all nodes anyway because we don't know if they were Get before.
  5. New: We increase count. I think this was buggy in original implementation and it stayed at 0 for recursive refs. count is not used and I'm not sure what it should count. Total number of references ? Unique number of references ? Number of times we Get() ? Whatever it is, the result may not have much sense depending on how branch pruning happens now that MaxDepth is in place.
  • We WriteRef if we must
  • If goDeeper, we go deeper, otherwise, work next Link.

So if I'm not wrong (again) this should:

  • Not Get() nodes which were already Get()
  • Not visit any cids which are over MaxDepth, thus never get any of their nodes
  • Correctly Get() nodes to the right depth.
  • Only print nodes which have been successfully Get(). I don't think we should print a Reference if we were not able to fetch its node even if we know it's cid (this should be same behaviour as before).

fingers crossed

return count, err
// Avoid "Get()" on the node. We did a Get on it before
// (we printed it) and must not go deeper. This is an
// optimization for pruned branches.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is technically incorrect, unless I'm mistaken. We have already printed the node and/or we've pruned the branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to rewrite it but I don't fully understand and I think it's AND only. We have already printed the node AND we've pruned the branch.

If we have not printed the node, we need to Get() it because it means we haven't seen it before, even if we are not going deeper (because we hit the MaxDepth for example).

If we are going deeper but we already printed the node, we need to Get() it to be able to make the recursive call (I think this is the case when, given a depth limit, we encounter an already explored branch higher in the tree, thus we can explore it deeper despite part of it already being printed).

So it has to be !shouldPrint && !goDeeper.

}

// We must write it because it's new, or go deeper. In any case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/write/get

nd, err := ng.Get(rw.Ctx)
if err != nil {
return count, err
}
count++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we only count it if we print it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sgtm

// or is lower than last time.
// We print if it was not seen.
rw.seen[key] = depth
return !atMaxDepth, !ok
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments in this function are awesome.

❤️

@hsanjuan
Copy link
Contributor Author

@Stebalien another round. I think I'll write the sharness tests next.

@Stebalien
Copy link
Member

LGTM! (modulo tests).

Copy link
Member

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too (-tests)

This adds --max-depth to the "refs" commands and allows limiting
the fetching of refs per depth. Other than that, it works as before.

Note that clever branch pruning is only made when the --unique flag
is passed. Otherwise, we re-explore branches to the given depth.

This means that --unique costs memory, but may save time when
the DAGs contain the same sub-DAGs in several places (specially if
they are big). On the other side, not using --unique saves
memory but may involve re-exploring large sub-DAGs.

License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>
@hsanjuan
Copy link
Contributor Author

Thanks @Stebalien @magik6k . I have added some sharness tests now (last commit).

@Stebalien Stebalien added need/review Needs a review and removed status/in-progress In progress need_tests need/author-input Needs input from the original author labels Aug 27, 2018
@Stebalien Stebalien requested review from magik6k and removed request for magik6k August 27, 2018 19:11
@Stebalien Stebalien added RFM and removed need/review Needs a review labels Aug 27, 2018
@Stebalien
Copy link
Member

Jenkins passes and @magik6k has already reviewed (modulo tests). 🚅

@Stebalien Stebalien merged commit 66b54d9 into master Aug 27, 2018
@hsanjuan hsanjuan deleted the feat/depth-limited-refs branch August 28, 2018 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants