Feat: depth limited refs -r #5337

hsanjuan · 2018-08-03T10:31:16Z

This adds --max-depth to the "refs" commands and allows limiting
the fetching of refs per depth. Other than that, it works as before.

Note that clever branch pruning is only done when the --unique flag
is passed. Otherwise, we re-explore branches to the given depth.

First minimal approach. I wonder if we could utilize the EnumerateChildren functions here instead, @Stebalien ? Doing the printing inside the custom visit function. I'm not sure if there are reasons why the DAG traversal logic was re-implemented separately for this command.

Also, now or later, I would like to do an --async version of this, so it would be very easy if we re-use EnumerateChildrenAsync().

hsanjuan · 2018-08-03T10:31:56Z

Note: I am still missing some sharness tests. I will add them when we have a final approach.

Stebalien · 2018-08-07T01:25:05Z

core/commands/refs.go

+// true otherwise. The second return argument indicates whether the Cid was seen
+// before.
+func (rw *RefWriter) visit(c *cid.Cid, depth int) (bool, bool) {
+	if rw.seen == nil {


I believe disabling unique by default is a memory optimization. If it's disabled, we should avoid using memory linear in the number of keys. We should probably have a check for the Unique flag up-top and, in that case, only check the depth (don't store anything).

It may also be simpler to have one boolean indicate that we should continue traversing and the other indicate that we should return the CID to the user. That may simplify some of the other logic. It'll also allow us to say "return this but don't traverse it" which, unless I'm mistaken, should save us a bit of work (possibly saving us from fetching a node we don't need to traverse).

Stebalien · 2018-08-07T01:32:35Z

> I'm not sure if there are reasons why the DAG traversal logic was re-implemented separately for this command.

I believe this command came before the EnumerateChildren functions. We could probably re-use that logic (although we should probably use the non-async one unless the user passes some --in-order=false flag as users may be relying on traversal order...).

hsanjuan · 2018-08-16T12:06:38Z

@Stebalien:

I think I have addressed both of your comments. Traversal logic is slightly simpler now.
I tried to use merkledag.EnumerateChildren, but it doesn't work while maintaining the custom --format options (they need the parent and link name, apart from the current Cid), so I left things as they were. Do you think we should provide an async/--in-order=false method where we disregard the custom --format flag? I'd rather not duplicate the EnumerateChildrenAsync logic. (This would be a different PR.)

Stebalien · 2018-08-20T20:27:38Z

core/commands/refs.go

-		return 0, nil
+// visit returns two values:
+// - first indicates if we should keep traversing the DAG.
+// - second indicates if the given Cid should be printed to the user.


"indicates" is ambiguous. In the first case, it means "is true" and, in the second case, "is false".

Personally, I'd rather:

- Say "is true" (explicitly).
- Invert the second case (i.e., true means print).

Stebalien · 2018-08-20T20:29:00Z

core/commands/refs.go

 	nc := n.Cid()

 	var count int
 	for i, ng := range ipld.GetDAG(rw.Ctx, rw.DAG, n) {
 		lc := n.Links()[i].Cid
-		if rw.skip(lc) {
-			continue
+		unexplored, written := rw.visit(lc, depth+1) // The children are at depth+1


"written" isn't quite correct. Really, it means "don't write" (we may not have already written it, unless I'm mis-reading the code).

Note: if we invert this case, this'll obviously become "shouldWrite" or something like that. I'm just dropping a comment here so we don't miss it.

Stebalien · 2018-08-20T20:35:49Z

core/commands/refs.go

+	// We do not track a set of visited nodes in this case.
+	// We do not print anything too deep though.
+	if !rw.Unique {
+		return !overMaxDepth, overMaxDepth


Don't we want to not explore nodes at max depth as well?

overMaxDepth is > MaxDepth, so we explore at maxDepth.

Sorry, I think I misread the double negation. Right now I think it works as it should i.e.

--max-depth 1 prints just the direct children of the given CID, without fetching them (as they're links from the parent and we know we can't go deeper). Thus, items at maxdepth are never dag.Get(), but they are visited and potentially printed.

Stebalien · 2018-08-20T20:36:01Z

core/commands/refs.go

+	// Never explore over max-depth. Never print nodes over
+	// max depth.
+	if overMaxDepth {
+		return false, false


Ditto about exploring nodes at max depth.

Also, shouldn't this check be higher up? No need to do anything if we're over the max depth.

Yes, correct.

Stebalien · 2018-08-20T20:38:35Z

core/commands/refs.go

+		}
+
+		if !recursive {
+			maxDepth = 1 // write only direct refs


Should this be 0? If I'm not mistaken, this patch will currently explore two levels when recursive is disabled. If that's the case, can we also write a sharness test?

I will write extra sharness tests. But refs, as it is, prints the references from the root, that is, it prints things with maxDepth = 1. Root is 0, its children are 1. Refs doesn't print the root CID. That's why the existing tests pass, behaviour hasn't changed.

I see. This looks correct.
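The depth convention discussed above (root at depth 0, its children at depth 1, the root itself never printed) can be sketched on a toy in-memory DAG. Everything here is an assumption made for illustration: the `dag` type, the `refs` helper, and the use of a negative value for "unlimited" are not go-ipfs APIs. The sketch covers only the non-unique case.

```go
package main

import "fmt"

// dag is a hypothetical in-memory DAG: node name -> ordered link targets.
type dag map[string][]string

// refs lists the references below root, down to maxDepth.
// The root is at depth 0 and is never listed; its children are at depth 1.
// Assumption: maxDepth < 0 means "no limit".
func refs(d dag, root string, maxDepth int) []string {
	var out []string
	var walk func(name string, depth int)
	walk = func(name string, depth int) {
		for _, child := range d[name] {
			childDepth := depth + 1
			if maxDepth >= 0 && childDepth > maxDepth {
				return // all siblings share this depth, so stop here
			}
			out = append(out, child)
			walk(child, childDepth)
		}
	}
	walk(root, 0)
	return out
}

// sampleDAG returns a small DAG in which "c" is linked from two parents.
func sampleDAG() dag {
	return dag{"root": {"a", "b"}, "a": {"c"}, "b": {"c"}}
}

func main() {
	d := sampleDAG()
	fmt.Println(refs(d, "root", 1))  // direct refs only, like the non-recursive case
	fmt.Println(refs(d, "root", -1)) // unlimited depth
	fmt.Println(refs(d, "root", 0))  // nothing: only the root is at depth 0
}
```

With this numbering, a limit of 0 would list nothing at all, which is why the non-recursive case maps to a limit of 1.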

Stebalien · 2018-08-20T20:41:03Z

2

We can implement new features later (e.g., when we need them).

magik6k · 2018-08-20T20:40:34Z

core/commands/refs.go

-		rw.seen = cid.NewSet()
+	// Never explore over max-depth. Never print nodes over
+	// max depth.
+	if overMaxDepth {


I'd put this just below if !rw.Unique {..}

magik6k · 2018-08-20T20:50:39Z

core/commands/refs.go

+	//   - We saw it higher (smaller depth) in the DAG (means we must have
+	//     explored deep enough before)
+	if ok && (rw.MaxDepth < 0 || oldDepth <= depth) {
+		return false, true


Shouldn't this be return false, false?

Sorry, the docs are not clear. true means that the CID was printed before, which is the case. I will take @Stebalien's proposal though and invert it.

hsanjuan · 2018-08-20T23:46:33Z

I have clarified the doc comments and inverted the second return value (and moved up a check). As said, it does print items at MaxDepth, but not "over max depth". ~~Right now, it won't dag.Get items who are at maxdepth, because their children will be over it and thus won't need printing~~ (edit: it does get them, it does not get items over max depth).

If we are good with the code so far, I'll proceed to create some sharness tests (I already did a fair amount of manual testing).

hsanjuan · 2018-08-20T23:53:23Z

2

We can implement new features later (e.g., when we need them).

Cluster would benefit from async, faster refs -r, as we do that to fetch content before pinning.

hsanjuan · 2018-08-21T01:12:30Z

Sorry, it's late. It does dag.Get() items at MaxDepth. But I'm thinking this is what we want, because we want to use refs -r not only to print but to fetch blocks. So if it's printed it should be fetched (?)

Stebalien · 2018-08-21T18:26:04Z

> So if it's printed it should be fetched

I was under the impression that we didn't get all nodes (e.g., we don't need to fetch raw leaves as we know they won't have links). However, it turns out this isn't the case.

Given how we tend to use this, I think it's reasonable to actually fetch all the nodes. In the future, we can add an option that avoids this.

Stebalien · 2018-08-21T21:51:25Z

> Sorry, it's late. It does dag.Get() items at MaxDepth. But I'm thinking this is what we want, because we want to use refs -r not only to print but to fetch blocks. So if it's printed it should be fetched (?)

Actually, I'm pretty sure it'll fetch the children as well. FetchGraph will actually initiate a fetch and nd.Get will simply wait for the fetch to complete. We should probably just:

- Avoid going deeper once we've hit max depth (not wait until we go over the limit).
- Move the unexplored check after the nd.Get(...) (possibly adding some short-circuit to avoid fetching the block if we're neither printing it nor exploring it).

hsanjuan · 2018-08-23T15:40:33Z

@Stebalien ok, another round. I realized I was ignoring the optimization that justified using promises: we do not need to do Get when pruning on an already seen cid/branch. So now:

1. ipld.GetDAG() returns promises for all the children of n, which we loop over.
2. We visit(): it returns whether we goDeeper and whether we shouldWrite. goDeeper is false if the Cid is at MaxDepth (and not over it as before).
3. New: We can skip doing Get on the promise if we printed the Cid before (meaning we did a Get() before) and must not go deeper. In that case we move on to the next child. This justifies using promises in the first place. Otherwise we could just loop over Links() and DAG.Get() every node.
4. We do nd.Get() at this point (either the node is new/not-printed, or we are going deeper). When !Unique we must get/print all nodes anyway because we don't know if they were Get before.
5. New: We increase count. I think this was buggy in the original implementation and it stayed at 0 for recursive refs. count is not used and I'm not sure what it should count. Total number of references? Unique number of references? Number of times we Get()? Whatever it is, the result may not make much sense depending on how branch pruning happens now that MaxDepth is in place.
6. We WriteRef if we must.
7. If goDeeper, we go deeper; otherwise, we work on the next Link.

So if I'm not wrong (again) this should:

- Not Get() nodes which were already Get()
- Not visit any cids which are over MaxDepth, thus never get any of their nodes
- Correctly Get() nodes to the right depth.
- Only print nodes which have been successfully Get(). I don't think we should print a Reference if we were not able to fetch its node even if we know its cid (this should be the same behaviour as before).

fingers crossed
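The steps listed above can be sketched end to end on a toy store. This is an illustration under stated assumptions, not the code in this PR: `node`, `store`, `promise`, `walker` and `run` are hypothetical names, the sketch always behaves as if --unique were set, a negative max depth is taken to mean "unlimited", and the promise here is purely lazy, so the `gets` counter counts calls to get rather than real network fetches.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a DAG node: a name plus ordered links.
type node struct {
	name  string
	links []string
}

// store is a hypothetical in-memory block store that counts get calls.
type store struct {
	nodes map[string]node
	gets  int
}

// promise is a lazy handle for one child; nothing is read until get is called.
type promise struct {
	st   *store
	name string
}

func (p promise) get() node {
	p.st.gets++
	return p.st.nodes[p.name]
}

type walker struct {
	st       *store
	maxDepth int            // assumption: negative means unlimited
	seen     map[string]int // shallowest depth at which each name was visited
	out      []string
}

// visit reports whether to go deeper below name and whether to write it.
func (w *walker) visit(name string, depth int) (goDeeper, shouldWrite bool) {
	limited := w.maxDepth >= 0
	if limited && depth > w.maxDepth {
		return false, false
	}
	old, ok := w.seen[name]
	if ok && (!limited || old <= depth) {
		return false, false // already written and explored deeply enough
	}
	w.seen[name] = depth
	atMax := limited && depth == w.maxDepth
	return !atMax, !ok
}

func (w *walker) walk(n node, depth int) {
	for _, link := range n.links {
		p := promise{st: w.st, name: link}
		goDeeper, shouldWrite := w.visit(link, depth+1) // children are at depth+1
		if !goDeeper && !shouldWrite {
			continue // pruned branch: the promise is never resolved
		}
		child := p.get()
		if shouldWrite {
			w.out = append(w.out, child.name)
		}
		if goDeeper {
			w.walk(child, depth+1)
		}
	}
}

// run walks a small DAG in which "c" is reachable through both "a" and "b".
func run(maxDepth int) ([]string, int) {
	st := &store{nodes: map[string]node{
		"root": {"root", []string{"a", "b"}},
		"a":    {"a", []string{"c"}},
		"b":    {"b", []string{"c"}},
		"c":    {"c", nil},
	}}
	w := &walker{st: st, maxDepth: maxDepth, seen: map[string]int{}}
	w.walk(st.nodes["root"], 0)
	return w.out, st.gets
}

func main() {
	out, gets := run(-1)
	fmt.Println(out, gets) // "c" is reached twice but resolved only once
}
```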

Stebalien · 2018-08-23T18:26:19Z

core/commands/refs.go

-			return count, err
+		// Avoid "Get()" on the node. We did a Get on it before
+		// (we printed it) and must not go deeper. This is an
+		// optimization for pruned branches.


This comment is technically incorrect, unless I'm mistaken. We have already printed the node and/or we've pruned the branch.

I'm going to rewrite it but I don't fully understand and I think it's AND only. We have already printed the node AND we've pruned the branch.

If we have not printed the node, we need to Get() it because it means we haven't seen it before, even if we are not going deeper (because we hit the MaxDepth for example).

If we are going deeper but we already printed the node, we need to Get() it to be able to make the recursive call (I think this is the case when, given a depth limit, we encounter an already explored branch higher in the tree, thus we can explore it deeper despite part of it already being printed).

So it has to be !shouldPrint && !goDeeper.
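The condition argued for above is small enough to enumerate. The helper name `needsGet` is hypothetical; it only restates the rule that the Get() can be skipped when both flags are false, which is the negation of `!shouldPrint && !goDeeper`.

```go
package main

import "fmt"

// needsGet is a hypothetical helper: the child must be fetched unless it is
// neither going to be printed nor explored further.
func needsGet(goDeeper, shouldPrint bool) bool {
	return goDeeper || shouldPrint
}

func main() {
	for _, goDeeper := range []bool{false, true} {
		for _, shouldPrint := range []bool{false, true} {
			fmt.Printf("goDeeper=%-5v shouldPrint=%-5v get=%v\n",
				goDeeper, shouldPrint, needsGet(goDeeper, shouldPrint))
		}
	}
}
```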

Stebalien · 2018-08-23T18:27:01Z

core/commands/refs.go

 		}

+		// We must write it because it's new, or go deeper. In any case


s/write/get

Stebalien · 2018-08-23T18:27:44Z

core/commands/refs.go

 		nd, err := ng.Get(rw.Ctx)
 		if err != nil {
 			return count, err
 		}
+		count++


Shouldn't we only count it if we print it?

Stebalien · 2018-08-23T18:33:19Z

core/commands/refs.go

+	// or is lower than last time.
+	// We print if it was not seen.
+	rw.seen[key] = depth
+	return !atMaxDepth, !ok


The comments in this function are awesome.

❤️

hsanjuan · 2018-08-24T11:21:37Z

@Stebalien another round. I think I'll write the sharness tests next.

Stebalien · 2018-08-24T21:12:48Z

LGTM! (modulo tests).

magik6k

LGTM too (-tests)

This adds --max-depth to the "refs" commands and allows limiting the fetching of refs per depth. Other than that, it works as before.

Note that clever branch pruning is only done when the --unique flag is passed. Otherwise, we re-explore branches to the given depth.

This means that --unique costs memory, but may save time when the DAGs contain the same sub-DAGs in several places (especially if they are big). On the other hand, not using --unique saves memory but may involve re-exploring large sub-DAGs.

License: MIT
Signed-off-by: Hector Sanjuan <hector@protocol.ai>

License: MIT Signed-off-by: Hector Sanjuan <hector@protocol.ai>
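The memory/time trade-off described in the commit message can be made concrete with a toy DAG that contains the same sub-DAG twice. This is a sketch only: `dag`, `explore` and `sharedDAG` are hypothetical names, depth limits are left out, and "expanded" simply counts how many nodes the walk descends into.

```go
package main

import "fmt"

// dag is a hypothetical in-memory DAG: node name -> ordered link targets.
type dag map[string][]string

// explore walks everything below root. It returns how many nodes were
// expanded and how many entries the visited set holds at the end.
// With unique=false nothing is tracked, so shared sub-DAGs are walked again.
func explore(d dag, root string, unique bool) (expanded, tracked int) {
	seen := map[string]bool{}
	var walk func(name string)
	walk = func(name string) {
		for _, child := range d[name] {
			if unique {
				if seen[child] {
					continue // prune: this branch was already explored
				}
				seen[child] = true
			}
			expanded++
			walk(child)
		}
	}
	walk(root)
	return expanded, len(seen)
}

// sharedDAG returns a DAG in which the sub-DAG under "s" has two parents.
func sharedDAG() dag {
	return dag{"root": {"a", "b"}, "a": {"s"}, "b": {"s"}, "s": {"x", "y"}}
}

func main() {
	d := sharedDAG()
	fmt.Println(explore(d, "root", false)) // more work, no tracked entries
	fmt.Println(explore(d, "root", true))  // less work, one entry per distinct node
}
```

In this toy case the non-unique walk expands 8 nodes and tracks nothing, while the unique walk expands 5 and keeps 5 set entries; the gap grows with the size of the shared sub-DAG.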

hsanjuan · 2018-08-27T18:39:05Z

Thanks @Stebalien @magik6k . I have added some sharness tests now (last commit).

Stebalien · 2018-08-27T19:29:25Z

Jenkins passes and @magik6k has already reviewed (modulo tests). 🚅

hsanjuan self-assigned this Aug 3, 2018

hsanjuan requested a review from Stebalien August 3, 2018 10:31

hsanjuan requested a review from Kubuxu as a code owner August 3, 2018 10:31

ghost added the status/in-progress In progress label Aug 3, 2018

Stebalien reviewed Aug 7, 2018

Stebalien added need/author-input Needs input from the original author need_tests labels Aug 16, 2018

hsanjuan force-pushed the feat/depth-limited-refs branch from 0a0ae24 to f2c84c3 Compare August 16, 2018 11:50

Mr0grog mentioned this pull request Aug 16, 2018

[WIP] unixfs: decouple the DAG traversal logic from the DAG reader #5257

Closed

Stebalien requested changes Aug 20, 2018

magik6k reviewed Aug 20, 2018

hsanjuan force-pushed the feat/depth-limited-refs branch from 4d49f37 to 6052f33 Compare August 20, 2018 23:32

Stebalien removed the need_tests label Aug 21, 2018

hsanjuan force-pushed the feat/depth-limited-refs branch 3 times, most recently from 6e0f86d to ac5bea2 Compare August 23, 2018 15:27

hsanjuan force-pushed the feat/depth-limited-refs branch from ac5bea2 to d1be006 Compare August 23, 2018 15:53

Stebalien reviewed Aug 23, 2018

Stebalien requested changes Aug 23, 2018

Stebalien added the need_tests label Aug 24, 2018

magik6k reviewed Aug 25, 2018

hsanjuan force-pushed the feat/depth-limited-refs branch from ae896f4 to 35a02ff Compare August 27, 2018 18:31

hsanjuan added 2 commits August 27, 2018 20:34

Add sharness tests for the refs -r command using --max-depth

fe89e2e

License: MIT Signed-off-by: Hector Sanjuan <hector@protocol.ai>

hsanjuan force-pushed the feat/depth-limited-refs branch from 35a02ff to fe89e2e Compare August 27, 2018 18:35

Stebalien approved these changes Aug 27, 2018

Stebalien added need/review Needs a review and removed status/in-progress In progress need_tests need/author-input Needs input from the original author labels Aug 27, 2018

Stebalien requested review from magik6k and removed request for magik6k August 27, 2018 19:11

Stebalien added RFM and removed need/review Needs a review labels Aug 27, 2018

Stebalien merged commit 66b54d9 into master Aug 27, 2018

hsanjuan deleted the feat/depth-limited-refs branch August 28, 2018 09:09
