Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

[WIP] VerifyDepTree and DigestFromDirectory use godirwalk #1084

Closed
wants to merge 30 commits into from

Conversation

karrick
Copy link
Contributor

@karrick karrick commented Aug 29, 2017

What does this do / why do we need it?

Alters digest code to use godirwalk project. godirwalk handles symlinked directories on Windows correctly, and is faster than filepath.Walk on both unix and Windows.

What should your reviewer look out for in this PR?

  1. This adds godirwalk as a project dependency for dep.
  2. godirwalk has a very similar, but not identical, API to filepath.Walk.

Do you need help or clarification on anything?

I recall @sdboyer mentioning that there were some tests in the other portions of dep that were not working on Windows. I wonder whether those are related to the errant handling of symlinks to directories that happens with filepath on Windows, and whether godirwalk could address that.

Which issue(s) does this PR fix?

Related to #121.

case modeType&os.ModeNamedPipe != 0:
return nil // nothing more to do for this type
case modeType&os.ModeSocket != 0:
return nil // nothing more to do for this type
Copy link
Collaborator

@jmank88 jmank88 Aug 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this block of cases be simplified a bit to something like this:

case modeType&(os.ModeDir|os.ModeDevice|os.ModeCharDevice|os.ModeNamedPipe|os.ModeSocket) != 0:
  return nil // nothing more to do for these types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close:

case modeType&(os.ModeDir|os.ModeDevice|os.ModeCharDevice|os.ModeNamedPipe|os.ModeSocket) != 0:

Pushing update, and thanks!

@karrick
Copy link
Contributor Author

karrick commented Aug 30, 2017

I'd like to hold off on merging this just yet.

@sdboyer
Copy link
Member

sdboyer commented Aug 30, 2017 via email

@karrick
Copy link
Contributor Author

karrick commented Aug 30, 2017

It looks like the speedup on Mac is significantly more than on linux. Here, walking a large directory tree is as fast as doing so with find utility:

hostname:walk-fast karrick$ du -sh ~/src
 26G    /Users/karrick/src
hostname:walk-fast karrick$ for i in 1 2 3 ; do time walk-fast ~/src | wc -l ; done                                                                                             
  542946                                     

real    0m6.048s                             
user    0m1.279s                             
sys     0m2.667s                             
  542946                                     

real    0m6.022s                             
user    0m1.267s                             
sys     0m2.633s                             
  542946                                     

real    0m6.079s                             
user    0m1.278s                             
sys     0m2.682s                             
hostname:walk-fast karrick$ for i in 1 2 3 ; do time walk-stdlib ~/src | wc -l ; done                                                                                           
  542946                                     

real    0m29.369s                            
user    0m1.994s                             
sys     0m24.758s                            
  542946                                     

real    0m29.139s                            
user    0m1.987s                             
sys     0m25.127s                            
  542946                                     

real    0m29.092s                            
user    0m2.010s                             
sys     0m24.917s                            
hostname:walk-fast karrick$ for i in 1 2 3 ; do time find ~/src | wc -l ; done
  542946

real    0m5.979s
user    0m0.310s
sys     0m2.724s
  542946

real    0m5.823s
user    0m0.313s
sys     0m2.578s
  542946

real    0m5.723s
user    0m0.307s
sys     0m2.564s

@karrick
Copy link
Contributor Author

karrick commented Aug 30, 2017

When compared on Linux:

[karrick@linuxBox walk-fast]$ du -sh ~/src
36G     /home/karrick/src
[karrick@linuxBox walk-fast]$ for i in 1 2 3 ; do time walk-fast ~/src | wc -l ; done
829481

real    0m5.443s
user    0m4.246s
sys     0m3.698s
829481

real    0m5.415s
user    0m4.006s
sys     0m3.932s
829481

real    0m5.433s
user    0m4.418s
sys     0m3.529s
[karrick@linuxBox walk-fast]$ for i in 1 2 3 ; do time walk-stdlib ~/src | wc -l ; done
829481

real    0m8.298s
user    0m4.789s
sys     0m6.034s
829481

real    0m8.442s
user    0m4.835s
sys     0m6.144s
829481

real    0m8.352s
user    0m4.759s
sys     0m6.084s
[karrick@linuxBox walk-fast]$ for i in 1 2 3 ; do time find ~/src | wc -l ; done
829481

real    0m1.492s
user    0m0.669s
sys     0m0.921s
829481

real    0m1.474s
user    0m0.698s
sys     0m0.873s
829481

real    0m1.495s
user    0m0.636s
sys     0m0.949s

@karrick
Copy link
Contributor Author

karrick commented Aug 30, 2017

The changes to dep to use godirwalk are quite trivial.

Earlier today I was running some benchmarks, and I had found a bug in godirwalk code.

I would appreciate any thoughts on godirwalk code, however.

@karrick
Copy link
Contributor Author

karrick commented Aug 30, 2017

On Windows, walk-fast runs in half the time that walk-stdlib runs, by the way.

@karrick
Copy link
Contributor Author

karrick commented Aug 30, 2017

I have fixed the bug, but would appreciate any extra eyeballs on the code, and thoughts folks might have.

@karrick
Copy link
Contributor Author

karrick commented Aug 31, 2017

It's interesting to note that while godirwalk is definitely faster that filepath.Walk, in situations where the main bottleneck is doing some I/O on the discovered nodes, the performance gain is often marginalized.

However, using godirwalk in DigestFromDirectory and VerifyDepTree I was still able to get a performance improvement, although because file I/O now dominates the algorithm's time, the performance improvement is maybe 5--10% on darwin, and nearly 30% on my linux box.

[karrick@macHost]$ go test -bench Flame
goos: darwin
goarch: amd64
pkg: github.com/golang/dep/internal/gps/pkgtree
BenchmarkFlameDigestFromDirectoryFilepath-8            1	26226528585 ns/op
BenchmarkFlameDigestFromDirectory-8                    1	24313511131 ns/op
PASS
ok      github.com/golang/dep/internal/gps/pkgtree	50.561s
[karrick@linuxHost pkgtree]$ go test -bench Flame
goos: linux
goarch: amd64
pkg: github.com/golang/dep/internal/gps/pkgtree
BenchmarkFlameDigestFromDirectoryFilepath-12                   1        31579248880 ns/op
BenchmarkFlameDigestFromDirectory-12                           1        20610669195 ns/op
PASS
ok      github.com/golang/dep/internal/gps/pkgtree      52.233s

@karrick
Copy link
Contributor Author

karrick commented Aug 31, 2017

Okay, the previous benchmark was created by traversing the testdata_digest directory.

As expected, when I benchmark using $GOPATH/src, the process is much more I/O focussed, and the performance gains from traversing the tree are greatly mitigated by the additional I/O load of reading the files.

On RHEL 6.x

[karrick@linuxHost pkgtree]$ go test -bench FlameDigest
goos: linux
goarch: amd64
pkg: github.com/golang/dep/internal/gps/pkgtree
BenchmarkFlameDigestFromDirectoryFilepath-12                   1        5931829679 ns/op
BenchmarkFlameDigestFromDirectory-12                           1        5864872923 ns/op
PASS
ok      github.com/golang/dep/internal/gps/pkgtree      11.837s
[karrick@linuxHost pkgtree]$ du -sh ~/go/src
1.4G    /home/karrick/go/src

On macOS Sierra

[karrick@darwinHost pkgtree]$ go test -bench FlameDigest
goos: darwin
goarch: amd64
pkg: github.com/golang/dep/internal/gps/pkgtree
BenchmarkFlameDigestFromDirectoryFilepath-8   	       1	3563277755 ns/op
BenchmarkFlameDigestFromDirectory-8           	       1	3365962160 ns/op
PASS
ok  	github.com/golang/dep/internal/gps/pkgtree	6.949s
[karrick@darwinHost pkgtree]$ du -sh ~/go/src
1.6G	/Users/karrick/go/src

@karrick
Copy link
Contributor Author

karrick commented Sep 5, 2017

Rebased my fork due to upstream changes...

@sdboyer
Copy link
Member

sdboyer commented Sep 5, 2017

gonna try to get on reviewing this a bit later this week - sorry, taken up with case sensitivity stuff rn.

@karrick
Copy link
Contributor Author

karrick commented Sep 5, 2017

Case insensitivity... Not fun. I'm happy to help out on this project still. And looking forward to getting the digest sum added to the Gopkg.lock file.

@karrick
Copy link
Contributor Author

karrick commented Sep 6, 2017

It looks like the BoltDB tests are failing, causing the builds to not pass.

https://travis-ci.org/golang/dep/jobs/272128978

@karrick
Copy link
Contributor Author

karrick commented Sep 6, 2017

The tests that use the bolt db cache are passing on localhost, so it might be a race condition in source_cache_bolt_test.go...

@karrick
Copy link
Contributor Author

karrick commented Sep 6, 2017

@jmank88,

I have performed a fresh checkout on upstream on my Windows box, and running the bolt db tests, and sometimes they pass, and sometimes they fail.

img_4178

@jmank88
Copy link
Collaborator

jmank88 commented Sep 6, 2017

Restarted jobs

carolynvs and others added 17 commits September 10, 2017 10:57
* The setup and validation for all the importer testcases are the same,
  they just need to handle calling the importer properly.
* Flag importer tests to run in parallel now that they've been cleaned up.
The closure created by t.Run wasn't copying the testcase value, so the
test was being run on the wrong values.
When a lock hint is empty, nothing should be added to the lock
Fixes an error when the git user is not set.
Use the test helper so that it can check for any problems up-front.
Disable until we can figure out why this is causing the
following error on the AppVeyor(Windows) builds:

"remote repository at https://github.com/carolynvs/deptest-importers does
 not exist, or is inaccessible"
@googlebot
Copy link
Collaborator

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored by someone other than the pull request submitter. We need to confirm that they're okay with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this State. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

Copy link
Member

@sdboyer sdboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some small things.

are there any notable differences between the previous internal implementation and what's in godirwalk? i don't think there are, and i'd assume you'd have mentioned some, but making sure 😄

}
}

const flameIterations = 100000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain the benefit of having a fixed number of iterations, rather than relying (as BenchmarkDigestFromDirectory() does) on the testing system to either a) empirically determine an appropriate threshold and/or b) allow the user to control it via CLI args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When comparing flame graphs, I want to compare how many cycles were spent in the corresponding functions for the same input. Those numbers are skewed if we run version A 500 times, and version B only 100 times. Therefore, forcing the iteration count to a particular number ensures an apples to apples comparison.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, wfm 👍

@@ -0,0 +1,103 @@
package pkgtree
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a copyright header (ugh, this should've caused our travis tests to fail...).

also, for consistency, filename should be snaky, not camel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer snaky filenames, or even better with hyphens, but most other Go projects use camel case for file names. Is there something I don't know about, or just a convention localized to dep?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the uniform approach in both stdlib and the toolchain, afaik. that's the convention we're aiming to follow.

@karrick
Copy link
Contributor Author

karrick commented Sep 10, 2017

Somehow rebasing from upstream/master and pushing to my fork causes some problems. I will take a look at this later.

@sdboyer
Copy link
Member

sdboyer commented Sep 15, 2017 via email

@karrick
Copy link
Contributor Author

karrick commented Sep 16, 2017

I don't mind renaming the files, however I've been on business trip this past week and procrastinating the experience of understanding what I did wrong when I attempted my last git rebase. Thanks for your patience.

@carolynvs carolynvs removed their request for review September 19, 2017 01:28
@sdboyer
Copy link
Member

sdboyer commented Sep 20, 2017

yeah, seems like we've got some rebase weirdness. i'm happy to help you sort out a workable process - here, or pinging me on slack works, as well 😄

@sdboyer
Copy link
Member

sdboyer commented Oct 2, 2017

would it be easier to close this out and open a fresh PR?

@sdboyer
Copy link
Member

sdboyer commented Nov 15, 2017

i'm going to close this out and open a fresh PR once i have a chance to get back to this.

@sdboyer sdboyer closed this Nov 15, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants