Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

gps: Refactor prune and filesystem functions #1219

Merged
merged 11 commits into from
Dec 18, 2017

Conversation

ibrasho
Copy link
Collaborator

@ibrasho ibrasho commented Sep 24, 2017

What does this do / why do we need it?

This PR moves filesystemState from being a test struct to become part of gps. It's used to optimize the prune function by calculating the filesystem state and determining the files to prune in-memory.

addition: This PR also includes the remaining plumbing required to make prune part of ensure.

What should your reviewer look out for in this PR?

Correctness and optimizations as usual. 😁

Do you need help or clarification on anything?

Any suggestion on how to handle striping vendor directories?

Which issue(s) does this PR fix?

One more step towards #944

@ibrasho ibrasho self-assigned this Sep 24, 2017
@ibrasho ibrasho changed the title internal/gps: refactor prune and filesystem functions [WIP] internal/gps: refactor prune and filesystem functions Sep 24, 2017
@ibrasho ibrasho force-pushed the optimize-gps-prune branch 2 times, most recently from 4e92699 to 4db54a5 Compare September 24, 2017 16:46

// calculateFilesystemState returns a filesystemState based on the state of
// the filesystem on root.
func calculateFilesystemState(root string) (filesystemState, error) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using calculate a lot in this context but I'm not very comfortable with it. Any better suggestions?

Copy link
Collaborator

@jmank88 jmank88 Sep 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deriveFilesystemState, analyzeFilesystem, or even just newFilesystemState?

@sdboyer
Copy link
Member

sdboyer commented Sep 24, 2017 via email

jmank88
jmank88 previously approved these changes Sep 28, 2017
Copy link
Collaborator

@jmank88 jmank88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I wonder how the API would feel with these functions as methods on filesystemState instead.

I also noticed some of the logger args are unused, if you want to clean up while you're in here.

@ibrasho ibrasho force-pushed the optimize-gps-prune branch 3 times, most recently from b76ed0f to 4d7004b Compare September 29, 2017 09:05
@ibrasho
Copy link
Collaborator Author

ibrasho commented Sep 29, 2017

I wonder how the API would feel with these functions as methods on filesystemState instead.

Hmm... 🤔 This could look nicer, but I didn't want to couple filesystemState to prune. I started by creating a new struct with methods, but then opted to use the current filesystemState with some modification.

I also noticed some of the logger args are unused, if you want to clean up while you're in here.

I actually planned to use those to log all files that will be deleted. This was added in the last push.

But I'm a bit reluctant now about logging all these lines.
This will print a lot of lines for big projects. Which might make verbose output very verbose.

@ibrasho ibrasho force-pushed the optimize-gps-prune branch 11 times, most recently from c4d36a0 to 92e138a Compare September 30, 2017 08:09
sdboyer
sdboyer previously requested changes Oct 1, 2017
@@ -107,152 +111,141 @@ func pruneNestedVendorDirs(baseDir string) error {
return filepath.Walk(baseDir, stripVendor)
}

// pruneUnusedPackages deletes unimported packages found within baseDir.
// pruneVendorDirs deletes all nested vendor directories within baseDir.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why keep the commented funcs?

Copy link
Collaborator Author

@ibrasho ibrasho Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why this is a [WIP]. 😁

I'm trying to refactor strip vendor funcs to use the filesystemState.

for _, path := range toDelete {
logger.Printf(" * %s", path)

if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are we ordering the deletion of directories such that we don't leave unnecessary empty ones around, but still do it in the correct order so that directories are only removed once all of their contents have been removed?

Copy link
Collaborator Author

@ibrasho ibrasho Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the most annoying parts of pruning for me.

I opted to only delete files in the new prune functions, and empty directories are kept behind.
I think empty directories are a minor issue atm since git and hg ignore empty directories by default (not sure how bzr handles them).

We'll need to replace filesystemState with a hierarchical representation to allow determining which directories are empty efficiently or come up with a different solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to replace filesystemState with a hierarchical representation to allow determining which directories are empty efficiently or come up with a different solution.

If the paths are normalized, can the list just be sorted so that everything is deleted bottom-up? Then whenever the next item is a new dir, empty-check the previous.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dirs are not added to the list currently. The logic to determine when to add a dir to the list will be more involved with the current struct.

Copy link
Collaborator

@jmank88 jmank88 Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't need to be in the list for what I was describing. I just meant that if the next file is in a different dir then the previous file, then it was the last file to delete from that dir (assuming a proper normalized sort), so empty-check it (and possibly its parent, and so on). This doesn't account for directories which were already empty before any pruning, but that seems like a weird edge case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think empty directories as a minor issue atm

we may be able to skate by in an initial implementation without them because of git/hg's treatment of empty dirs, but i suspect we'll need to catch up rather quickly. beyond the general annoyance empty dirs will cause for folks, i suspect that correct implementation of #1113 will also demand that we do something a little more sophisticated here, anyway.

i don't have a clear implementation in mind yet, but i have some vague ideas swirling around regarding prefix-based iteration/exploration of a trie.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've 2 approaches in mind to do this:

  1. Loop through all dirs in the filesystemState struct and check if each dir is empty or not. This can be done once after all options are done, or after each option. I prefer doing it once at the end.
  2. After each option deletes its files, determine the touched dirs from the slice containing the toDelete files, and loop through only those dirs and check for emptiness.

The key operation that I'm trying to minimize is opening the file to use os.File.Readdirname.
I can keep looping over the slices to find files in a dir, but a prefix tree would make that easier. Giving filesystemState hierarchy could be even better and more specialized.

But I'm trying to resist the temptation to change filesystemState to a more complex struct that provides more functionality by finding a simpler solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just off the cuff, but maybe it's as simple as a sort function, where records represent dirs and their contents, and dirs that are subpaths of other dir records are moved to the front? that way you can proceed linearly through the sorted slice, assured that you will never visit a dir before any of its ancestors.

@sdboyer
Copy link
Member

sdboyer commented Oct 1, 2017

@jmank88 would you mind dismissing your review until this is no longer WIP?

We now delete anything that looks like a symlink if its called
"vendor" while pruning.

Hopefully, you didn't make a bad decision by relying on the magical
properties of symlinks.

Signed-off-by: Ibrahim AshShohail <ibra.sho@gmail.com>
@ibrasho
Copy link
Collaborator Author

ibrasho commented Dec 11, 2017

??? Legal files remain in some otherwise empty nested directories. For example with project github.com/golang/freetype (packages = [".","raster","truetype"]), vendor/github.com/golang/freetype/testdata/COPYING is not pruned, which is a change from the behavior of dep ensure && dep prune on master.

I discussed this with @sdboyer and this is the approach we want to keep just because we can never be sure about the scope of those files.

👎 Pruned packages leave behind an empty directory.

@ChrisHines This should be solved now. Could you please try? If this still doesn't keep some directories (and it doesn't fail in the process) I would really appreciate an example project where this occurs.

@ChrisHines
Copy link
Contributor

@ibrasho Pruning unused packages still leaves empty directories for pruned packages that are sub-directories of a bigger repository. Here is a tiny reproducer for you: https://github.com/ChrisHines/dep-test. If you clone that repo and run dep ensure with the current revision of this PR you should see many empty directories left behind.

When I do it on my machine I see the following empty directories under vendor:

  • vendor\github.com\ajstarks\svgo\float
  • vendor\github.com\golang\freetype\cmd
  • vendor\github.com\golang\freetype\example
  • vendor\github.com\llgcode\draw2d\output
  • vendor\github.com\llgcode\draw2d\samples
  • vendor\golang.org\x\image\cmd
  • vendor\golang.org\x\image\example
  • vendor\golang.org\x\image\font\gofont
  • vendor\golang.org\x\image\font\testdata
  • vendor\golang.org\x\image\webp
  • vendor\gonum.org\v1\plot\internal\cmpimg
  • vendor\gonum.org\v1\plot\palette\moreland
  • vendor\gonum.org\v1\plot\plotter
  • vendor\gonum.org\v1\plot\tools
  • vendor\gonum.org\v1\plot\vg\vgtex

Signed-off-by: Ibrahim AshShohail <ibra.sho@gmail.com>
@sdboyer
Copy link
Member

sdboyer commented Dec 12, 2017

can we feasibly add a test case to cover what's represented by @ChrisHines repository example there?

@ibrasho
Copy link
Collaborator Author

ibrasho commented Dec 12, 2017

@ChrisHines Thanks for the repo. Found the issue and fixed it now. Turns out the nested empty dirs were not handled properly.

@sdboyer Sure. The basic issue was very simple and a small example can cover it.

@ChrisHines
Copy link
Contributor

@ibrasho It looks like it's working perfectly now. I don't see any more problems.

Signed-off-by: Ibrahim AshShohail <ibra.sho@gmail.com>
@sdboyer
Copy link
Member

sdboyer commented Dec 15, 2017

wow. codeclimate has gotta go.

@ibrasho
Copy link
Collaborator Author

ibrasho commented Dec 15, 2017

@sdboyer Any update on the fix from their side?

@sdboyer
Copy link
Member

sdboyer commented Dec 15, 2017

no and now they've got all these additional issues that are much less clear to fix - e.g. the code duplication complaints for table based tests. I don't see there being an easy solution there; I'd much rather we just move back to something strictly for code coverage. I think I was using codecov before with gps.

@alexkay
Copy link

alexkay commented Dec 15, 2017

@ibrasho (and others!), than you so much for this PR! I'm moving a decently sized Go monorepo at $WORK from using godep to dep, and this change allowed us to keep the ./vendor diff minimal.

One edge case I'm seeing: Some larger libraries have license and legal files not only in the root, but also in the nested packages. We should probably delete such files if the nested package is not otherwise used.

An example library if you need it for troubleshooting is git.apache.org/thrift.git.

Gopkg.toml fragment:

[[constraint]]
  name = "git.apache.org/thrift.git"
  revision = "fe9222a6ec20d23d9cfd3ec9c793887f7212b313"

Gopkg.lock fragment:

[[projects]]
  name = "git.apache.org/thrift.git"
  packages = ["lib/go/thrift"]
  revision = "fe9222a6ec20d23d9cfd3ec9c793887f7212b313"

find vendor/git.apache.org/thrift.git -type f fragment, after running dep ensure:

vendor/git.apache.org/thrift.git/contrib/fb303/LICENSE
vendor/git.apache.org/thrift.git/debian/copyright
vendor/git.apache.org/thrift.git/LICENSE
vendor/git.apache.org/thrift.git/NOTICE
vendor/git.apache.org/thrift.git/lib/dart/LICENSE_HEADER
vendor/git.apache.org/thrift.git/lib/go/thrift/socket.go
vendor/git.apache.org/thrift.git/lib/go/thrift/exception.go
(many more .go files)
vendor/git.apache.org/thrift.git/lib/go/thrift/messagetype.go
vendor/git.apache.org/thrift.git/lib/hs/LICENSE

@sdboyer
Copy link
Member

sdboyer commented Dec 15, 2017

@alexkay so glad this PR is helping you make the jump! I realize it can be a bit annoying, but because there is no well-defined relationship between legal files and the code they are intended to cover, in order to err on the side of not pissing off lawyers unnecessarily, we're opting to just always keep those files.

@ibrasho ibrasho mentioned this pull request Dec 16, 2017
@carolynvs carolynvs changed the base branch from master to prune-release-prep December 18, 2017 18:32
@carolynvs
Copy link
Collaborator

I have updated the target branch for this PR to the prune-release-prep branch.

@sdboyer
Copy link
Member

sdboyer commented Dec 18, 2017

ok, we're good on this - pulling it into the release prep branch! 🎉 🎉

@sdboyer sdboyer merged commit bf5b9fc into golang:prune-release-prep Dec 18, 2017
@glerchundi
Copy link

such a great job!

@sdboyer which is the ETA for the next release including these changes?

@sdboyer
Copy link
Member

sdboyer commented Jan 5, 2018 via email

@glerchundi
Copy link

don't worry, would be a huge win so it's worth waiting! 👍

@TobiasBales
Copy link

Did this get released yet? I can't seem to find it

@sdboyer
Copy link
Member

sdboyer commented Feb 2, 2018

yes, it now happens automatically when you set appropriate values in Gopkg.toml. more info in release notes and on the new docs site and in the release announcement

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants