
Usage of uncompressed tarballs #541

Open
Daniel15 opened this issue Oct 7, 2016 · 16 comments

@Daniel15 (Member) commented Oct 7, 2016

Something to consider as a future enhancement, post-launch

Some people may want to store tarballs of all their dependencies in their source control repository, for example if they want a fully repeatable/reproducible build that does not depend on npm's servers. Storing compressed tarballs in Git or Mercurial is generally bad news. Every update to a package results in a new copy of the entire file in the repo, which can make the repo very large. Every time you clone the repo, the full history is transferred, including every previous version of all the packages, so even deleting the binary files has a lasting effect until you rewrite history to kill them.

Instead, we should try storing uncompressed tarballs (i.e. .tar files). Since the tar files of JS packages are mostly plain text, in theory Git/Mercurial should be able to more easily diff changes to the files when a new version of a module is added and an old version is removed, and store just the delta rather than an entirely new blob.

Related: This was implemented in Shrinkpack: JamieMason/shrinkpack#40 and JamieMason/shrinkpack@7b2f341#comments. According to the comments on the commit, this actually sped up npm install when shrinkpack implemented it, as npm no longer needed to decompress the archive every time. This makes sense since you're removing the overhead of gzip from the installation time.
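
For illustration, the one-time conversion this proposal implies could look like the following minimal Node sketch (the package filenames are hypothetical examples, not anything yarn ships):

```js
// Minimal sketch: gunzip a mirrored .tar.gz into a plain .tar once, so that
// later installs skip decompression and git/hg can delta the uncompressed file.
// The filenames are hypothetical examples.
const fs = require('fs');
const zlib = require('zlib');

fs.createReadStream('left-pad-1.1.3.tgz')
  .pipe(zlib.createGunzip())
  .pipe(fs.createWriteStream('left-pad-1.1.3.tar'))
  .on('finish', () => console.log('stored uncompressed tarball'));
```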

@bestander (Member) commented Oct 7, 2016

A few arguments from an internal discussion:

  • For Mercurial it makes no difference whether a file is binary or text (what about Git?). Nonetheless, larger binary files have a bigger negative impact on a source control system.
  • When a file changes, Mercurial tries to store only a diff, which is why large mutable files are better left uncompressed: there is a better chance of saving some space.
  • Files from npm that we store in source control are saved as package-x.y.z.tar.gz; they are immutable and never change, so the optimisation above never kicks in.
  • For example, a full React Native node_modules is 200 MB and 37,000 files when installed, but in the mirror we store only 800 files totalling 25 MB, with most .tar.gz files around 100 KB. That was considered fine for the Mercurial monorepo we have at FB.

That said, we can't deny the speed improvement of unpacking uncompressed tars, so there may be a reason to consider this feature.

@sebmck changed the title from "Investigation: Usage of uncompressed tarballs" to "Usage of uncompressed tarballs" on Oct 10, 2016
@joncursi commented Nov 25, 2016

+1

shrinkpack has become a huge part of our development workflow. When packages are upgraded and the build is "shrinkpacked", individual tar files are created only for the new packages, and the outdated versions are automatically dropped. That's because the names of the resulting .tar files are a function of the package versions. Here's a short snapshot of what a node_shrinkwrap directory would look like:

[screenshot: node_shrinkwrap directory listing]

You can follow the git history on this directory to figure out which dependencies were upgraded and when, e.g. react-native-animatable in this example...

[screenshot: git history for the node_shrinkwrap directory]

...with quick and easy access to the backup:

[screenshot: the backed-up react-native-animatable tarball]

With shrinkpack, the diffs in GitHub closely reflect the commit message and the actual changes being made. Committing and pushing the result of a new shrinkpack is a better experience, IMO, than doing the same after a yarn pack, because, as mentioned, changes are handled at the package-version level rather than the repository-version level. So you're only pushing up individual .tar files, which is fast, especially if you're using Git LFS, and you don't need to touch your package.json version number at all.

@bestander (Member):

@joncursi, we have an offline mirror feature that does what you want: https://yarnpkg.com/blog/2016/11/24/offline-mirror.
The only thing missing is cleanup, which we skip on purpose because the tarball storage can be shared by multiple projects.
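
For reference, the blog post enables the mirror with a one-line .yarnrc entry (the directory name is the example used there):

```
yarn-offline-mirror "./npm-packages-offline-cache"
```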

@joncursi commented Nov 25, 2016

@bestander very cool, thank you for sharing that blog post. I didn't catch this feature when reading the CLI docs. It would be a lovely addition to https://yarnpkg.com/en/docs/cli/config

I use shrinkpack locally in each project, rather than globally across multiple projects. I would like to do the same with yarn, which would require old tar files to be removed when packages are upgraded. I only care about maintaining the latest working version of each package; if I need to dig up an older package version, it's always there in the git history. But I don't need or want to store it directly in the mirror forever.

My use case is to implement the mirror less for offline purposes and more for maintaining a concise set of package backups in case packages are suddenly unpublished from npm. Risk control. As far as I know, that was largely the intent behind shrinkpack in the first place.

Is there a smarter way to automate package removal from the mirror when a new package version is added? Perhaps a config option in .yarnrc to specify this (feature request)? ATM it seems I have to manually do...

yarn add package@new-version && rm -rf yarn_mirror/package@old-version

Also, the same issue presents itself when removing a package from use in the repo entirely...

yarn remove package && rm -rf yarn_mirror/package-*

@bestander (Member):

@joncursi, this is a bit off-topic for this issue; it would be better to start an RFC discussion about what is needed.

As for the cleanup, it can be a 10-line JS/bash script you can run alongside yarn until we implement it. The script should:

  • remove all files in the offline mirror that are not present in the yarn.lock file (see the rough sketch below)
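
A rough sketch of such a script, assuming the mirror lives at ./npm-packages-offline-cache and that mirror filenames match the basenames of the `resolved` URLs in yarn.lock (scoped packages may need extra handling):

```js
// Rough sketch: prune every file in the offline mirror that yarn.lock no
// longer references. The mirror path and the filename-matching assumption
// are illustrative; adjust to your yarn-offline-mirror setting.
const fs = require('fs');
const path = require('path');

const mirrorDir = './npm-packages-offline-cache';
const lockfile = fs.readFileSync('yarn.lock', 'utf8');

// Collect basenames of every `resolved` tarball URL, dropping the #hash part,
// e.g. resolved "https://registry.yarnpkg.com/left-pad/-/left-pad-1.1.3.tgz#abc"
const wanted = new Set();
for (const line of lockfile.match(/resolved "[^"]+"/g) || []) {
  const url = line.slice('resolved "'.length, -1);
  wanted.add(path.basename(url.split('#')[0]));
}

for (const file of fs.readdirSync(mirrorDir)) {
  if (!wanted.has(file)) {
    fs.unlinkSync(path.join(mirrorDir, file));
    console.log('pruned', file);
  }
}
```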

@Daniel15 (Member, Author):

This issue is specifically for switching from compressed (.tar.gz) to uncompressed (.tar) tarballs; anything else should be discussed in a separate task 😄

@bfricka commented Jun 7, 2017

From an implementation standpoint, what sort of risks and level of effort would you foresee simply by making this a flag that you can pass to the CLI? Shrinkpack is written so that uncompressed tarballs are the default, but you can opt into compressed packages with a flag. What would the impact be for simply implementing the inverse behavior (opt-in to uncompressed with a flag)?

It seems like this would address the issue of potentially unpleasant changes for those already using the offline mirror to commit modules locally, while allowing the uncompressed behavior for those who don't mind aliasing a couple of yarn commands.

Edit: Even more simply, the flag could just be defined in the .yarnrc

This is actually the main thing preventing us from switching to yarn, as yarn already admirably solves the determinism issue, and the offline mirror feature (thanks for the link, btw!) takes care of the rest. However, it leaves us with the undesirable (from our perspective) situation of committing binary packages. In our experience, Git does very well with plain tar files: most updated packages are recognized as renames with tiny deltas, and Git's own pack compression does the rest. Thus, the actual bandwidth used is dramatically lower.

@bestander (Member) commented Jun 7, 2017 via email

@Daniel15 (Member, Author) commented Jun 7, 2017 via email

@bestander (Member) commented Jun 7, 2017 via email

@webuniverseio:

Hi @bestander, we use git with Bitbucket & npm + shrinkwrap on some projects. Here is what it looks like when the minor version of a package's tar changes:
[screenshot: git showing the updated tar tracked as a rename]

Here are sample tar files for the package from the screenshot, which was tracked as a rename:
tars.zip

Thanks

@Daniel15 (Member, Author) commented Jun 7, 2017

> Although someone needs to show that this advanced mercurial/git tracking would happen on a real example then before we consider this change, right?

I've been meaning to test it out, I just haven't had time to do so.

@bfricka commented Jul 5, 2017

Hey there! It's been a while, and since you're busy, I thought I'd make this as painless as possible.

Check out this shrinkpack tar proof of concept

@bestander (Member):

This seems like a reasonable idea after all.

So how would it work?

  1. (if file is missing in offline mirror) download tar.gz from registry
  2. unzip and copy tar file to offline mirror
  3. unpack tar to cache
  4. if prune-offline-mirror is enabled, and a tarball of a package was added to the offline mirror while another version was removed, then register the add/remove with git/hg mv

Results:
A. Potential CPU wins because step 2 will be skipped when installing from offline mirror.
B. Space wins if tarball contents are similar at step 4
C. Checking in unzipped tarballs has a negative impact on repo size (uncompressed tars are larger than .tar.gz)
D. Step 4 seems a bit complex with all sorts of edge cases

So if A + B > C + D, then why not?
A, B and C can be measured, although D is subjective; a rough sketch of steps 1-3 follows below.
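
To make steps 1-3 concrete, a minimal Node sketch under stated assumptions (illustrative package URL and directory names, no error handling, step 4's git/hg bookkeeping left out):

```js
// Sketch of steps 1-3: download the .tar.gz, store it gunzipped in the
// mirror, then unpack the plain .tar into the cache. Assumes offline-mirror/
// and cache-dir/ exist; the package URL and filenames are illustrative.
const fs = require('fs');
const https = require('https');
const zlib = require('zlib');
const { execSync } = require('child_process');

const mirrorTar = 'offline-mirror/left-pad-1.1.3.tar';

function populateMirror(done) {
  if (fs.existsSync(mirrorTar)) return done(); // step 1: skip download if already mirrored
  https.get('https://registry.yarnpkg.com/left-pad/-/left-pad-1.1.3.tgz', res => {
    res.pipe(zlib.createGunzip())               // step 2: store only the plain .tar
       .pipe(fs.createWriteStream(mirrorTar))
       .on('finish', done);
  });
}

populateMirror(() => {
  // step 3: no gunzip needed here, which is where the repeated-install CPU win lives
  execSync(`tar -xf ${mirrorTar} -C cache-dir`);
});
```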

@bfricka commented Oct 5, 2017

Bumpity bump! I can work on this if you guys want?

@bestander (Member):

@bfricka, of course, give it a try.
We would need to see a few real-life examples showing the impact this feature provides, though.
