Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for speedier, accurate git+file:// links #5825

Open
jonahbeckford opened this issue Feb 7, 2024 · 9 comments
Open

Proposal for speedier, accurate git+file:// links #5825

jonahbeckford opened this issue Feb 7, 2024 · 9 comments

Comments

@jonahbeckford
Copy link
Contributor

jonahbeckford commented Feb 7, 2024

I came across https://stackoverflow.com/a/50059607/21513816 and have started to use it in my own projects.

I think it will help people who are surprised that you have to commit things for the files to be visible. Or they use the working-dir option which for me is insanely expensive when it encounters build directories, local opam switches, etc.

Basically do:

  1. git ls-files --exclude-standard -oi --directory gets a list of files and directories that should be excluded based on the .gitignore scattered throughout the project. The command has a nonzero time cost but is reasonably small.
  2. Either:
    • Convert the list of exclusions into regex literals (ex. .ci/ becomes [.]ci/). And then use those regex literals in rsync or whatever copying program that is active inside opam -OR-
    • Do the copying completely in OCaml which can trivially respect the path exclusions from step 1.

(I'm sure someone else already thought of this, but I couldn't find an issue. Feel free to close if duplicate or unwise)

@rjbou
Copy link
Collaborator

rjbou commented Feb 8, 2024

Working directory option is not supposed to copy git directories, build, or opam switches. If it does, it is a bug.

What we want for working-dir is to retrieve current directory layout: copy new files, modified files, and remove deleted ones. No matter if it is versioned/committed or not.

For context, the current syncing mechanism for working-dir is:
0. we need to sync git controlled directory A in internal opam directory B

  1. perform the git synchronisation: fetch A and update it in B
  2. delete in B all files that are not present in A
  3. construct a rsync-file that contains:
    • files in A - versioned files in A + modified/added files in A - excluded files
      which means roughly
    • ls in A - git ls-files in A + git status | grep 'M|A|R|C' in A - ["_opam"; "_build"; ".git"; "_darcs"; ".hg"]
  4. perform the rsync --file-from=rsync-file A/ B/

We don't clean B directory between 2 synchronisation, like that the number of files to copy (using git or rsync) is smaller.

To prepare step 2, we retrieve the full list of files and directories in A, without exclusion list. We read then all files in _build and _opam, even thought we don't need any information from these directories.

We can study doing that based on gitignore, if it handles all cases.

@dbuenzli
Copy link
Contributor

dbuenzli commented Feb 8, 2024

Working directory option is not supposed to copy git directories

Are you sure ?

In pin mode the build system may expect to have the VCS in place if only to be able to git describe --dirty and watermark the build accordingly.

Otherwise you will need to have variables that describe the kind of pin and use them in build instructions, which I personally would find micro managing, if not annoying: %{dev}% is fine for that and covers both git packages and pins. I would hate to have a new case to treat.

@rjbou
Copy link
Collaborator

rjbou commented Feb 8, 2024

Working directory option is not supposed to copy git directories

Are you sure ?

Sorry, my sentence wasn't clear. Working dir is not supposed to copy via rsync git, build, and opam directories (see exclusion list). It begins with a git sync, so the .git directory is present in opam internal source of the package.

@jonahbeckford
Copy link
Contributor Author

jonahbeckford commented Feb 8, 2024

Working directory option is not supposed to copy git directories, build, or opam switches. If it does, it is a bug.

... excluded files which means roughly ... ["_opam"; "_build"; ".git"; "_darcs"; ".hg"]

TLDR: Today's algorithm excludes _opam and _build directories; clearly the intent is to exclude build directories. But _opam and _build is just a heuristic that is subject to false negatives. And expanding the heuristic can lead to false positives. Today's algorithm has an unfixable design flaw.

False negatives

  • I maintain dkml-workflows so that maintainers have an MSVC alternative to setup-ocaml (among other things). It is the first thing I recommend maintainers use to see if they have a problem with native Windows. dkml-workflows has a local development mode to simulate what will be happening inside GitHub Actions and GitLab CI on your desktop; huge productivity saver! That mode places a ton of large files into the .ci directory. But:
    • .ci is not in the exclusion list. opam could not have anticipated .ci.
  • I do a lot of mixed language development. A lot of that is just OCaml/C FFI because that is what is most often broken on Windows. And I also do a lot of Java + OCaml. But:
    • More likely than not these days an external C library uses CMake as its build system. I can either compile directly using CMake (the conventional name of the CMake build directory is build, and conventional variants use build_*) or get a prepackage build from vcpkg (which has its own set of build directories). opam could not have anticipated build, build_* and installed.
    • For Java, Gradle has huge build and .cxx (Android Studio) directories scattered throughout the project. Ditto for Maven and its target directories. opam could not have anticipated build, .cxx and target.
  • I can't tell you how often I do wget xxx.tar.gz into a source directory. Since I work on tens of projects at once, I like keeping all relevant assets (source archives, docs, release binaries, etc.) inside their respective project folders. That helps me keep my sanity. But:
    • I tuck these assets inside the project root folder (so they are visible reminders I have to do something) or sometimes in directories defined by the project owner's .gitignore. opam could not have anticipated that .tar.gz (etc.) assets are perfectly fine when doing local development.

False positives

Let's pretend opam accepts the build and target as file patterns to exclude. But build and target are completely valid directory names for source code! We'd be excluding legitimate source code.

Case Study from your own projects

For a git based project, the project owner has already expressed what needs to be excluded inside .gitignore files. In a pure OCaml project (here is one from @dbuenzli at https://github.com/dbuenzli/qrc/blob/master/.gitignore and another from @rjbou at https://github.com/rjbou/orb/blob/master/.gitignore) we all exclude Dune build directories. More complex projects exclude more.

I think the qrc one is interesting because it covers both the false negative and the false positive case:

  1. _b0 is a false negative.
  2. _opam is missing (a false positive). Apparently, Daniel you don't use local opam switches. But that is fine ... if I'm doing a local pin of one of your packages I would just add _opam/ to your .gitignore locally. Not because opam supported .gitignore but because my IDE (Visual Studio Code, etc.) would be a mess if I didn't.

Conclusion

Let's just use the .gitignore exclusions. It expresses the will of the project owner and is easy to amend by a local opam pinner. In contrast, waiting 6 months to a year for an opam maintainer to add an exclusion is not right.

Also, the approach I mentioned is simpler to implement and describe. Just rsync everything except what git tells you to ignore. Notice that there is no need to handle the .git/ edge case; git does not ignore .git/, nor should opam!

Also, the --working-dir option can disappear, reducing the user (negative) surprise they have today.

@dbuenzli
Copy link
Contributor

dbuenzli commented Feb 8, 2024

Also, the --working-dir option can disappear, reducing the user (negative) surprise they have today.

I'm not sure to what you refer to here. But having to commit things to git so that "they become visible" is the right default.

The negative surprise is when you have pins a bit everywhere you checkout another branch of you repo, go somewhere else that happens to need an opam update and has a pin on that repo and everything breaks havoc because your pins are all relying on global mutable state.

@jonahbeckford
Copy link
Contributor Author

But having to commit things to git so that "they become visible" is the right default.

You succinctly described the surprise. Real scenario that has happened to me: You spend hours thinking your changes should have fixed a bug but didn't. Then miraculously you find out (how?) you should be committing things to git for them to be visible to the build system. Let's repeat that so it is clear: you have to commit changes to source control in your local projects for them to be visible to the build system. I would be willing to bet you a coffee and a donut that doing a survey of developers asking "would you be surprised with that italicized statement?", at least 75% of them would say "yes".

That is today's default. At least somewhere opam gives a warning (if you are paying attention to warnings in a voluminous stream of logs) that you may want to use the non-default --working-dir. For me that leads to the second surprise ... the --working-dir exclusions are hardcoded.

My hope is the simplified UX in this proposal gets rid of both surprises.

The negative surprise is when you have pins a bit everywhere you checkout another branch of you repo, go somewhere else that happens to need an opam update and has a pin on that repo and everything breaks havoc because your pins are all relying on global mutable state.

Completely agree. And I hope that is tracked somewhere.

@dbuenzli
Copy link
Contributor

dbuenzli commented Feb 8, 2024

Let's repeat that so it is clear: you have to commit changes to source control in your local projects for them to be visible to the build system. I would be willing to bet you a coffee and a donut that doing a survey of developers asking "would you be surprised with that italicized statement?", at least 75% of them would say "yes".

Except you are not committing changes to make them available to the build system. You are committing changes to make them available to the install base of your package manager.

Completely agree. And I hope that is tracked somewhere.

No need to track this, this is what the default pins give you. We used to have --working-dir as a default and it was a disaster: it does not scale. It would either break your switches in spectacular ways or worse, they would silently pickup changes you did not want to be picked up.

@jonahbeckford
Copy link
Contributor Author

You are right. I was being sloppy with the terminology of build system versus package manager. But I would still bet you a coffee and a donut with a reworded survey.

@dbuenzli
Copy link
Contributor

dbuenzli commented Feb 8, 2024

Of course users always think they know what they want, like faster horses. But they rarely understand all the implications of what they ask for (and the perils of global mutable state :-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants