Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Go modules with LFS committed files #10052

Merged
merged 1 commit into from
Jul 15, 2024

Conversation

danielorbach
Copy link
Contributor

@danielorbach danielorbach commented Jun 20, 2024

What are you trying to accomplish?

My company commits large files using Git LFS which happen to be part of our Go module source tree.
Without git-lfs installed, Dependabot's local git installation does not automatically resolve those files, causesing "checksum mismatch" against the committed go.sum.

Git LFS support is important when downloading Go modules directly from git repositories. Direct download is more common for private repositories. It is also a valid method to fetch module archives for public repositories, although those are often downloaded from the module-proxy.

Anything you want to highlight for special attention from reviewers?

While Git LFS may be relevant for other ecosystems, I've been heavily occupied with Go for the past several years. Thus, I'm unable to offer a qualified solution for all ecosystems. Nonetheless, I think this solution is very relevant for companies that rely on GitHub to host their internal codebase.

How will you know you've accomplished your goal?

While dependabot correctly detects a new version is available, when the repo commits large-files, the generated pull-request contains the wrong module hash, rendering it effectively meaningless.

I've used the following snippets to prove, and consistently test the problem and its solution:

$ git clone https://github.com/danielorbach/dependabot-go_modules.git lfs-enabled
$ go run github.com/vikyd/go-checksum@latest lfs-enabled github.com/danielorbach/dependabot-go_modules@latest
directory: lfs-enabled
{
        "HashSynthesized": "3abfcf88a6297313f01eb913c7d2a44d3a2302e4c09324bb59b411057f77ed76",
        "HashSynthesizedBase64": "Or/PiKYpcxPwHrkTx9KkTTojAuTAkyS7WbQRBX937XY=",
        "GoCheckSum": "h1:Or/PiKYpcxPwHrkTx9KkTTojAuTAkyS7WbQRBX937XY="
}
$ export GIT_LFS_SKIP_SMUDGE=1 # This env disables the effects of Git LFS
$ git clone https://github.com/danielorbach/dependabot-go_modules.git lfs-disabled
$ go run github.com/vikyd/go-checksum@latest lfs-disabled github.com/danielorbach/dependabot-go_modules@latest
directory: lfs-disabled
{
        "HashSynthesized": "61e260ad4da7812fd89bcaebdcdcc9121098115b05bc97d3c914b25d69aa0588",
        "HashSynthesizedBase64": "YeJgrU2ngS/Ym8rr3NzJEhCYEVsFvJfTyRSyXWmqBYg=",
        "GoCheckSum": "h1:YeJgrU2ngS/Ym8rr3NzJEhCYEVsFvJfTyRSyXWmqBYg="
}

We are interested in the value of GoCheckSum field because that is the value that gets written to go.sum.

Notice how the value changed when Git LFS is disabled.

P.S. I've created danielorbach/dependabot-go_modules to demonstrate the scenario. A more elaborate setup can be made to incorporate dependabot.

Checklist

  • I have run the complete test suite to ensure all tests and linters pass.
  • I have thoroughly tested my code changes to ensure they work as expected, including adding additional tests for new functionality.
  • I have written clear and descriptive commit messages.
  • I have provided a detailed description of the changes in the pull request, including the problem it addresses, how it fixes the problem, and any relevant details about the implementation.
  • I have ensured that the code is well-documented and easy to understand.

@danielorbach danielorbach marked this pull request as ready for review June 24, 2024 09:00
@danielorbach danielorbach requested a review from a team as a code owner June 24, 2024 09:00
@github-actions github-actions bot added the L: go:modules Golang modules label Jun 24, 2024
@jurre
Copy link
Member

jurre commented Jun 24, 2024

Note that git lfs is already installed:

However, we disable pulling all lfs files by default:

ENV GIT_LFS_SKIP_SMUDGE=1

I don't think that we'd want to enable pulling large files for all repo's by default, as it would result in a lot of unnecessary data being pulled in.

We make some exceptions around pulling in large files for Yarn Berry for example:

# Overridden to pull any yarn data or plugins which may be stored with Git LFS.
sig { override.returns(String) }
def clone_repo_contents
return @git_lfs_cloned_repo_contents_path unless @git_lfs_cloned_repo_contents_path.nil?
@git_lfs_cloned_repo_contents_path ||= T.let(super, T.nilable(String))
begin
SharedHelpers.with_git_configured(credentials: credentials) do
Dir.chdir(@git_lfs_cloned_repo_contents_path) do
cache_dir = Helpers.fetch_yarnrc_yml_value("cacheFolder", "./yarn/cache")
SharedHelpers.run_shell_command("git lfs pull --include .yarn,#{cache_dir}")
end
@git_lfs_cloned_repo_contents_path
end
rescue StandardError
@git_lfs_cloned_repo_contents_path
end
end

Perhaps we could explore something similar here? Tbh, I have not seen lfs used in go modules that much, we support a fairly large number of go users and I think this is the first time its come up. What exactly are you pulling in that requires lfs? I would think that typically go modules are just source code and do not require lfs to be enabled for those paths?

@danielorbach danielorbach changed the title Main Support Go modules with LFS committed files Jun 24, 2024
@danielorbach
Copy link
Contributor Author

danielorbach commented Jun 25, 2024

Hi @jurre !

I did not notice that about the core image. I've tracked1 that change all the way back to commit 35ea619 as part of #5833 to resolve the thread that begins in #1297 (comment).

That solution would not work for us Gophers because Dependabot scripts have nowhere to intervene while the Go toolchain clones repositories and computes their hashes. The only way I see forward is to override the previously set ENV in the Docker image to GIT_LFS_SKIP_SMUDGE=0 sometime before invoking go commands.

Preferably, I'd do that for all runs of the go_modules ecosystem. As you say, the vast majority of clients for Dependabot do not have LFS installed, so they would not be affected by this "global" change. Furthermore, I don't believe this change can hurt any Go module that had committed large files because golang/go#42965 had not been merged yet (I did not see any work had begun on it either).
Just to be clear, allow me to recap my hypothesis: Had any Go module been using Dependabot and committing files to LFS2, they would enjoying a fast dependency scan but get their hashes wrong, in which case the opened pull-requests would contain wrong hashes. So, correcting the behaviour of Dependabot just for repositories with LFS and Go modules would either fix problems those users were unaware of, or that they were aware of and now are finally resolved.

I have never coded in Ruby, but I can try and copy-paste some configuration check with the help of AI to exclude large repositories that would otherwise suffer performance degradation3. Just point me at a similar commit in the go_modules ecosystem.

Thank you for your swift reply 🦃
Let me know how you want to proceed 🌮

Footnotes

  1. with git log -M -C -C -C -S'GIT_LFS_SKIP_SMUDGE' (takes a while to run)

  2. given that those large files are under the go.mod directory tree

  3. though as you say, this is the first time it's come up, so I doubt that is worth the effort (but you are the boss 🤓)

@jurre
Copy link
Member

jurre commented Jun 25, 2024

Would turning the FLS setting on globally not cause every module not using lfs to have wrong hashes?

@danielorbach
Copy link
Contributor Author

Would turning the FLS setting on globally not cause every module not using lfs to have wrong hashes?

No, that is not the case. Allow me to elaborate on that answer:

All modules (whether with or without files committed to LFS) compute hashes in the same way: hashing all the files in the directory tree that has go.mod in its root.
That is why every module not using lfs is not affected by turning the LFS filters on or off.

@danielorbach
Copy link
Contributor Author

@jurre If you're unhappy with the solution, I'd like to answer further questions and iterate on it. Otherwise, I've synched this PR with the latest upstream main in preparation for merging.

Note, I clicked the "Sync" button on GitHub which merges. Alternatively, I can rebase if that is your preferred method.

@danielorbach
Copy link
Contributor Author

Oops my bad, I did not push the updated solution 😦

I've committed a version that automatically pulls large-files on the go_modules ecosystem. Let me know what you think about it.

@jurre
Copy link
Member

jurre commented Jul 9, 2024

Hey apologies, I'd missed the mention.

Doing it for only go is definitely an improvement. And so if I understand correctly, we need basically everything that's nested in root folder (as far as Dependabot is concerned) to pull in large files, not necessarily only the vendor folder?

Basically, if I have some 5GB video committed to the repository in say an assets/ folder, for whatever reason, it affects how the hashes are computed for that repository?

@sachin-sandhu
Copy link
Contributor

Note: Don't merge this pull request, we are monitoring the release.

Copy link
Contributor

@sachin-sandhu sachin-sandhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for monitoring (no merging)

@danielorbach
Copy link
Contributor Author

Doing it for only go is definitely an improvement. And so if I understand correctly, we need basically everything that's nested in root folder (as far as Dependabot is concerned) to pull in large files, not necessarily only the vendor folder?

Yes.

Basically, if I have some 5GB video committed to the repository in say an assets/ folder, for whatever reason, it affects how the hashes are computed for that repository?

Exactly.


Just to be thorough: in a lot of time from today, golang/go#42965 will be merged and released as a new feature of the Go toolchain. When that happens, Go modules (that is, go.mod files in a repository) will be able to exclude some files (for example, those 5GB videos) from the computation that this PR references. However, given that until today you got absolutely zero complaints about miscalculated hashes, I conclude that there aren't any users with such large files in Go repositories, so none would suffer the cost (bandwidth or time quotas) of pulling large files in Dependabot's Go repositories.

Another clarification on modules with go.mod nested deeper in the repository: If a repository has large files, but are outside of the directory tree that contains the go.mod file, the Go toolchain will not pull them. For example, in a repo with a file in path file.bin and a Go module in path some/nested/go.mod, the file file.bin is never downloaded from the Git server by the Go toolchain.

Turning GIT_LFS_SKIP_SMUDGE on enables automatic pulling by default.
This default is respected by the Go toolchain when downloading module
sources directly from git servers (e.g. GitHub).

Git LFS support is important when Go modules embed large files using
go:embed. Direct download is more common for private repositories but
it is also possible for public repositories (which are often downloaded
from the module-proxy).

The Go toolchain uses the local git installation to fetch archives of
modules from Git servers directly. Some companies commits large files
using Git LFS which happen to be part of their Go module source tree.
Without git-lfs installed, the local git installation does not
automatically resolve those files which causes "checksum mismatch"
against the committed go.sum.
@sachin-sandhu sachin-sandhu merged commit 95096a6 into dependabot:main Jul 15, 2024
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
L: go:modules Golang modules
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants