Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Binary-Artifact and Pinned-Dependencies kill Scorecard in a repo with large files #3831

Open
pnacht opened this issue Jan 30, 2024 · 7 comments

Comments

@pnacht
Copy link
Contributor

pnacht commented Jan 30, 2024

Describe the bug
Running Scorecard with Binary-Artifact and/or Pinned-Dependencies on a repo with large files crashes entirely.

Reproduction steps
I stumbled on this while trying to run Scorecard on a local clone of a HuggingFace model repository.

Steps to reproduce the behavior:

# Requires LFS support; ~6GB download!
$ git clone https://huggingface.co/microsoft/phi-2

# same outcome with each check individually, too
$ scorecard --local ~/phi-2 --checks Binary-Artifacts,Pinned-Dependencies
Starting [Binary-Artifacts]
Starting [Pinned-Dependencies]
signal: killed

Deleting the very large files (including the .git folder), the checks pass. (There may be other checks that would also fail, I only tested those that run with --local)

Expected behavior
The checks should work even with large files.

As described below, Binary-Artifacts doesn't need to load the entire file, and it's unlikely an actual script will ever be big enough to be a problem.

Additional context
I believe I understand why these checks are failing: both have at least one function (BinaryArtifacts and collectShellScriptInsecureDownloads) that runs fileparser.OnMatchingFileContentDo with Pattern: "*" (i.e. all files).

As the function name implies, this function sequentially opens and loads all matching files. I assume one of the files was simply too large.

This should be fixable, though:

  • BinaryArtifacts uses fileparser.OnMatchingFileContentDo to call checkBinaryFileContent. That loads the file and then uses https://github.com/h2non/filetype to determine the file's type. This can be replaced by:
    • Simply trusting a file's extension, and only checking its contents if it has no extension (since it could be anything)
    • Replacing fileparser.OnMatchingFileContentDo with a function that only loads the first 262 bytes h2non/filetype needs
  • Pinned-Deps' collectShellScriptInsecureDownloads could be set to only run on files with common script extensions (i.e. .sh, .bash, .ps, and no extension).
@spencerschrock
Copy link
Contributor

spencerschrock commented Jan 31, 2024

Note: there's also some checks that use fileparser.OnAllFilesDo, which more of the checks use:

  • Dependency-Update-Tool
  • License
  • Security-Policy

Replacing fileparser.OnMatchingFileContentDo with a function that only loads the first 262 bytes h2non/filetype needs

Similarly, we check the first 1024 bytes in another part.

// determines if the first 1024 bytes are text
//
// A version of golang.org/x/tools/godoc/util modified to allow carriage returns
// and utf8.RuneError (0xFFFD), as the file may not be utf8 encoded.
func isText(s []byte) bool {
const max = 1024 // at least utf8.UTFMax

I can't say i've been a fan of the read it all at once aspect, instead of using io.Reader and io.Writermore. I wonder if there's something we can do instead. Although the []byte runs deep throughout the code, including the RepoClient interface.

type RepoClient interface {
InitRepo(repo Repo, commitSHA string, commitDepth int) error
URI() string
IsArchived() (bool, error)
ListFiles(predicate func(string) (bool, error)) ([]string, error)
// Returns an absolute path to the local repository
// in the format that matches the local OS
LocalPath() (string, error)
GetFileContent(filename string) ([]byte, error)

@spencerschrock
Copy link
Contributor

When profiling the weekly cron with pprof, file i/o is one of our largest chunks of time.

Not every GetFileContent needs the whole file, sometimes we don’t need any of it. So the io.Reader part allows the callers to choose how much of the file to use (whether that's all of it via io.ReadAll, 1KB, none, etc)

Additionally, io.ReadCloser is a better choice, so the caller can close it when they’re done.

GetFileContent(filename string) (io.ReadCloser, error)

There are a lot of existing usages of GetFileContent, which would need changed, but @adg offered an incremental, non-breaking way of introducing the change.

first add [a new] method to the concrete client implementations, then you can have a separate interface (even defined internally) that you test for with a type assertion, and use the new method when it is available

type fileReader interface {
  ReadFile(string) (io.ReadCloser, error)
}

fr, ok := client.(fileReader)
if ok {
  // use the new method fr.ReadFile
} else {
  // use client.GetFileContent
}

@spencerschrock spencerschrock added this to the v5 milestone Feb 28, 2024
@spencerschrock
Copy link
Contributor

@pnacht I didn't see a crash on my machine for the repo, but my VM may have more resources. Does this prototype branch eliminate the crash for Binary-Artifacts?

https://github.com/spencerschrock/scorecard/tree/reader-partial-interface

@pnacht
Copy link
Contributor Author

pnacht commented Feb 29, 2024

Nope. But I noticed you added the new behavior to the githubrepo client, but this is a local repo, so I believe it runs on the localdir client?

@spencerschrock
Copy link
Contributor

doh! pushed another commit. I'm noticing a significant speedup now

@pnacht
Copy link
Contributor Author

pnacht commented Mar 1, 2024

Yep, just ran it on my localdir and it runs!

@spencerschrock
Copy link
Contributor

The breaking change was already made in #3912, so removing from v5 milestone, but there are still some callers to fileparser.OnMatchingFileContentDo so parts of the issue are still applicable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backlog - New Checks
Development

No branches or pull requests

2 participants