Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code hotspots #2412

Closed
spencerschrock opened this issue Oct 31, 2022 · 5 comments · Fixed by #2433
Closed

Code hotspots #2412

spencerschrock opened this issue Oct 31, 2022 · 5 comments · Fixed by #2433

Comments

@spencerschrock
Copy link
Contributor

A significant portion (5.5%) of the cron job is spent in raw.isText

func isText(content []byte) bool {
for _, c := range string(content) {
if c == '\t' || c == '\n' || c == '\r' {
continue
}
if !unicode.IsPrint(c) {
return false
}
}
return true
}

There's some experimental code that also does this, and has better performance. I plan to import/copy the implementation over.

// IsText reports whether a significant prefix of s looks like correct UTF-8;
// that is, if it is likely that s is human-readable text.
func IsText(s []byte) bool {
	const max = 1024 // at least utf8.UTFMax
	if len(s) > max {
		s = s[0:max]
	}
	for i, c := range string(s) {
		if i+utf8.UTFMax > len(s) {
			// last char may be incomplete - ignore
			break
		}
		if c == 0xFFFD || c < ' ' && c != '\n' && c != '\t' && c != '\f' {
			// decoding error or control character - not a text file
			return false
		}
	}
	return true
}

Should be an easy performance win, but would change the behavior slightly as it only checks the first 1024 bytes.

Since you implemented the heuristic, thoughts @laurentsimon ? I know the code was introduced to reduce false positives, so should still be valid for that.

Benchmarking shows a 2x speedup on small files (1KiB). 6x speedup on medium files (3KiB). And on gigantic files (like our 50MB projects.csv file), the difference is 5-6 orders of magnitude.

cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkIsText-8                     79         295087760 ns/op
BenchmarkExternalIsText-8       13462120              1763 ns/op
@laurentsimon
Copy link
Contributor

laurentsimon commented Nov 1, 2022

Do you know why it is faster? What's the source of slowdown in the current version?
Looking at the first 1k byte would be OK if we have data (from cron) that confirms it does not change the results?

Btw, how did you perform the benchmark?

/cc @jeffmendoza

@jeffmendoza
Copy link
Member

1k seems good to me.

@spencerschrock
Copy link
Contributor Author

Do you know why it is faster? What's the source of slowdown in the current version?

  1. We read the whole file instead of first 1KiB.
  2. We call unicode.IsPrint, instead of doing the boolean expression inline which is expensive according to the profiling.
(pprof) list isText
Total: 12.68s
ROUTINE ======================== github.com/ossf/scorecard/v4/checks/raw.isText in github.com/ossf/scorecard/v4/checks/raw/binary_artifact.go
      80ms      700ms (flat, cum)  5.52% of Total
         .          .    166:	return true, nil
         .          .    167:}
         .          .    168:
         .          .    169:// TODO: refine this function.
         .          .    170:func isText(content []byte) bool {
      40ms      200ms    171:	for _, c := range string(content) {
      40ms       40ms    172:		if c == '\t' || c == '\n' || c == '\r' {
         .          .    173:			continue
         .          .    174:		}
         .      460ms    175:		if !unicode.IsPrint(c) {
         .          .    176:			return false
         .          .    177:		}
         .          .    178:	}
         .          .    179:	return true
         .          .    180:}

Looking at the first 1k byte would be OK if we have data (from cron) that confirms it does not change the results?

Might be hard to test that ahead of time, especially as the repos could change while doing the tests. I can play around with some comparisons using release test data before implementing anything.

Btw, how did you perform the benchmark?

with the built-in benchmarks. I had a harness like this for both functions and then ran the benchmarks go test -bench=Bench -benchtime=20s:

func BenchmarkIsText(b *testing.B) {
	buf, err := os.ReadFile("testdata/script-pkg-managers")
	if err != nil {
		b.Fatalf("read testdata: %v", err)
	}
	for i := 0; i < b.N; i++ {
		isText(buf)
	}
}

@laurentsimon
Copy link
Contributor

Thanks for the additional info. So inlining code is what speed things up. I'm surprised there's a 2x speedup on 1k files due to that. In fact that seems weird to me because there's a single stack frame created to call the function...
It would be great to have cone numbers confirming we're not making things worse. Maybe we can add some logging code temporarily to call both function in the cron (gated with an env variable) and mine the logs for different results?

@spencerschrock
Copy link
Contributor Author

In fact that seems weird to me because there's a single stack frame created to call the function...

In a loop, called for each rune in the file contents.

Maybe we can add some logging code temporarily to call both function in the cron (gated with an env variable) and mine the logs for different results?

👍 SGTM.
env gated variable, enabled in the release test. the repos get shuffled daily, so should be able to test over the week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants