Code hotspots #2412

spencerschrock · 2022-10-31T22:46:01Z

A significant portion (5.5%) of the cron job is spent in raw.isText

Lines 170 to 180 in 1fa7910

    
           func isText(content []byte) bool { 
        
           	for _, c := range string(content) { 
        
           		if c == '\t' || c == '\n' || c == '\r' { 
        
           			continue 
        
           		} 
        
           		if !unicode.IsPrint(c) { 
        
           			return false 
        
           		} 
        
           	} 
        
           	return true 
        
           }

There's some experimental code that also does this, and has better performance. I plan to import/copy the implementation over.

// IsText reports whether a significant prefix of s looks like correct UTF-8;
// that is, if it is likely that s is human-readable text.
func IsText(s []byte) bool {
	const max = 1024 // at least utf8.UTFMax
	if len(s) > max {
		s = s[0:max]
	}
	for i, c := range string(s) {
		if i+utf8.UTFMax > len(s) {
			// last char may be incomplete - ignore
			break
		}
		if c == 0xFFFD || c < ' ' && c != '\n' && c != '\t' && c != '\f' {
			// decoding error or control character - not a text file
			return false
		}
	}
	return true
}

Should be an easy performance win, but would change the behavior slightly as it only checks the first 1024 bytes.

Since you implemented the heuristic, thoughts @laurentsimon ? I know the code was introduced to reduce false positives, so should still be valid for that.

Benchmarking shows a 2x speedup on small files (1KiB). 6x speedup on medium files (3KiB). And on gigantic files (like our 50MB projects.csv file), the difference is 5-6 orders of magnitude.

cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
BenchmarkIsText-8                     79         295087760 ns/op
BenchmarkExternalIsText-8       13462120              1763 ns/op

The text was updated successfully, but these errors were encountered:

laurentsimon · 2022-11-01T06:48:04Z

Do you know why it is faster? What's the source of slowdown in the current version?
Looking at the first 1k byte would be OK if we have data (from cron) that confirms it does not change the results?

Btw, how did you perform the benchmark?

/cc @jeffmendoza

jeffmendoza · 2022-11-01T15:27:03Z

1k seems good to me.

spencerschrock · 2022-11-01T16:24:02Z

Do you know why it is faster? What's the source of slowdown in the current version?

We read the whole file instead of first 1KiB.
We call unicode.IsPrint, instead of doing the boolean expression inline which is expensive according to the profiling.

(pprof) list isText
Total: 12.68s
ROUTINE ======================== github.com/ossf/scorecard/v4/checks/raw.isText in github.com/ossf/scorecard/v4/checks/raw/binary_artifact.go
      80ms      700ms (flat, cum)  5.52% of Total
         .          .    166:	return true, nil
         .          .    167:}
         .          .    168:
         .          .    169:// TODO: refine this function.
         .          .    170:func isText(content []byte) bool {
      40ms      200ms    171:	for _, c := range string(content) {
      40ms       40ms    172:		if c == '\t' || c == '\n' || c == '\r' {
         .          .    173:			continue
         .          .    174:		}
         .      460ms    175:		if !unicode.IsPrint(c) {
         .          .    176:			return false
         .          .    177:		}
         .          .    178:	}
         .          .    179:	return true
         .          .    180:}

Looking at the first 1k byte would be OK if we have data (from cron) that confirms it does not change the results?

Might be hard to test that ahead of time, especially as the repos could change while doing the tests. I can play around with some comparisons using release test data before implementing anything.

Btw, how did you perform the benchmark?

with the built-in benchmarks. I had a harness like this for both functions and then ran the benchmarks go test -bench=Bench -benchtime=20s:

func BenchmarkIsText(b *testing.B) {
	buf, err := os.ReadFile("testdata/script-pkg-managers")
	if err != nil {
		b.Fatalf("read testdata: %v", err)
	}
	for i := 0; i < b.N; i++ {
		isText(buf)
	}
}

laurentsimon · 2022-11-01T16:59:19Z

Thanks for the additional info. So inlining code is what speed things up. I'm surprised there's a 2x speedup on 1k files due to that. In fact that seems weird to me because there's a single stack frame created to call the function...
It would be great to have cone numbers confirming we're not making things worse. Maybe we can add some logging code temporarily to call both function in the cron (gated with an env variable) and mine the logs for different results?

spencerschrock · 2022-11-01T18:14:03Z

In fact that seems weird to me because there's a single stack frame created to call the function...

In a loop, called for each rune in the file contents.

Maybe we can add some logging code temporarily to call both function in the cron (gated with an env variable) and mine the logs for different results?

👍 SGTM.
env gated variable, enabled in the release test. the repos get shuffled daily, so should be able to test over the week

spencerschrock added the needs discussion label Oct 31, 2022

This was referenced Nov 1, 2022

🌱 Enable comparison for alternative isText implementation #2414

Merged

🐛 modify alternative isText to accept carriage returns #2421

Merged

🌱 enable more performant isText #2433

Merged

spencerschrock closed this as completed in #2433 Nov 7, 2022

spencerschrock mentioned this issue Mar 5, 2024

🐛 Limit Binary Artifact file reads to first 1024 bytes #3923

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Code hotspots #2412

Code hotspots #2412

spencerschrock commented Oct 31, 2022

laurentsimon commented Nov 1, 2022 •

edited

Loading

jeffmendoza commented Nov 1, 2022

spencerschrock commented Nov 1, 2022

laurentsimon commented Nov 1, 2022

spencerschrock commented Nov 1, 2022

Code hotspots #2412

Code hotspots #2412

Comments

spencerschrock commented Oct 31, 2022

laurentsimon commented Nov 1, 2022 • edited Loading

jeffmendoza commented Nov 1, 2022

spencerschrock commented Nov 1, 2022

laurentsimon commented Nov 1, 2022

spencerschrock commented Nov 1, 2022

laurentsimon commented Nov 1, 2022 •

edited

Loading