Deadlock in HydrateWithClient #1077

spencerschrock · 2024-06-27T19:45:10Z

The OpenSSF Scorecard project had a few of our Kubernetes pods get stuck while running osv-scanner. Our binaries had a few profiling/monitoring routines running, so while the executable never fatally deadlocked, no work was being performed.

Debugging with pprof showed the following goroutine stack dump:

goroutine profile: total 46
25 @ 0x4401ae 0x4099ad 0x409617 0xd1febb 0x4774c1
#	0xd1feba	github.com/google/osv-scanner/pkg/osv.HydrateWithClient.func1+0x11a	github.com/google/osv-scanner@v1.7.4/pkg/osv/osv.go:292

1 @ 0x4401ae 0x451e65 0xabec94 0xd1fb7c 0xf89cdc 0xf89cd7 0xf88cb9 0xf8c59f 0x122f02a 0x1237c18 0x1128eb1 0x1250550 0x4774c1
#	0xabec93	golang.org/x/sync/semaphore.(*Weighted).Acquire+0x333				golang.org/x/sync@v0.7.0/semaphore/semaphore.go:74
#	0xd1fb7b	github.com/google/osv-scanner/pkg/osv.HydrateWithClient+0x31b			github.com/google/osv-scanner@v1.7.4/pkg/osv/osv.go:285
#	0xf89cdb	github.com/google/osv-scanner/pkg/osv.Hydrate+0x6fb				github.com/google/osv-scanner@v1.7.4/pkg/osv/osv.go:265
#	0xf89cd6	github.com/google/osv-scanner/pkg/osvscanner.makeRequest+0x6f6			github.com/google/osv-scanner@v1.7.4/pkg/osvscanner/osvscanner.go:982
#	0xf88cb8	github.com/google/osv-scanner/pkg/osvscanner.DoScan+0x1278			github.com/google/osv-scanner@v1.7.4/pkg/osvscanner/osvscanner.go:837
#	0xf8c59e	github.com/ossf/scorecard/v5/clients.osvClient.ListUnfixedVulnerabilities+0x29e	github.com/ossf/scorecard/v5/clients/osv.go:55
#	0x122f029	github.com/ossf/scorecard/v5/checks/raw.Vulnerabilities+0x149			github.com/ossf/scorecard/v5/checks/raw/vulnerabilities.go:36
#	0x1237c17	github.com/ossf/scorecard/v5/checks.Vulnerabilities+0x57			github.com/ossf/scorecard/v5/checks/vulnerabilities.go:43
#	0x1128eb0	github.com/ossf/scorecard/v5/checker.(*Runner).Run+0x8f0			github.com/ossf/scorecard/v5/checker/check_runner.go:118
#	0x125054f	github.com/ossf/scorecard/v5/pkg.runEnabledChecks.func1+0x1af			github.com/ossf/scorecard/v5/pkg/scorecard.go:68

1 @ 0x4401ae 0x452ea5 0x452e74 0x4734a5 0x4842c8 0x1250332 0x4774c1
#	0x4734a4	sync.runtime_Semacquire+0x24				runtime/sema.go:62
#	0x4842c7	sync.(*WaitGroup).Wait+0x47				sync/waitgroup.go:116
#	0x1250331	github.com/ossf/scorecard/v5/pkg.runEnabledChecks+0x231	github.com/ossf/scorecard/v5/pkg/scorecard.go:71

## some unrelated monitoring and pprof routines I've omitted. I can provide it if needed.

Observations:

The projects that we were scanning when the deadlock occurred seem to have many open vulnerabilities, indicating there were a maximal number of concurrent requests (25+).
The frozen goroutines seem to be stuck on line 285 and 292 (of v1.7.4)

With all that in mind, I think the following occurred:

There were 25+ vulnerabilities which needed hydrating causing the semaphore to be depleted, and the "main" goroutine couldn't proceed

osv-scanner/pkg/osv/osv.go

Lines 283 to 285 in d4657bf

    
           for batchIdx, response := range resp.Results { 
        
           	for resultIdx, vuln := range response.Vulns { 
        
           		if err := rateLimiter.Acquire(ctx, 1); err != nil {

All 25 of the hydrating goroutines had an error calling GetWithClient but cannot send their error on the errChan because the "main" goroutine does not read from errChan until after starting a hydration goroutine for all vulns. This seems like an uncommon occurrence, but I think Failed to hydrate an OSV response due to an unexpected severity type format osv.dev#2335 caused an increased error rate this week and made this happen.

osv-scanner/pkg/osv/osv.go

Lines 289 to 292 in d4657bf

    
           go func(id string, batchIdx int, resultIdx int) { 
        
           	vuln, err := GetWithClient(id, client) 
        
           	if err != nil { 
        
           		errChan <- err

osv-scanner/pkg/osv/osv.go

Lines 313 to 315 in d4657bf

    
           for err := range errChan { 
        
           	return nil, err 
        
           }

Deadlock. All 25 requests were frozen until the "main" goroutine continued, which could not happen due to concurrency limits.

The text was updated successfully, but these errors were encountered:

spencerschrock mentioned this issue Jun 27, 2024

fix: use errgroup to avoid hydration deadlock scenario #1078

Merged

another-rex closed this as completed in #1078 Jun 28, 2024

another-rex closed this as completed in 3ea2155 Jun 28, 2024

spencerschrock mentioned this issue Jul 1, 2024

✨ Add important Go packages to projects.csv ossf/scorecard#4176

Merged

1 task

BrewTestBot mentioned this issue Jul 10, 2024

osv-scanner 1.8.2 Homebrew/homebrew-core#176888

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deadlock in HydrateWithClient #1077

Deadlock in HydrateWithClient #1077

spencerschrock commented Jun 27, 2024 •

edited

Loading

Deadlock in HydrateWithClient #1077

Deadlock in HydrateWithClient #1077

Comments

spencerschrock commented Jun 27, 2024 • edited Loading

spencerschrock commented Jun 27, 2024 •

edited

Loading