Order CPEs deterministically for SBOM reproducibility #2967

luhring · 2024-06-15T15:47:29Z

What happened:

I'm seeing nondeterministic behavior when using Syft as a library (in wolfictl) to generate SBOMs. I noticed this via new golden-file style tests we've introduced, to ensure we get the same output for the same input. For a couple of the test targets (which are each APK files), a test will fail on the next run immediately following that test's golden file update.

I'm not 100% sure this is Syft's fault yet, since there's wrapping code in wolfictl involved, too. But wanted to flag the issue here at least so we can discuss!

Here are some example diffs from two consecutive runs of the SBOM generation code under this test:

For `jenkins-2.461-r0.apk`

 "language": "java",
 "cpes": [
   {
-    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:jenkins:*:*",
-    "source": "nvd-cpe-dictionary"
-  },
-  {
-    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:*:*:*",
+    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:*:*:*",
+    "source": "nvd-cpe-dictionary"
+  },
+  {
+    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:jenkins:*:*",
     "source": "nvd-cpe-dictionary"

For `jruby-9.4-9.4.7.0-r0.apk`:

 {
-"id": "b18c20e1cb65977c",
+"id": "1ac1bd89c6841000",
 "name": "jruby-base",
 "version": "9.4.7.0",
 "type": "java-archive",
 ...
     }
   }
 ],
-"licenses": [],
+"licenses": [
+  {
+    "value": "Apache-2.0",
+    "spdxExpression": "Apache-2.0",
+    "type": "concluded",
+    "urls": [],
+    "locations": [
+      {
+        "path": "usr/share/jruby/lib/jruby.jar",
+        "accessPath": "usr/share/jruby/lib/jruby.jar",
+        "annotations": {
+          "evidence": "primary"
+        }
+      }
+    ]
+  }
+],

What you expected to happen:

Same exact output given same input!
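The expectation above can be checked mechanically by running the generator several times and comparing the raw bytes. The following is only a sketch: `firstDiff` and the `gen` callback are hypothetical names, and `gen` stands in for whatever code produces the serialized SBOM (it is not a Syft or wolfictl API).

```go
package main

import (
	"bytes"
	"fmt"
)

// firstDiff calls gen `runs` times and returns the index of the first run
// whose output differs from run 0, or -1 if every run matched.
// gen is a hypothetical stand-in for the SBOM generation code under test.
func firstDiff(gen func() ([]byte, error), runs int) (int, error) {
	var baseline []byte
	for i := 0; i < runs; i++ {
		out, err := gen()
		if err != nil {
			return -1, fmt.Errorf("run %d: %w", i, err)
		}
		if i == 0 {
			baseline = out
			continue
		}
		if !bytes.Equal(baseline, out) {
			return i, nil
		}
	}
	return -1, nil
}

func main() {
	// A fake generator that changes its output on the third call,
	// imitating an SBOM that is only sometimes different.
	calls := 0
	flaky := func() ([]byte, error) {
		calls++
		if calls == 3 {
			return []byte(`{"cpes":["b","a"]}`), nil
		}
		return []byte(`{"cpes":["a","b"]}`), nil
	}

	idx, err := firstDiff(flaky, 5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("first differing run:", idx)
}
```

Comparing bytes rather than parsed structures keeps ordering differences visible, which is the kind of drift shown in the diffs above.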

Steps to reproduce the issue:

Check out https://github.com/wolfi-dev/wolfictl and run the test linked above. Note that you may have to run the test multiple times in order to get a complete sense of the results that code can produce. Also note that the first run of the test is doing a fetch of several APKs, so it will take considerably more time than subsequent test runs.

Anything else we need to know?:

So far the only test cases exhibiting this behavior are Java-based packages... 🤔

cc: @wagoodman, this is the thing we talked about briefly last week.

Environment:

Output of syft version:

$ go list -m all | grep syft           
github.com/anchore/syft v1.7.0

OS (e.g: cat /etc/os-release or similar): latest macOS


urld · 2024-07-03T11:44:12Z

I observed a similar problem when running the same scan directly with the syft CLI twice in a row: I get a nondeterministic selection of the CPE in the CycloneDX SBOM:

{
  "bom-ref": "pkg:pypi/cryptography@3.2.1?package-id=a4e081620662b87d",
  "type": "library",
  "author": "The cryptography developers <cryptography-dev@python.org>",
  "name": "cryptography",
  "version": "3.2.1",
  "licenses": [
	{
	  "license": {
		"name": "BSD or Apache License, Version 2.0"
	  }
	}
  ],
-  "cpe": "cpe:2.3:a:python-cryptography_project:python-cryptography:3.2.1:*:*:*:*:*:*:*",
+  "cpe": "cpe:2.3:a:cryptography_project:cryptography:3.2.1:*:*:*:*:python:*:*",
  "purl": "pkg:pypi/cryptography@3.2.1",
  "properties": [
	{
	  "name": "syft:package:foundBy",
	  "value": "python-installed-package-cataloger"
	},
	{
	  "name": "syft:package:language",
	  "value": "python"
	},
	{
	  "name": "syft:package:type",
	  "value": "python"
	},
	{
	  "name": "syft:package:metadataType",
	  "value": "python-package"
	},
	{
	  "name": "syft:cpe23",
-	  "value": "cpe:2.3:a:cryptography_project:cryptography:3.2.1:*:*:*:*:python:*:*"
+	  "value": "cpe:2.3:a:python-cryptography_project:python-cryptography:3.2.1:*:*:*:*:*:*:*"
	},
	{
	  "name": "syft:location:0:path",
	  "value": "usr/lib64/python3.6/site-packages/cryptography-3.2.1-py3.6.egg-info/PKG-INFO"
	},
	{
	  "name": "syft:location:1:path",
	  "value": "usr/lib64/python3.6/site-packages/cryptography-3.2.1-py3.6.egg-info/top_level.txt"
	}
  ]
}

I get that CPEs are not an exact science, but I think the selection of the CPE candidates should be done in a deterministic manner to reduce noise.

urld · 2024-07-03T12:53:23Z

I think the sorting of the CPEs would just need to be extended to use lexicographical order in case of "ties":

syft/syft/cpe/by_source_then_specificity.go

Lines 18 to 24 in 573440b

    
    getRank := func(source Source) int {
        if rank, exists := sourceOrder[source]; exists {
            return rank
        }
        return 4 // Sourced we don't know about can't be assigned special priority, so
        // are considered ties.
    }
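A minimal sketch of the suggested tie-break, under stated assumptions: the `candidate` type, the `sourceRank` table, and `sortCandidates` are hypothetical stand-ins and not Syft's actual types, and the specificity comparison that the real sorter performs is left out for brevity. The only point illustrated is that falling back to a plain string comparison removes ties between distinct CPEs.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical, simplified CPE record (not Syft's type).
type candidate struct {
	CPE    string
	Source string
}

// sourceRank is an illustrative priority table; the values are assumptions.
var sourceRank = map[string]int{
	"nvd-cpe-dictionary": 1,
}

// rank mirrors the idea of the getRank closure quoted above:
// unknown sources all share the same fallback rank.
func rank(source string) int {
	if r, ok := sourceRank[source]; ok {
		return r
	}
	return 4
}

// sortCandidates orders by source rank first and breaks ties
// lexicographically on the CPE string.
func sortCandidates(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool {
		ri, rj := rank(cs[i].Source), rank(cs[j].Source)
		if ri != rj {
			return ri < rj
		}
		return cs[i].CPE < cs[j].CPE
	})
}

func main() {
	cs := []candidate{
		{CPE: "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:jenkins:*:*", Source: "nvd-cpe-dictionary"},
		{CPE: "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:*:*:*", Source: "nvd-cpe-dictionary"},
	}
	sortCandidates(cs)
	for _, c := range cs {
		fmt.Println(c.CPE)
	}
}
```

Both entries share a source, so without the final comparison their relative order would depend on the order in which they were collected. With it, the result is the same for any input order.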

spiffcs · 2024-07-03T18:28:12Z

Thanks @luhring for the issue here. I've first focused on the CPE ordering. While I wasn't able to reproduce your exact example, I've found nondeterminism in the sorting here:

<           "cpe": "cpe:2.3:a:jenkins:pipeline_supporting_apis:865.v43e78cc44e0d:*:*:*:*:jenkins:*:*",
---
>           "cpe": "cpe:2.3:a:jenkins:pipeline\\:_supporting_apis:865.v43e78cc44e0d:*:*:*:*:jenkins:*:*",

I added your case, and the one I found when reproducing, to our tests found here (branch not yet pushed):
https://github.com/anchore/syft/blob/573440b7cf82cc1f21fdeec0da5d744d6b110db5/syft/cpe/by_source_then_specificity_test.go

I ran them without the test cache to isolate where the reordering was happening and to check whether it was indeed nondeterministic at the CPE sorting level. I got a 100% pass rate.

for i in (seq 1 100); go test -count=1 ./syft/cpe/...; end
ok      github.com/anchore/syft/syft/cpe        0.251s
ok      github.com/anchore/syft/syft/cpe        0.238s
ok      github.com/anchore/syft/syft/cpe        0.235s
ok      github.com/anchore/syft/syft/cpe        0.235s
ok      github.com/anchore/syft/syft/cpe        0.240s
ok      github.com/anchore/syft/syft/cpe        0.235s
ok      github.com/anchore/syft/syft/cpe        0.234s
ok      github.com/anchore/syft/syft/cpe        0.237s
ok      github.com/anchore/syft/syft/cpe        0.235s
...

This tells me that the sort we have written is stable. Given your hypothesis that it only affects Java packages, I'm going to check there and make sure we're using the correct CPE sorting when assembling those packages, as a first step toward resolving this.

spiffcs · 2024-07-03T19:35:37Z

@luhring check out the branch in draft I created here and let me know if that seems to work for the jenkins package you mentioned.

I'm working on the jruby case now, but wanted to see whether the comment I made at the bottom of that PR, about certain Java packages "winning" nondeterministically for the final spot in the SBOM, is similar to the jruby case you posted.

The package that caused issues in my sample image, and that reproduces the problem, was jansi.

luhring · 2024-07-09T18:53:43Z

Thanks @spiffcs! I just tried out the repro steps using Syft at f7ffcc5 and I'm still seeing the nondeterminism, unfortunately.

Do you want to share which part of the repro didn't work for you and we can go from there? I know it'd be easier to fix this if you could see what I'm seeing, so let me know how I can help. 🙇

spiffcs · 2024-07-18T16:48:58Z

@luhring I think it makes sense that you might still be seeing it. As of today I have an example I'm working with that has nondeterministic ordering for Java packages. Work is in progress to make that the same for every run.

The issue arises when Java packages are discovered in a different order and one "wins" against another during deduplication. The winner during dedupe is not always consistent. I'm trying to track down all cases right now.

I've got an example working with these two packages now that shows the inconsistency, so I think we're in good shape and on the same sheet of music:

"jenkins-2.461-r0.apk",
"jruby-9.4-9.4.7.0-r0.apk",
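To illustrate the deduplication problem described above, here is a sketch in which the winner among duplicates is chosen by an explicit rule instead of by discovery order. Everything in it is an assumption made for illustration: the `pkg` type, the `better` rule (more license entries first, then the lexicographically smaller path), and the second jar path are hypothetical and do not reflect how Syft's Java cataloger or the eventual fix work.

```go
package main

import "fmt"

// pkg is a hypothetical, simplified package record (not Syft's pkg.Package).
type pkg struct {
	Name     string
	Version  string
	Path     string // where the package was discovered
	Licenses []string
}

// better reports whether a should win over b. The rule is illustrative:
// prefer the record with more license entries, then the smaller path.
func better(a, b pkg) bool {
	if len(a.Licenses) != len(b.Licenses) {
		return len(a.Licenses) > len(b.Licenses)
	}
	return a.Path < b.Path
}

// dedupe keeps one record per name@version. Because better does not look at
// arrival order, the winner is the same for any discovery order, provided no
// two duplicates tie on both license count and path.
func dedupe(pkgs []pkg) map[string]pkg {
	winners := make(map[string]pkg)
	for _, p := range pkgs {
		key := p.Name + "@" + p.Version
		cur, seen := winners[key]
		if !seen || better(p, cur) {
			winners[key] = p
		}
	}
	return winners
}

func main() {
	withLicense := pkg{
		Name: "jruby-base", Version: "9.4.7.0",
		Path:     "usr/share/jruby/lib/jruby.jar",
		Licenses: []string{"Apache-2.0"},
	}
	withoutLicense := pkg{
		Name: "jruby-base", Version: "9.4.7.0",
		Path: "usr/share/jruby/lib/hypothetical-copy.jar", // made-up path
	}

	first := dedupe([]pkg{withLicense, withoutLicense})
	second := dedupe([]pkg{withoutLicense, withLicense})

	fmt.Println(first["jruby-base@9.4.7.0"].Path)
	fmt.Println(second["jruby-base@9.4.7.0"].Path)
}
```

A "first seen wins" rule would instead produce an empty `licenses` list in one run and a populated one in another, which resembles the jruby diff at the top of this issue.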

luhring · 2024-07-22T15:26:05Z

Okay great! It sounds like you're off and running. Let me know if you think of anything I can do here. 🙇

kzantow · 2024-07-30T16:16:13Z

Hey @luhring -- I think everything reported in this issue has been taken care of, but please do let us know if you see any further instances of nondeterminism!

luhring · 2024-07-31T15:03:10Z

I've just run the tests several times with Syft 1.10.0 and it appears to be stable. 🎉

Thanks so much!! I'll report back if I find anything unexpected.

luhring added the bug Something isn't working label Jun 15, 2024

anchoretoolsops added this to OSS Jun 15, 2024

spiffcs moved this to In Progress in OSS Jul 3, 2024

spiffcs self-assigned this Jul 3, 2024

spiffcs mentioned this issue Jul 3, 2024

fix: stabilize cpe sorting during collection sort #3009

Merged

spiffcs closed this as completed in #3009 Jul 9, 2024

github-project-automation bot moved this from In Progress to Done in OSS Jul 9, 2024

willmurphyscode changed the title ~~Nondeterministic SBOM generation~~ Order CPEs deterministically for SBOM reproducibility Jul 11, 2024

BrewTestBot mentioned this issue Jul 11, 2024

syft 1.9.0 Homebrew/homebrew-core#177069

Merged

spiffcs reopened this Jul 18, 2024

spiffcs moved this from Done to In Progress in OSS Jul 23, 2024

kzantow mentioned this issue Jul 30, 2024

fix: improve determinism in java archive identification #3085

Merged

kzantow closed this as completed in #3085 Jul 30, 2024

github-project-automation bot moved this from In Progress to Done in OSS Jul 30, 2024

kzantow mentioned this issue Jul 30, 2024

fields are not identical between scans in maven #2463

Closed

BrewTestBot mentioned this issue Jul 30, 2024

syft 1.10.0 Homebrew/homebrew-core#179004

Merged
