Order CPEs deterministically for SBOM reproducibility #2967

Closed
luhring opened this issue Jun 15, 2024 · 9 comments · Fixed by #3009 or #3085
Labels
bug Something isn't working


Contributor

luhring commented Jun 15, 2024

What happened:

I'm seeing nondeterministic behavior when using Syft as a library (in wolfictl) to generate SBOMs. I noticed this via new golden-file style tests we've introduced, to ensure we get the same output for the same input. For a couple of the test targets (which are each APK files), a test will fail on the next run immediately following that test's golden file update.

I'm not 100% sure this is Syft's fault yet, since there's wrapping code in wolfictl involved, too. But wanted to flag the issue here at least so we can discuss!

Here are some example diffs from two consecutive runs of the SBOM generation code under this test:

For jenkins-2.461-r0.apk

 "language": "java",
 "cpes": [
   {
-    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:jenkins:*:*",
-    "source": "nvd-cpe-dictionary"
-  },
-  {
-    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:*:*:*",
+    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:*:*:*",
+    "source": "nvd-cpe-dictionary"
+  },
+  {
+    "cpe": "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:jenkins:*:*",
     "source": "nvd-cpe-dictionary"

For jruby-9.4-9.4.7.0-r0.apk:

 {
-"id": "b18c20e1cb65977c",
+"id": "1ac1bd89c6841000",
 "name": "jruby-base",
 "version": "9.4.7.0",
 "type": "java-archive",
 ...
     }
   }
 ],
-"licenses": [],
+"licenses": [
+  {
+    "value": "Apache-2.0",
+    "spdxExpression": "Apache-2.0",
+    "type": "concluded",
+    "urls": [],
+    "locations": [
+      {
+        "path": "usr/share/jruby/lib/jruby.jar",
+        "accessPath": "usr/share/jruby/lib/jruby.jar",
+        "annotations": {
+          "evidence": "primary"
+        }
+      }
+    ]
+  }
+],

What you expected to happen:

Same exact output given same input!
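
A minimal sketch of how that expectation could be checked outside a golden-file test: call the generator several times and compare the bytes. Everything here is hypothetical (`reproducible`, the two toy generators, and the `example-*` entries); it does not use Syft or wolfictl. Map iteration is used only as a well-known source of run-to-run variation in Go, not as a claim about the cause of this issue.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strings"
)

// reproducible calls gen `runs` times and reports whether every output is
// byte-identical to the first one.
func reproducible(gen func() ([]byte, error), runs int) (bool, error) {
	var first []byte
	for i := 0; i < runs; i++ {
		out, err := gen()
		if err != nil {
			return false, err
		}
		if i == 0 {
			first = out
			continue
		}
		if !bytes.Equal(first, out) {
			return false, nil
		}
	}
	return true, nil
}

// licenses is toy data; the "example-*" entries are made up for illustration.
var licenses = map[string]string{
	"jruby-base": "Apache-2.0",
	"example-a":  "MIT",
	"example-b":  "BSD-3-Clause",
}

// unsortedGen writes entries in map iteration order, which Go randomizes.
func unsortedGen() ([]byte, error) {
	var sb strings.Builder
	for name, lic := range licenses {
		sb.WriteString(name + "=" + lic + "\n")
	}
	return []byte(sb.String()), nil
}

// sortedGen writes the same entries after sorting the keys.
func sortedGen() ([]byte, error) {
	names := make([]string, 0, len(licenses))
	for name := range licenses {
		names = append(names, name)
	}
	sort.Strings(names)
	var sb strings.Builder
	for _, name := range names {
		sb.WriteString(name + "=" + licenses[name] + "\n")
	}
	return []byte(sb.String()), nil
}

func main() {
	ok, err := reproducible(sortedGen, 20)
	fmt.Println("sorted keys reproducible:", ok, err)
	ok, err = reproducible(unsortedGen, 20)
	fmt.Println("map order reproducible:", ok, err) // usually false
}
```

Comparing raw bytes is the strictest check; it flags reordering as well as changed values, which matches both diffs above.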

Steps to reproduce the issue:

Check out https://github.com/wolfi-dev/wolfictl and run the test linked above. Note that you may have to run the test multiple times in order to get a complete sense of the results that code can produce. Also note that the first run of the test is doing a fetch of several APKs, so it will take considerably more time than subsequent test runs.

Anything else we need to know?:

So far the only test cases exhibiting this behavior are Java-based packages... 🤔

cc: @wagoodman, this is the thing we talked about briefly last week.

Environment:

  • Output of syft version:
$ go list -m all | grep syft           
github.com/anchore/syft v1.7.0
  • OS (e.g: cat /etc/os-release or similar): latest macOS
@luhring luhring added the bug Something isn't working label Jun 15, 2024

urld commented Jul 3, 2024

I observed a similar problem. When I run the same scan directly with the Syft CLI twice in a row, the CPE selected in the CycloneDX SBOM is nondeterministic:

{
  "bom-ref": "pkg:pypi/cryptography@3.2.1?package-id=a4e081620662b87d",
  "type": "library",
  "author": "The cryptography developers <cryptography-dev@python.org>",
  "name": "cryptography",
  "version": "3.2.1",
  "licenses": [
	{
	  "license": {
		"name": "BSD or Apache License, Version 2.0"
	  }
	}
  ],
-  "cpe": "cpe:2.3:a:python-cryptography_project:python-cryptography:3.2.1:*:*:*:*:*:*:*",
+  "cpe": "cpe:2.3:a:cryptography_project:cryptography:3.2.1:*:*:*:*:python:*:*",
  "purl": "pkg:pypi/cryptography@3.2.1",
  "properties": [
	{
	  "name": "syft:package:foundBy",
	  "value": "python-installed-package-cataloger"
	},
	{
	  "name": "syft:package:language",
	  "value": "python"
	},
	{
	  "name": "syft:package:type",
	  "value": "python"
	},
	{
	  "name": "syft:package:metadataType",
	  "value": "python-package"
	},
	{
	  "name": "syft:cpe23",
-	  "value": "cpe:2.3:a:cryptography_project:cryptography:3.2.1:*:*:*:*:python:*:*"
+	  "value": "cpe:2.3:a:python-cryptography_project:python-cryptography:3.2.1:*:*:*:*:*:*:*"
	},
	{
	  "name": "syft:location:0:path",
	  "value": "usr/lib64/python3.6/site-packages/cryptography-3.2.1-py3.6.egg-info/PKG-INFO"
	},
	{
	  "name": "syft:location:1:path",
	  "value": "usr/lib64/python3.6/site-packages/cryptography-3.2.1-py3.6.egg-info/top_level.txt"
	}
  ]
}

I get that CPEs are not an exact science, but I think the selection of CPE candidates should be done in a deterministic manner to reduce noise.


urld commented Jul 3, 2024

I think the sorting of the CPEs would just need to be extended to use lexicographical order in case of "ties":

getRank := func(source Source) int {
	if rank, exists := sourceOrder[source]; exists {
		return rank
	}
	// Sources we don't know about can't be assigned special priority, so
	// they are considered ties.
	return 4
}
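
A rough sketch of what that tie-break could look like. The types, the rank table, and `SortCPEs` are simplified stand-ins I made up for illustration, not Syft's actual code, and the specificity comparison that Syft also applies is left out. The idea is only that when two entries have the same rank, the comparator falls back to comparing the CPE strings, so no two distinct entries ever compare as equal.

```go
package main

import (
	"fmt"
	"sort"
)

// Source and SourcedCPE are hypothetical, simplified stand-ins.
type Source string

type SourcedCPE struct {
	CPE    string
	Source Source
}

// sourceRank is an illustrative priority table (lower rank sorts first).
var sourceRank = map[Source]int{
	"nvd-cpe-dictionary": 1,
}

// unknownRank is shared by every source missing from the table.
const unknownRank = 4

func rank(s Source) int {
	if r, ok := sourceRank[s]; ok {
		return r
	}
	return unknownRank
}

// SortCPEs orders by source rank, then lexicographically by CPE string,
// then by source name, so the result does not depend on input order.
func SortCPEs(cpes []SourcedCPE) {
	sort.Slice(cpes, func(i, j int) bool {
		ri, rj := rank(cpes[i].Source), rank(cpes[j].Source)
		if ri != rj {
			return ri < rj
		}
		if cpes[i].CPE != cpes[j].CPE {
			return cpes[i].CPE < cpes[j].CPE
		}
		return cpes[i].Source < cpes[j].Source
	})
}

func main() {
	cpes := []SourcedCPE{
		{CPE: "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:jenkins:*:*", Source: "nvd-cpe-dictionary"},
		{CPE: "cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:*:*:*", Source: "nvd-cpe-dictionary"},
	}
	SortCPEs(cpes)
	for _, c := range cpes {
		fmt.Println(c.CPE)
	}
}
```

With the two Jenkins CPEs from the first comment, the all-wildcard entry sorts first in either input order, because `*` compares lower than `j` at the first differing byte.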

@spiffcs spiffcs self-assigned this Jul 3, 2024
Contributor

spiffcs commented Jul 3, 2024

Thanks @luhring for the issue here. I've first focused on the CPE ordering. While I wasn't able to reproduce your exact example, I've found nondeterminism in the sorting here:

<           "cpe": "cpe:2.3:a:jenkins:pipeline_supporting_apis:865.v43e78cc44e0d:*:*:*:*:jenkins:*:*",
---
>           "cpe": "cpe:2.3:a:jenkins:pipeline\\:_supporting_apis:865.v43e78cc44e0d:*:*:*:*:jenkins:*:*",

I added your case, and the one I found while reproducing, to our tests here (branch not yet pushed):
https://github.com/anchore/syft/blob/573440b7cf82cc1f21fdeec0da5d744d6b110db5/syft/cpe/by_source_then_specificity_test.go

I then ran them without the test cache to see whether I could isolate where the reordering was happening, and whether it was indeed nondeterministic at the CPE sorting level. I got a 100% pass rate:

for i in (seq 1 100); go test -count=1 ./syft/cpe/...; end
ok      github.com/anchore/syft/syft/cpe        0.251s
ok      github.com/anchore/syft/syft/cpe        0.238s
ok      github.com/anchore/syft/syft/cpe        0.235s
ok      github.com/anchore/syft/syft/cpe        0.235s
ok      github.com/anchore/syft/syft/cpe        0.240s
ok      github.com/anchore/syft/syft/cpe        0.235s
ok      github.com/anchore/syft/syft/cpe        0.234s
ok      github.com/anchore/syft/syft/cpe        0.237s
ok      github.com/anchore/syft/syft/cpe        0.235s
...

This tells me that the sort we have written is stable. Given your hypothesis about it only being for java packages I'm going to check there and make sure we're using the correct CPE sorting when assembling those packages as a first step to resolve this.
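
A complementary check that might be worth adding, sketched here under assumptions: repeated runs over the same input order show the sort is repeatable for that order, while feeding the same items in a different order shows whether the result depends on discovery order. The comparators below are toy examples, not Syft's `BySourceThenSpecificity`, and the `vendor` helper is a naive split that ignores escaped colons.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// vendor returns the vendor field of a CPE 2.3 string
// ("cpe:2.3:<part>:<vendor>:..."). Naive: escaped colons are not handled.
func vendor(cpe string) string {
	parts := strings.Split(cpe, ":")
	if len(parts) < 4 {
		return ""
	}
	return parts[3]
}

// sortByVendorOnly is deliberately incomplete: CPEs with the same vendor
// compare as equal, so their relative order is inherited from the input.
func sortByVendorOnly(cpes []string) {
	sort.SliceStable(cpes, func(i, j int) bool {
		return vendor(cpes[i]) < vendor(cpes[j])
	})
}

// sortTotal compares whole strings, so distinct CPEs never tie.
func sortTotal(cpes []string) {
	sort.Strings(cpes)
}

// sameForReversedInput sorts the items in their given order and in reverse
// order, and reports whether both results are identical.
func sameForReversedInput(items []string, sortFn func([]string)) bool {
	forward := append([]string(nil), items...)
	backward := make([]string, len(items))
	for i, s := range items {
		backward[len(items)-1-i] = s
	}
	sortFn(forward)
	sortFn(backward)
	for i := range forward {
		if forward[i] != backward[i] {
			return false
		}
	}
	return true
}

func main() {
	cpes := []string{
		"cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:jenkins:*:*",
		"cpe:2.3:a:jenkins:mailer:472.vf7c289a_4b_420:*:*:*:*:*:*:*",
	}
	fmt.Println(sameForReversedInput(cpes, sortByVendorOnly)) // false
	fmt.Println(sameForReversedInput(cpes, sortTotal))        // true
}
```

A comparator with ties passes any number of repeated runs as long as the input order stays fixed, and fails as soon as the input order changes. That would be consistent with the sort tests passing while the full scan still varies.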

Contributor

spiffcs commented Jul 3, 2024

@luhring check out the branch in draft I created here and let me know if that seems to work for the jenkins package you mentioned.

I'm working on the jruby case now, but wanted to check whether the comment I made at the bottom of that PR, about certain Java packages "winning" nondeterministically for the final spot in the SBOM, matches the jruby case you posted.

The package that caused problems in the sample image I used to reproduce the issue was jansi.

Contributor Author

luhring commented Jul 9, 2024

Thanks @spiffcs! I just tried out the repro steps using Syft at f7ffcc5 and I'm still seeing the nondeterminism, unfortunately.

Do you want to share which part of the repro didn't work for you and we can go from there? I know it'd be easier to fix this if you could see what I'm seeing, so let me know how I can help. 🙇

@willmurphyscode willmurphyscode changed the title Nondeterministic SBOM generation Order CPEs deterministically for SBOM reproducibility Jul 11, 2024
Contributor

spiffcs commented Jul 18, 2024

@luhring I think it makes sense that you might still be seeing it. As of today I have an example I'm working with that has nondeterministic ordering for Java packages. Work is in progress to make the output the same for every run.

The issue arises when Java packages are discovered in a different order and one "wins" against another during deduplication. The winner during dedupe is not always consistent. I'm trying to track down all cases right now.
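
To make the failure mode concrete, here is a minimal sketch of a dedupe step whose winner does not depend on discovery order. `Pkg`, `better`, and `Dedupe` are hypothetical, and the winning rule (more license entries first, then the lower path) is an assumption chosen for illustration, not Syft's actual logic. It also assumes candidates for the same package have distinct paths.

```go
package main

import (
	"fmt"
	"sort"
)

// Pkg is a simplified, hypothetical stand-in for a cataloged package.
type Pkg struct {
	Name     string
	Version  string
	Path     string
	Licenses []string
}

// key identifies packages that should be merged into one entry.
func key(p Pkg) string { return p.Name + "@" + p.Version }

// better reports whether a should win over b. It depends only on the
// candidates themselves, never on which one was seen first.
func better(a, b Pkg) bool {
	if len(a.Licenses) != len(b.Licenses) {
		return len(a.Licenses) > len(b.Licenses)
	}
	return a.Path < b.Path
}

// Dedupe keeps one winner per key and returns winners sorted by key.
func Dedupe(pkgs []Pkg) []Pkg {
	winners := map[string]Pkg{}
	for _, p := range pkgs {
		cur, seen := winners[key(p)]
		if !seen || better(p, cur) {
			winners[key(p)] = p
		}
	}
	keys := make([]string, 0, len(winners))
	for k := range winners {
		keys = append(keys, k)
	}
	sort.Strings(keys) // never emit in map iteration order
	out := make([]Pkg, 0, len(keys))
	for _, k := range keys {
		out = append(out, winners[k])
	}
	return out
}

func main() {
	withLicense := Pkg{Name: "jruby-base", Version: "9.4.7.0",
		Path: "usr/share/jruby/lib/jruby.jar", Licenses: []string{"Apache-2.0"}}
	// The second candidate and its path are made up for this example.
	without := Pkg{Name: "jruby-base", Version: "9.4.7.0",
		Path: "hypothetical/other-location.jar"}

	for _, order := range [][]Pkg{{withLicense, without}, {without, withLicense}} {
		for _, p := range Dedupe(order) {
			fmt.Println(p.Name, p.Path, p.Licenses)
		}
	}
}
```

A "first seen wins" rule would print different winners for the two orders; here both orders keep the candidate that carries the license.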

I've got an example working with these two packages now that shows the inconsistency, so I think we're in good shape and on the same sheet of music:

"jenkins-2.461-r0.apk",
"jruby-9.4-9.4.7.0-r0.apk",

@spiffcs spiffcs reopened this Jul 18, 2024
Contributor Author

luhring commented Jul 22, 2024

Okay great! It sounds like you're off and running. Let me know if you think of anything I can do here. 🙇

Contributor

kzantow commented Jul 30, 2024

Hey @luhring -- I think everything reported in this issue has been taken care of, but please do let us know if you see any further instances of nondeterminism!

Contributor Author

luhring commented Jul 31, 2024

I've just run the tests several times with Syft 1.10.0 and it appears to be stable. 🎉

Thanks so much!! I'll report back if I find anything unexpected

4 participants