
Issue #931 : Support for Google OSV #1703

Merged
merged 28 commits into from
Jul 24, 2022

Conversation

sahibamittal
Contributor

@sahibamittal sahibamittal commented Jun 10, 2022

Signed-off-by: Sahiba Mittal sahibamittal98@gmail.com

Issue: #931

Google OSV Repo -> https://github.com/google/osv

Frontend Changes: DependencyTrack/frontend#170

  1. Download vulnerabilities from Google OSV data dumps (via https://github.com/google/osv#data-dumps).
    NOTE: An alternative is to use the Google client library to retrieve the data from the bucket.
  2. Extract the JSON files, then parse and map the data to Dependency-Track vulnerability objects.
  3. Save them in the database using the query manager.
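Steps 2 and 3 can be sketched roughly like this (a minimal illustration only, not the actual task code; the parse/persist step is merely indicated in a comment):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class OsvZipSketch {

    // Walks an OSV data dump (a zip containing one JSON document per
    // vulnerability) and collects the raw JSON strings. In the real task,
    // each document would then be parsed, mapped to a Vulnerability
    // object, and persisted via the query manager.
    static List<String> extractJsonDocuments(final InputStream zipStream) throws IOException {
        final List<String> docs = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().endsWith(".json")) {
                    continue; // skip anything that isn't a vulnerability document
                }
                // readAllBytes on a ZipInputStream stops at the end of the current entry
                docs.add(new String(zip.readAllBytes(), StandardCharsets.UTF_8));
                // parse + map + qm.synchronizeVulnerability(...) would happen here
            }
        }
        return docs;
    }
}
```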

TBD: Mapping of OSV data to dependency-track data.
In the GitHub Advisory data we receive a version range object, whereas in Google OSV we get an array of versions that can be used as a range. But on comparing the two (say, for GHSA id GHSA-vvw4-rfwf-p6hx), the lower/upper ranges are a bit different.

Objects from Github Advisory: https://docs.github.com/en/graphql/reference/objects#securityvulnerability
Objects from Google OSV: https://osv.dev/list?page=21&ecosystem=Maven

Example for difference between github and google for GHSA-vvw4-rfwf-p6hx :
Github: GHSA-vvw4-rfwf-p6hx
Google: attachment json
GHSA-vvw4-rfwf-p6hx.json.zip

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
@sahibamittal
Contributor Author

sahibamittal commented Jun 10, 2022

@nscuro @VinodAnandan

Member

@nscuro nscuro left a comment


The direction you're headed in looks good, @sahibamittal.

[...] but on comparing (say, for ghsa id GHSA-vvw4-rfwf-p6hx), lower/upper ranges are bit different.

Can you elaborate a bit? I am not able to spot a difference in the ranges, but it's totally possible that I'm not looking close enough.

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
@stevespringett
Member

@nscuro Anything outstanding prior to merge?

@nscuro
Member

nscuro commented Jun 16, 2022

@stevespringett I haven't had a chance to review the latest changes, and haven't tested the new functionality yet. I'll try to get it done tonight.

@stevespringett
Member

It doesn't appear that authentication is required in order to download the OSV ecosystem data. In the configuration, OSV is disabled by default.

Is this what the default should be? Are there reasons why OSV should not be enabled by default?

@sahibamittal
Contributor Author

@nscuro I tried to run the container, but it shows a 'Connection timed out' error at URL.openStream in OSVDownloadTask, so it's not downloading the data. Have you seen this during GitHub mirroring?

@sahibamittal
Contributor Author

It doesn't appear that authentication is required in order to download the OSV ecosystem data. In the configuration, OSV is disabled by default.

Is this what the default should be? Are there reasons why OSV should not be enabled by default?

Yes, it should be enabled by default. The property value for the enabled flag comes through as true from the front-end, but I'll change the default to true in the backend as well.

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
@nscuro
Member

nscuro commented Jun 16, 2022

@sahibamittal I tried to run the container, but it shows a 'Connection timed out' error at URL.openStream in OSVDownloadTask, so it's not downloading the data. Have you seen this during GitHub mirroring?

Are you connected to a corporate network of some sort? Those networks typically require the use of a proxy to reach external hosts. I'm not sure whether URL.openStream honors the system proxy configuration, but I assume it doesn't.

It may be a good idea to use HttpClientPool to acquire an HTTP client that does honor the proxy configuration. You can find an example usage in NistMirrorTask:

```java
try (final CloseableHttpResponse response = HttpClientPool.getClient().execute(request)) {
    final StatusLine status = response.getStatusLine();
    final long end = System.currentTimeMillis();
    metricDownloadTime += end - start;
    if (status.getStatusCode() == 200) {
        LOGGER.info("Downloading...");
        try (InputStream in = response.getEntity().getContent()) {
            File temp = File.createTempFile(filename, null);
            // ...
        }
    }
}
```

You may need to then configure the proxy, see "Proxy Configuration" section at https://docs.dependencytrack.org/getting-started/configuration/

For running the entire DT application, you don't need to build the container. I'd recommend launching it via Maven, see: https://github.com/DependencyTrack/dependency-track/blob/a5037fa42e73343cab513df14af5475cbc67896c/DEVELOPING.md#debugging

@nscuro
Member

nscuro commented Jun 16, 2022

One issue I noticed is that GHSA vulnerabilities are duplicated. We'll have vulnerabilities with vulnId GHSA-* with source=GITHUB, and source=GOOGLE. As a consequence, the internal analyzer will report such GHSA vulns twice. The OSS Index scanner does not suffer from this, b/c it checks for existing CVEs:

```java
Vulnerability vulnerability = qm.getVulnerabilityByVulnId(
        Vulnerability.Source.NVD, reportedVuln.getCve());
```

@stevespringett, will #1642 help here?

@stevespringett
Member

@nscuro #1642 will help to an extent, yes. Still not sure how we're going to visualize this though.

@sahibamittal
Contributor Author

One issue I noticed is that GHSA vulnerabilities are duplicated. We'll have vulnerabilities with vulnId GHSA-* with source=GITHUB, and source=GOOGLE. As a consequence, the internal analyzer will report such GHSA vulns twice. The OSS Index scanner does not suffer from this, b/c it checks for existing CVEs:

```java
Vulnerability vulnerability = qm.getVulnerabilityByVulnId(
        Vulnerability.Source.NVD, reportedVuln.getCve());
```

@stevespringett, will #1642 help here?

Can we maybe disable GitHub advisories and make OSV the default vulnerability source, to avoid duplication? The GitHub source is a subset of OSV.

@nscuro
Member

nscuro commented Jun 17, 2022

A viable option for sure, especially since that would also remove the necessity of configuring a PAT for GHSA mirroring.

Few things to check:

  • Does OSV contain all GitHub advisories, or only a subset?
  • How quickly are GitHub advisories ingested into OSV? A significant delay (say >1 day) would be suboptimal.

We'll also need to think of the migration scenario: Folks will already have lots of GHSA vulns in their portfolio, and they'll have audited them as well. We can't just remove the "old" GHSA vulns when upgrading, as audit trails will be lost.

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
@stevespringett
Member

@nscuro to my knowledge, OSV only contains a subset of GHSA. They only include the "reviewed" vulnerabilities (approx 7500). They do not include the unreviewed vulnerabilities (>170K). In the future, I want to make a configurable option for the GitHub Advisories integration for enabling reviewed and unreviewed independently. This would allow folks to use OSV for all reviewed GHSAs and use the GitHub Advisory integration for all unreviewed vulnerabilities.

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
@stevespringett
Member

Does OSV have any vulnerabilities that are specific to them? For example, with Sonatype OSS Index: if what OSS Index found was a CVE, the source is set to NVD; if it was a Sonatype-specific finding, the source is set to OSSINDEX.

We need to do the same here. If OSV reports a CVE, GHSA, etc., then the source needs to be set to NVD or GITHUB accordingly, since they are the source of the vulnerability, not OSV. OSV is just acting as an interface.

But if OSV does have its own findings (which I think they do with OSS-Fuzz), then OSV would be the source for only those. This will also prevent the duplicates from occurring, and will prevent duplicate records being mirrored in the internal database.
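That attribution rule could be sketched as follows (illustrative only; the method name `sourceForVulnId` is hypothetical and not part of the actual codebase):

```java
public class VulnSourceSketch {

    enum Source { NVD, GITHUB, OSV }

    // OSV acts as an interface to several upstream databases. Attribute
    // each record to the authoritative source based on its identifier
    // prefix, falling back to OSV for its own findings (e.g. the
    // OSS-Fuzz-derived OSV-2022-217).
    static Source sourceForVulnId(final String vulnId) {
        if (vulnId.startsWith("CVE-")) {
            return Source.NVD;
        }
        if (vulnId.startsWith("GHSA-")) {
            return Source.GITHUB;
        }
        return Source.OSV;
    }
}
```

Keyed this way, the mirror task would look up an existing record under the attributed source before inserting, which is what prevents the GHSA duplicates described above.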

@nscuro nscuro added this to the 4.6 milestone Jun 20, 2022
@VinodAnandan
Contributor

OSV does have its own findings (e.g. https://osv.dev/vulnerability/OSV-2022-217). OSV also supports an alias field: https://ossf.github.io/osv-schema/#aliases-field. I believe the OSV data may contain more useful information ("PURLs, expand version ranges to a full list of affected versions, etc."). @oliverchang please correct me if I'm wrong.

I think duplicate findings can be correlated and displayed as a unified finding from multiple sources. That way we can also identify data issues with a particular source.

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Member

@nscuro nscuro left a comment


@sahibamittal, have you by chance tested how this behaves when GHSA mirroring is enabled as well? Will the GHSA mirror task override the vulnerabilities created by OSV, and the other way around?

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Member

@nscuro nscuro left a comment


Some early feedback. Great work so far! But I'll need a little bit more time to complete the review and do some testing.

Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
Signed-off-by: Sahiba Mittal <sahibamittal98@gmail.com>
@sahibamittal sahibamittal requested a review from nscuro July 1, 2022 14:03
Signed-off-by: nscuro <nscuro@protonmail.com>
Signed-off-by: nscuro <nscuro@protonmail.com>
…fields

Signed-off-by: nscuro <nscuro@protonmail.com>
Additional changes:

* Rename `OsvVulnerability` to `OsvAffectedPackage` to avoid confusion
* Be more strict about ordering of range events

Signed-off-by: nscuro <nscuro@protonmail.com>
@nscuro
Member

nscuro commented Jul 2, 2022

As requested by @VinodAnandan, I made some changes to address some outstanding issues.

  • Fix Issue #931 : Support for Google OSV #1703 (comment)
  • Prevent infinite loops when parsing version ranges
  • Be a little bit more strict when parsing range events (always require an introduced event as specified in the spec)
  • For GitHub advisories, handle < constraints in addition to <=
  • Rename OsvVulnerability to OsvAffectedPackage to avoid confusion with other vulnerability types

Even before these changes, the code in this PR worked as intended. To demonstrate, using the OSV data, the internal analyzer was able to identify pretty much the same vulnerabilities that OSS Index found:

[Screenshot: Dependency-Track 4.4.2 showing the identified vulnerabilities]

I noticed a few general limitations however. Noting them here for completeness:

  1. Once OSV and NVD mirroring completes for the first time (OSV takes 20-30sec for me in total), there's a massive performance degradation in the OSV mirror task on following executions. The culprit is the VulnerabilityQueryManager#getVulnerabilityByVulnId method, more specifically the fetching of associated VulnerableSoftware instances (this happens implicitly). It takes multiple seconds per execution, and is executed for every single vulnerability. I'm counting a total of 352642 rows in my local VULNERABLESOFTWARE table. It's possible that there's an important index missing, but I need more time to investigate. Also, this bottleneck was not introduced in this PR, it just surfaces now. Resolved via c6c687e

  2. Parsing version ranges can really only be done on a best-effort basis. The affected[].ranges[].events array in OSV is not guaranteed to be sorted in any meaningful way. We just assume that now, because we have no other choice. Order of events is not required to answer the "is my version affected?" question (see evaluation algorithm), but because we're mirroring the database instead of using the OSV API we're always in danger of ingesting wrong or ambiguous data.

  3. Some vulnerability databases that OSV supports can't currently express all the constraints they need with OSV standard fields. They have to implement workarounds using non-standard fields, making interpretation of version ranges extra hard. We saw that with GitHub (Enable GHSA to consolidate affected entries ossf/osv-schema#35, Support limit and last_affected range events github/advisory-database#470), but there may be others.

  4. Ingesting affected data (VulnerableSoftware in DT) is currently an append-only operation. If an advisory "fixes" its affected version ranges by removing some of them, DT will not replicate this. Same goes for other corrective operations that may involve removing or modifying VulnerableSoftwares.

  5. More generally, fixing VulnerableSoftware entries after the fact is pretty much impossible and most likely requires deleting everything and having it re-populated by the respective mirroring tasks. Not the end of the world, but due to this I'm very cautious when it comes to ingesting data we can't predict reliably.
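To illustrate point 2: per the OSV evaluation algorithm, a version is affected if it falls between an `introduced` event and the next `fixed` event. A naive sketch (my own illustration, not the PR code; it assumes events have already been paired into ranges, which is exactly where the ordering ambiguity bites, and it only compares dotted numeric versions, whereas real ecosystems need their own version semantics):

```java
import java.util.List;

public class OsvRangeSketch {

    // One (introduced, fixed) pair; fixed == null means no fix event exists.
    record Range(String introduced, String fixed) {}

    // Naive comparison of dotted numeric versions, e.g. "1.2.0" vs "1.10".
    static int cmp(final String a, final String b) {
        final String[] as = a.split("\\."), bs = b.split("\\.");
        for (int i = 0; i < Math.max(as.length, bs.length); i++) {
            final int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            final int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }

    // Affected if some range has introduced <= version and
    // (no fixed event, or version < fixed). "0" means "from the start".
    static boolean isAffected(final String version, final List<Range> ranges) {
        for (final Range r : ranges) {
            final boolean afterIntroduced = "0".equals(r.introduced()) || cmp(version, r.introduced()) >= 0;
            final boolean beforeFixed = r.fixed() == null || cmp(version, r.fixed()) < 0;
            if (afterIntroduced && beforeFixed) return true;
        }
        return false;
    }
}
```

Note how nothing in this sketch needs the events to be ordered once they are paired; the risk is entirely in pairing them correctly from an unsorted `events` array during mirroring.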

Points 2 and 3 bother me the most. While my local testing looked good WRT identifying vulnerabilities, I think it's safe to say that OSV integration will not yield the accurate data we were hoping for, at least not right now.

If we end up merging this, we should make it really clear that this is a preview feature. Not due to the implementation, but because mapping from OSV to our data model is error-prone.

May be worth investigating whether just using OSV's API would be a better approach for now. That way we can use their documented evaluation algorithm, and only mirror vulnerabilities ad-hoc, like it's done for OSS Index already. I hacked together an MVP to demonstrate what I mean: nscuro@d9fc2ee

I guess my open question is: Are we fine with the uncertainties and inconsistencies that we may introduce by mirroring OSV?
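For reference, the ad-hoc approach would query OSV's documented POST /v1/query endpoint per component. A minimal sketch that only builds the request (sending and response parsing omitted; not taken from the linked MVP commit):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class OsvApiSketch {

    // Builds a query for all vulnerabilities affecting a package version,
    // per the OSV API (POST https://api.osv.dev/v1/query with a JSON body
    // containing the version and the package PURL).
    static HttpRequest buildQueryRequest(final String purl, final String version) {
        final String body = String.format(
                "{\"version\":\"%s\",\"package\":{\"purl\":\"%s\"}}", version, purl);
        return HttpRequest.newBuilder(URI.create("https://api.osv.dev/v1/query"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

The request would be sent through the proxy-aware client from HttpClientPool rather than a bare HttpClient, for the same reason discussed earlier in this thread.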

For some odd reason, the query generated by DataNucleus for fetching `VulnerableSoftware` is drastically less efficient when using the `VULNERABLESOFTWARE` `@FetchGroup` over lazy fetching via `Vulnerability#getVulnerableSoftware()`.

Query generated by fetch group:

```
SELECT 'org.dependencytrack.model.VulnerableSoftware' AS DN_TYPE,A1.CPE22,A1.CPE23,A1.EDITION,A1.ID AS NUCORDER0,A1."LANGUAGE",A1.OTHER,A1.PART,A1.PRODUCT,A1.PURL,A1.PURL_NAME,A1.PURL_NAMESPACE,A1.PURL_QUALIFIERS,A1.PURL_SUBPATH,A1.PURL_TYPE,A1.PURL_VERSION,A1.SWEDITION,A1.TARGETHW,A1.TARGETSW,A1."UPDATE",A1.UUID,A1.VENDOR,A1.VERSION,A1.VERSIONENDEXCLUDING,A1.VERSIONENDINCLUDING,A1.VERSIONSTARTEXCLUDING,A1.VERSIONSTARTINCLUDING,A1.VULNERABLE,A0.VULNERABILITY_ID FROM VULNERABLESOFTWARE_VULNERABILITIES A0 INNER JOIN VULNERABLESOFTWARE A1 ON A0.VULNERABLESOFTWARE_ID = A1.ID WHERE EXISTS (SELECT 'org.dependencytrack.model.Vulnerability' AS DN_TYPE,A0_SUB.ID AS DN_APPID FROM VULNERABILITY A0_SUB WHERE A0_SUB.SOURCE = 'NVD' AND A0_SUB.VULNID = 'CVE-2020-0404' AND A0.VULNERABILITY_ID = A0_SUB.ID) ORDER BY NUCORDER0
```

Query generated by `getVulnerableSoftware()`:

```
SELECT 'org.dependencytrack.model.VulnerableSoftware' AS DN_TYPE,A1.CPE22,A1.CPE23,A1.EDITION,A1.ID AS NUCORDER0,A1."LANGUAGE",A1.OTHER,A1.PART,A1.PRODUCT,A1.PURL,A1.PURL_NAME,A1.PURL_NAMESPACE,A1.PURL_QUALIFIERS,A1.PURL_SUBPATH,A1.PURL_TYPE,A1.PURL_VERSION,A1.SWEDITION,A1.TARGETHW,A1.TARGETSW,A1."UPDATE",A1.UUID,A1.VENDOR,A1.VERSION,A1.VERSIONENDEXCLUDING,A1.VERSIONENDINCLUDING,A1.VERSIONSTARTEXCLUDING,A1.VERSIONSTARTINCLUDING,A1.VULNERABLE FROM VULNERABLESOFTWARE_VULNERABILITIES A0 INNER JOIN VULNERABLESOFTWARE A1 ON A0.VULNERABLESOFTWARE_ID = A1.ID WHERE A0.VULNERABILITY_ID = ? ORDER BY NUCORDER0
```

Signed-off-by: nscuro <nscuro@protonmail.com>
@VinodAnandan
Contributor

I think there are pros and cons to each approach (individual calls vs batch/mirroring). Considering the current state of OSV.dev (new, relatively small VDB, no SLA, etc.), I believe the batch/mirroring approach is better for now, but we can revisit it later if needed. Also, the mirroring processing can be improved if we can get the data incrementally.

I don't believe the affected-data problem is specific to OSV.dev; any vulnerability database may contain inaccurate affected data, and we can't expect perfect affected data from any source. DT should support correcting the affected data over time by removing the append-only restriction: associate the affected data with a source, and allow that source to update the data if needed. Otherwise it may result in more FPs even if the vulnerability data source eventually corrects the data.

@nscuro
Member

nscuro commented Jul 18, 2022

@VinodAnandan Agreed. We need a way to track which source added a VulnerableSoftware entry, to then be able to remove it again once it's not reported by any source anymore. Corrections in VDBs happen frequently, and we need to consider that.

I don't believe the affected data problem is just associated with OSV.dev, any vulnerability database may contain inaccurate affected data, and we can't expect perfect affected data from any source.

+1. However, my main concern with OSV isn't that their data isn't super accurate, it's more that we may not be able to correctly parse / interpret the data they're providing (see the ranges issue above).

@oliverchang

+1. However, my main concern with OSV isn't that their data isn't super accurate, it's more that we may not be able to correctly parse / interpret the data they're providing (see the ranges issue above).

For affected[].ranges[].events, OSV.dev can perform a best effort sort when aggregating them. We have google/osv.dev#485 open for this.

For github/advisory-database#470: for vulnerabilities without a "fixed" event, we interpret that as every version being affected right now. This is not too big of an issue in practice, since GitHub usually tracks the patched version -- if one exists, there will be a "fixed" event; if not, it's most likely not fixed. Either way, GitHub has an issue open on their side to make use of the last_affected event (thanks @nscuro for filing that issue!)

@nscuro
Member

nscuro commented Jul 18, 2022

Thanks @oliverchang, google/osv.dev#485 would be really nice to have!

Member

@nscuro nscuro left a comment


Thanks again for all the work you put into this, @sahibamittal!

I'm going to merge this now, and I logged #1815 to address the VulnerableSoftware situation.

We still need documentation for the OSV datasource (similar to the GHSA one). The docs should also state that this is a preview feature. Could you provide a PR for this, or do you want us to do it?

@nscuro nscuro merged commit e2a5ad4 into DependencyTrack:master Jul 24, 2022
@nscuro nscuro removed this from the 4.6 milestone Jul 24, 2022
@sahibamittal
Contributor Author

Thanks Niklas!
I'll work on the documentation. 👍🏼

@sahibamittal sahibamittal deleted the google-osv-support branch July 25, 2022 11:23
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 25, 2022