[KF 1.0 Compliance] Vulnerability Scanning #3857

Closed
Bobgy opened this issue May 27, 2020 · 24 comments
Labels: area/deployment/kubeflow, kind/misc, lifecycle/frozen, priority/p1, status/triaged

Comments

@Bobgy
Contributor

Bobgy commented May 27, 2020

Part of #2884

Docker images must be scanned for vulnerabilities, and known vulnerabilities must be published.

@jlewi Do you know how other images share vulnerability issues?

I did a quick investigation: gcr.io provides vulnerability scanning, but the results are not visible to external visitors even if the image is public.

We can export the generated YAML report with commands like:

gcloud beta container images describe --show-package-vulnerability gcr.io/ml-pipeline/api-server:1.0.0-test-5

Documented in https://cloud.google.com/container-registry/docs/get-image-vulnerabilities

Do you think that's good enough?

@Bobgy Bobgy self-assigned this May 27, 2020
@Bobgy Bobgy added the status/triaged and kind/misc labels May 27, 2020
@Bobgy
Contributor Author

Bobgy commented May 29, 2020

@jbottum Do you have any ideas about this?

@jlewi
Contributor

jlewi commented May 29, 2020

kubeflow/kubeflow#3907 is tracking how we publish a list of vulnerabilities in our images.

A related issue is minimizing vulnerabilities e.g. by using distroless images.
There is documentation at
https://github.com/krishnadurai/community/blob/b1669588d785455a1e4e4cab456e03c08a05af7c/guidelines/creating_dockerfiles.md

Note that the use of distroless images is recommended, not required.

kubeflow/kubeflow#4590 is a related issue about promoting the use of distroless in Kubeflow to minimize vulnerabilities.

To satisfy the vulnerability scanning requirement I think you just need to turn on vulnerability scanning in whatever GCR registry you are hosting your images in.

You might want to repurpose this issue or file a new one for reducing vulnerabilities if relevant.

@Bobgy
Contributor Author

Bobgy commented Jun 1, 2020

@jlewi As reported in kubeflow/kubeflow#3907, if we enable GCR vulnerability scanning, the results are not visible to external viewers.
So in addition to that, we'd still need to dump a YAML report for each KFP release. Does that sound reasonable?

@Bobgy
Contributor Author

Bobgy commented Jun 1, 2020

Thanks for the relevant link on reducing vulnerabilities. I'll create a separate issue about it.

@stale

stale bot commented Aug 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Aug 30, 2020
@Bobgy
Contributor Author

Bobgy commented Aug 31, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Aug 31, 2020
@Bobgy
Contributor Author

Bobgy commented Oct 29, 2020

An example of fixing some vulnerability issues: #4531

some related readings:

My takeaways:

  • it's impossible to fix all vulnerabilities, and some (probably many) may actually be false positives, so we need to:
      • constantly update base images to get upstream fixes
      • if upstream has no fix yet, review the vulnerability to see whether it really matters to us, then act accordingly

going forward, we should:

  • utilize distroless images as much as possible (because they have near-zero vulnerabilities from the base image)
  • when that's not feasible, constantly update the base image to get vulnerability fixes
  • act ad hoc on the remaining high/critical vulnerabilities that people care about

@Bobgy
Contributor Author

Bobgy commented Oct 29, 2020

Action items (AIs):

  • formalize a vulnerability management process
  • understand current image vulnerability status and triage urgent fixes
  • build the vulnerability scanning automation we need: flag High and Critical issues before release (p1) and send a vulnerability report for each KFP release (p2).

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

Requests to reduce vulnerabilities come more often than before, so I'm taking some time to continue this.

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

Formalize a vulnerability management process

I think the process should have two parts:

  1. Set up a process to update dependencies/base images more frequently.
    This is already being addressed in [Project Health] Dependency upgrade process #4682

  2. Add an automated vulnerability policy check step in our CI/CD pipelines.
    In the pipeline, we'll unavoidably need to allowlist many CVEs (maybe even high/critical ones), because a fix may not have been released, the CVE may not be exploitable in the KFP use case, or the risk may be tolerable. We should add comments to this allowlist explaining the reasons, and mark some entries as TODOs.

I'll focus on part 2 in this issue.
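As a rough illustration of the "allowlist with reasons and TODOs" idea in part 2, here is a minimal Python sketch. Everything in it is hypothetical: the `AllowlistEntry` type, the placeholder CVE IDs, and the helper names are made up for illustration and are not part of KFP or any scanner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowlistEntry:
    cve_id: str
    reason: str         # why we tolerate this CVE
    todo: bool = False  # True if the entry should be revisited later


# Placeholder IDs and reasons, for illustration only.
ALLOWLIST = [
    AllowlistEntry("CVE-0000-0001", "no upstream fix released yet", todo=True),
    AllowlistEntry("CVE-0000-0002", "not exploitable in our use case"),
]


def validate_allowlist(entries):
    """Return a list of problems: duplicate IDs or entries with no reason."""
    problems = []
    seen = set()
    for entry in entries:
        if entry.cve_id in seen:
            problems.append(f"duplicate entry: {entry.cve_id}")
        seen.add(entry.cve_id)
        if not entry.reason.strip():
            problems.append(f"missing reason: {entry.cve_id}")
    return problems


def pending_todos(entries):
    """Return the CVE IDs that are marked for follow-up."""
    return [entry.cve_id for entry in entries if entry.todo]
```

A CI step could fail when `validate_allowlist` returns anything, so that no CVE gets allowlisted without a recorded reason.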

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

Research on tools suitable for this need:

  • Google Cloud provides vulnerability scanning in its Container Analysis service, but it only supplies the information we need; it lacks the tooling required to integrate into a CI/CD pipeline. https://cloud.google.com/container-analysis/docs/vulnerability-scanning
  • Kritis is a nice tool built by GCP: https://cloud.google.com/binary-authorization/docs/creating-attestations-kritis#check-only. It supports vulnerability policies like the following and integrates with data from Google Container Analysis:
    apiVersion: kritis.grafeas.io/v1beta1
    kind: VulnzSigningPolicy
    metadata:
      name: my-vsp
    spec:
      imageVulnerabilityRequirements:
        maximumFixableSeverity: MEDIUM
        maximumUnfixableSeverity: MEDIUM
        allowlistCVEs:
        - projects/goog-vulnz/notes/CVE-2020-10543
        - projects/goog-vulnz/notes/CVE-2020-10878
        - projects/goog-vulnz/notes/CVE-2020-14155
    

Using the two together seems to meet our basic needs.
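To make the policy above concrete, here is a small Python sketch of how such a check might be evaluated. It reflects my reading of the policy fields (a maximum severity for fixable CVEs, another for unfixable ones, plus an allowlist) and is not Kritis's actual implementation. The finding dictionaries and placeholder CVE IDs are assumptions, and plain CVE IDs stand in for the `projects/goog-vulnz/notes/...` names.

```python
# Severity levels from lowest to highest; the policy compares positions in this list.
# Assumption: only these four levels appear in the findings.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def exceeds(severity, maximum):
    """Return True if `severity` ranks strictly above `maximum`."""
    return SEVERITY_ORDER.index(severity) > SEVERITY_ORDER.index(maximum)


def violations(findings, max_fixable, max_unfixable, allowlist):
    """Return the CVE IDs that break the policy, in input order."""
    violating = []
    for finding in findings:
        if finding["cve"] in allowlist:
            continue  # explicitly tolerated
        limit = max_fixable if finding["fixable"] else max_unfixable
        if exceeds(finding["severity"], limit):
            violating.append(finding["cve"])
    return violating


# Hypothetical findings with placeholder IDs.
FINDINGS = [
    {"cve": "CVE-0000-0001", "severity": "HIGH", "fixable": True},
    {"cve": "CVE-0000-0002", "severity": "MEDIUM", "fixable": True},
    {"cve": "CVE-0000-0003", "severity": "CRITICAL", "fixable": False},
    {"cve": "CVE-0000-0004", "severity": "HIGH", "fixable": False},
]
```

For example, `violations(FINDINGS, "MEDIUM", "HIGH", {"CVE-0000-0001"})` skips the allowlisted entry and reports only the unfixable CRITICAL one.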

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

There seem to be similar open source tools like https://github.com/arminc/clair-scanner, but it requires running your own vulnerability server. It's more convenient to use the GCP container analysis service directly.

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

A bit more research led me to https://github.com/aquasecurity/trivy. It seems to be the leading open source option.
It has some nice extra features:

  1. a local CLI for exploration -- it can group CVEs by library type:

    $ trivy image knqyf263/vuln-image:1.2.3
    2019-05-16T12:59:03.150+0900    INFO    Detecting Alpine vulnerabilities...
    2019-05-16T12:59:04.941+0900    INFO    Detecting bundler vulnerabilities...
    2019-05-16T12:59:05.967+0900    INFO    Detecting cargo vulnerabilities...
    2019-05-16T12:59:07.834+0900    INFO    Detecting composer vulnerabilities...
    2019-05-16T12:59:10.285+0900    INFO    Detecting npm vulnerabilities...
    2019-05-16T12:59:11.487+0900    INFO    Detecting pipenv vulnerabilities...
    
    knqyf263/vuln-image:1.2.3 (alpine 3.7.1)
    ========================================
    Total: 26 (UNKNOWN: 0, LOW: 3, MEDIUM: 16, HIGH: 5, CRITICAL: 2)
    
    +---------+------------------+----------+-------------------+---------------+----------------------------------+
    | LIBRARY | VULNERABILITY ID | SEVERITY | INSTALLED VERSION | FIXED VERSION |              TITLE               |
    +---------+------------------+----------+-------------------+---------------+----------------------------------+
    | curl    | CVE-2018-14618   | CRITICAL | 7.61.0-r0         | 7.61.1-r0     | curl: NTLM password overflow     |
    |         |                  |          |                   |               | via integer overflow             |
    +         +------------------+----------+                   +---------------+----------------------------------+
    |         | CVE-2018-16839   | HIGH     |                   | 7.61.1-r1     | curl: Integer overflow leading   |
    |         |                  |          |                   |               | to heap-based buffer overflow in |
    |         |                  |          |                   |               | Curl_sasl_create_plain_message() |
    +         +------------------+          +                   +---------------+----------------------------------+
    |         | CVE-2019-3822    |          |                   | 7.61.1-r2     | curl: NTLMv2 type-3 header       |
    |         |                  |          |                   |               | stack buffer overflow            |
    +         +------------------+          +                   +---------------+----------------------------------+
    |         | CVE-2018-16840   |          |                   | 7.61.1-r1     | curl: Use-after-free when        |
    |         |                  |          |                   |               | closing "easy" handle in         |
    |         |                  |          |                   |               | Curl_close()                     |
    +         +------------------+----------+                   +               +----------------------------------+
    |         | CVE-2018-16842   | MEDIUM   |                   |               | curl: Heap-based buffer          |
    |         |                  |          |                   |               | over-read in the curl tool       |
    |         |                  |          |                   |               | warning formatting               |
    +         +------------------+          +                   +---------------+----------------------------------+
    |         | CVE-2018-16890   |          |                   | 7.61.1-r2     | curl: NTLM type-2 heap           |
    |         |                  |          |                   |               | out-of-bounds buffer read        |
    +         +------------------+          +                   +               +----------------------------------+
    |         | CVE-2019-3823    |          |                   |               | curl: SMTP end-of-response       |
    |         |                  |          |                   |               | out-of-bounds read               |
    +---------+------------------+----------+-------------------+---------------+----------------------------------+
    | git     | CVE-2018-17456   | HIGH     | 2.15.2-r0         | 2.15.3-r0     | git: arbitrary code execution    |
    |         |                  |          |                   |               | via .gitmodules                  |
    +         +------------------+          +                   +               +----------------------------------+
    |         | CVE-2018-19486   |          |                   |               | git: Improper handling of        |
    |         |                  |          |                   |               | PATH allows for commands to be   |
    |         |                  |          |                   |               | executed from...                 |
    +---------+------------------+----------+-------------------+---------------+----------------------------------+
    ...
    
  2. there are existing GitHub Actions that use Trivy: https://github.com/Azure/container-scan

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

For reference, vulnerability vector description:
https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

An experimental feature of Trivy is to use user-defined Open Policy Agent (OPA) policies as checkers for the vulnerabilities.
They can be used to filter based on the vulnerability vector.
Examples include:

  • ignore all vulnerabilities that cannot be exploited via network
  • ignore those that can only be exploited with root permission
  • ...

So it can reduce the number of vulnerabilities we need to check, based on our specific environment requirements.
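To show what filtering on the vulnerability vector means in practice, here is a hedged Python sketch that parses a CVSS v3 vector string and checks the attack vector (`AV`) and privileges required (`PR`) metrics. Trivy does this kind of filtering with OPA policies; the Python below is only a stand-in to illustrate the logic, and the function names are made up.

```python
def parse_cvss_vector(vector):
    """Split a CVSS v3 vector such as 'CVSS:3.1/AV:N/AC:L/...' into a dict of metrics."""
    metrics = {}
    for part in vector.split("/"):
        key, _, value = part.partition(":")
        metrics[key] = value
    return metrics


def network_exploitable(vector):
    """True if the attack vector metric is Network (AV:N)."""
    return parse_cvss_vector(vector).get("AV") == "N"


def needs_high_privileges(vector):
    """True if exploitation requires high privileges (PR:H)."""
    return parse_cvss_vector(vector).get("PR") == "H"


def worth_reviewing(vector):
    """Keep only network-exploitable CVEs that do not require high privileges."""
    return network_exploitable(vector) and not needs_high_privileges(vector)
```

Whether these two rules fit a given deployment is a judgment call; the point is only that the vector carries enough structure to express such rules.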

References:

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

EDIT: what's described below doesn't work well, because the result of gcloud beta container images describe --show-package-vulnerability gcr.io/ml-pipeline/api-server:1.0.0-test-5 --format=json does not provide information on vulnerability vector.

Open Policy Agent is in fact a generic tool:

inputs: "JSON" and "Policy"
output: "pass?"

So we could just use it with GCR vulnerability scanning to get the best of both: flexibility and a GCP-managed service.

==

Alternatively, we can just write a script that checks the vulnerability JSON against our own policy.

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

Analysis of Options

Trivy

  • Onboarding cost: low (download a binary and run it)
  • Vulnerability DB confidence: unknown (the DB is maintained by a third party, although it claims to draw on common sources such as NVD)
  • Configuration flexibility: high (especially with OPA)
  • Momentum: high (6k stars, 18 PRs merged and 12 issues closed last month -- at time of evaluation)

Kritis

  • Onboarding cost: low (there are official docs for using it in Cloud Build; it's a container)
  • Vulnerability DB confidence: very high (it uses GCP image scanning)
  • Configuration flexibility: medium (allowlist + filter by [fixable, severity])
  • Momentum: low (the repo has had no recent activity)

Other options look obviously worse than these two, so I'm leaving them out.

Note that OPA has a learning curve, because there's a new language to learn, so I'd prefer we stay away from it initially. Without OPA, Trivy's major advantage does not apply to us.

I think we can start with Kritis. If it works as is, we can defer further customization until we really need it.
If we discover blocking bugs, we can revisit Trivy as a backup plan.

@shawnzhu
Member

shawnzhu commented Jan 31, 2021

I'm interested in this issue. Speaking of Trivy, it supports filtering vulnerabilities through several options besides OPA:

  1. --severity - https://github.com/aquasecurity/trivy#filter-the-vulnerabilities-by-severities
  2. .trivyignore (ignore specific vulnerabilities) - https://github.com/aquasecurity/trivy#ignore-the-specified-vulnerabilities
  3. --skip-files - https://github.com/aquasecurity/trivy#skip-traversal-of-the-specific-files
  4. --skip-dirs - https://github.com/aquasecurity/trivy#skip-traversal-in-the-specific-directory

The lack of activity in Kritis might be a problem, but I'm willing to give it a try since I haven't used it before.

@Bobgy
Contributor Author

Bobgy commented Jan 31, 2021

@shawnzhu You are right.

I didn't make it clear that my main reason for preferring Kritis is that it uses GCP container scanning as its data source (in fact, it reads GCP container scanning results directly, so you cannot use it outside GCP).

@Bobgy
Contributor Author

Bobgy commented Feb 1, 2021

Some notes after experimenting with Kritis:

  • Although the official sample is in Cloud Build, I found it much faster, in terms of development speed, to write a KFP pipeline that runs the vulnerability checks with Kritis
  • Kritis does not output structured information for vulnerability check results; we can only read its logs, for example:

    E0201 01:43:02.099893 1 main.go:211] found fixable CVE <redacted> in gcr.io/<redacted>, which has severity HIGH exceeding max fixable severity MEDIUM
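
Since the results only appear in logs, one possible workaround is to recover structure with a regular expression. The Python sketch below is based solely on the single log line quoted above; the exact wording of other Kritis messages (for instance for unfixable CVEs) is an assumption, and the function name is made up.

```python
import re

# Pattern derived from the one sample line above; the "unfixable" variant is assumed.
VIOLATION_RE = re.compile(
    r"found (?P<kind>fixable|unfixable) CVE (?P<cve>\S+) in (?P<image>\S+), "
    r"which has severity (?P<severity>\w+) "
    r"exceeding max (?:fixable|unfixable) severity (?P<limit>\w+)"
)


def parse_violation(line):
    """Return a dict describing the violation, or None if the line does not match."""
    match = VIOLATION_RE.search(line)
    return match.groupdict() if match else None


# Placeholder CVE ID and image name, for illustration only.
SAMPLE = (
    "E0201 01:43:02.099893 1 main.go:211] found fixable CVE CVE-0000-0001 "
    "in gcr.io/example/image:tag, which has severity HIGH "
    "exceeding max fixable severity MEDIUM"
)
```

Calling `parse_violation` on each log line and collecting the non-`None` results would give a list that can be dumped as JSON for a release report.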

@Bobgy
Contributor Author

Bobgy commented Feb 1, 2021

I built a KFP pipeline that runs Kritis: #5066.
For now, this is a one-off pipeline I use to verify existing released images.

P1: The next step would be to maintain a long-running KFP test cluster and run that pipeline as one of the post-submit tests.

@davidspek
Contributor

There seems to be similar open source tools like https://github.com/arminc/clair-scanner, but it requires running your own vulnerability server. It's more convenient to use GCP container analysis service directly.

@Bobgy I think this is a better link: https://github.com/quay/clair. Clair is what Amazon ECR uses: https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html.

@chensun chensun moved this to Post-CP12 in KFP v2 Feb 27, 2023
@chensun chensun added this to KFP v2 Feb 27, 2023
@rimolive
Member

Security WG created a vulnerability scan for all Kubeflow images, including pipelines. This issue is not needed anymore.

/close


@rimolive: Closing this issue.

