Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conformance tests to validate OCI devices are disallowed #2973

Closed
dgerd opened this issue Jan 23, 2019 · 8 comments · Fixed by #4074
Closed

Add conformance tests to validate OCI devices are disallowed #2973

dgerd opened this issue Jan 23, 2019 · 8 comments · Fixed by #4074
Assignees
Labels
area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/spec Discussion of how a feature should be exposed to customers.
Milestone

Comments

@dgerd
Copy link

dgerd commented Jan 23, 2019

Expected Behavior

Only default OCI devices are allowed by the knative runtime contract. We should error when a non-default device is requested.

https://github.com/knative/serving/blob/master/docs/runtime-contract.md#devices

Actual Behavior

We need a conformance test to validate this behavior.

@knative-prow-robot knative-prow-robot added area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/good-first-issue kind/spec Discussion of how a feature should be exposed to customers. labels Jan 23, 2019
@dgerd
Copy link
Author

dgerd commented Feb 21, 2019

/assign @dgerd

@dgerd
Copy link
Author

dgerd commented Mar 7, 2019

Looking at how this works today, Kubernetes does not support passing Devices through directly to the OCI interface ( see kubernetes/kubernetes#5607 ).

There is support for specifying devices using Device Plugins ( https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#examples ), but these seem to be enabled on the kubelet and not on the Pod. Device Plugins have custom ResourceNames that can be requested in the resource block similar to cpu and memory. However, looking at examples, when resources are not requested they still may be made available to the container (https://github.com/NVIDIA/k8s-device-plugin#running-gpu-jobs).

This results in the following questions from the statement in the runtime contract:

  1. It is unclear if we require Knative to fulfill the OCI Contract with regards to Default Devices. Should we validate that Default Devices are always present?
  2. There is no mention of Platform Provider or Operator. Can a "Platform Provider" or "Operator" make additional devices (not part of the "Default Devices" list) available such as GPUs or Network devices? If so, how should a Developer signal that they need one or are they assumes to be always present?

My current understanding of the problem and the work to meet the requirements is:

  1. We should only allow CPU and Memory to be specified in the resources block (i.e. No custom ResourceNames to request additional devices).
  2. We should validate via Conformance that any other resources provided are blocked by the webhook.

@evankanderson Any thoughts here?


https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#default-devices

@evankanderson
Copy link
Member

Hmm, I'm wondering whether we should change that MUST to a SHOULD (thinking specifically about the GPU accelerator case).

In any case, I think the developer "MUST" indicates that containers which violate those expectations may experience undefined behavior (aka "nasal demons").

@dgerd
Copy link
Author

dgerd commented Mar 14, 2019

The problem I have with nasal demons on MUST/SHOULD NOT statements is that they become extremely hard to test in the context of conformance. If an instance of Knative allows this behavior is it still conformant? My assumption is yes it is, but there is no useful test that I can come up with here as allowing and disallowing both seem valid.

If we believe that GPUs are an important use-case to cover then I am not sure that adding a test or a webhook validation is the right step forward, but rather I think this issue should be to remove the statement from the runtime contract or move the statement to a new document that contains developer best practices.

@evankanderson
Copy link
Member

I'm not sure how to capture this in a testable way, but the intent of the developer requirements is to place "outer bounds" on what conformance should test. I.e. a conformance test could validate the following devices per the OCI spec:

/dev/

  • null
  • /dev/zero
  • /dev/full
  • /dev/random
  • /dev/urandom
  • /dev/tty
  • /dev/console is set up if terminal is enabled in the config by bind mounting the pseudoterminal slave to /dev/console.
  • /dev/ptmx. A bind-mount or symlink of the container's /dev/pts/ptmx.

It would be unreasonable for a test to attempt to run CUDA code, as a conformant implementation might not expose the appropriate device (or even have access to the appropriate nvidia hardware). I'd certainly be open to rephrasing this area and other areas of the specification if you have a suggestion about how to make this testable.

@dgerd
Copy link
Author

dgerd commented Mar 15, 2019

Looking at your example it seems like we could turn the statement around to express requirements of the platform rather than requirements of the developer. Statements about the platform seem easier to grok and programmatically verify in the context of conformance. Is there a difference in your mind between the two statements below given the context above:

OCI "Default" Devices MUST be provided

and

Developers SHOULD NOT use OCI devices to request additional devices beyond the OCI specification "Default Devices".

Given that we will not actively block or limit device specification in our API, both statements to me leave the requirement of exposing additional OCI devices up to the platform provider.

@mattmoor mattmoor modified the milestones: Serving 0.5, Serving 0.6 Mar 19, 2019
@evankanderson
Copy link
Member

evankanderson commented Mar 22, 2019 via email

dgerd pushed a commit to dgerd/serving that referenced this issue May 7, 2019
This change makes numerous cleanups to the runtime contract in an
attempt to improve the readability of the document and make the document
more useful for the intended auidence.

* Moves developer facing statements to a new `runtime-user-guide`.
Focuses `runtime-contract` on operator/platform-provider.
* Add links to Conformance tests that test Runtime Contract statements.
* Corrects, updates, or removes statements to more accurately represent
today's Knative runtime.
* Updates to informative or removes most untestable statements
* Copies in important OCI runtime requirements we previously referenced
* Removes reference to OCI specification that didn't bring new
requirements.

Ref: knative#2539, knative#2973, knative#4014, knative#4027
@dgerd
Copy link
Author

dgerd commented May 8, 2019

#4035 updates this to the wording suggested by Evan. We now need to implement the tests that check the presence of the default devices.

dgerd pushed a commit to dgerd/serving that referenced this issue May 10, 2019
This adds checks for the default OCI devices to our conformance test for
filesystem validation. This test also refactors where the file paths to
check are located to reduce the number of transformations and simplify
adding additional paths.

Fixes knative#2973
knative-prow-robot pushed a commit that referenced this issue May 10, 2019
* Add additional filesystem checks for OCI devices

This adds checks for the default OCI devices to our conformance test for
filesystem validation. This test also refactors where the file paths to
check are located to reduce the number of transformations and simplify
adding additional paths.

Fixes #2973

* Fix comments

* Code review comments
JRBANCEL pushed a commit to JRBANCEL/serving that referenced this issue May 29, 2019
* Add additional filesystem checks for OCI devices

This adds checks for the default OCI devices to our conformance test for
filesystem validation. This test also refactors where the file paths to
check are located to reduce the number of transformations and simplify
adding additional paths.

Fixes knative#2973

* Fix comments

* Code review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features kind/spec Discussion of how a feature should be exposed to customers.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants