Extension to support new compute resources #368

Closed
vishh opened this issue Jul 30, 2017 · 15 comments · Fixed by kubernetes/kubernetes#51660
Assignees
Labels
area/hw-accelerators
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/node Categorizes an issue or PR as relevant to SIG Node.
stage/beta Denotes an issue tracking an enhancement targeted for Beta status
Milestone

Comments

@vishh
Contributor

vishh commented Jul 30, 2017

Feature Description

  • One-line feature description (can be used as a release note):
    A new extension point at the node level to surface, schedule, and manage the lifecycle of new compute resources (an illustrative consumption sketch follows this list).
  • Primary contact (assignee): jiayingz@
  • Responsible SIGs: node
  • Design proposal link (community repo): Device Plugin Design Proposal community#695
  • Reviewer(s) - (for LGTM) recommend having 2+ reviewers (at least one from code-area OWNERS file) agreed to review. Reviewers from multiple companies preferred: derekwaynecarr@, jiayingz@
  • Approver (likely from SIG/area to which feature belongs): vishh@
  • Feature target (which target equals which milestone):
    • Alpha release target (1.8)
    • Beta release target (1.9), actual (1.10)
    • Stable release target (1.11)
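
For illustration only, a minimal Go sketch of what consuming such a resource looks like from the workload side, using the `k8s.io/api` and `k8s.io/apimachinery` client types; the `nvidia.com/gpu` name is just one example of a vendor resource a device plugin might advertise:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A pod requesting one unit of a vendor-specific compute resource.
	// Device plugins advertise such resources to the kubelet, which reports
	// them in the node's Capacity/Allocatable for the scheduler to consume.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-example"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "cuda-workload",
				Image: "nvidia/cuda:latest", // illustrative image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resources are requested in whole units under
						// limits (requests default to the same value).
						corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("1"),
					},
				},
			}},
		},
	}
	fmt.Println(pod.Spec.Containers[0].Resources.Limits)
}
```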
@vishh vishh added this to the 1.8 milestone Jul 30, 2017
@vishh vishh changed the title Alpha extension to support new compute resources Extension to support new compute resources Jul 30, 2017
@jdumars
Member

jdumars commented Aug 1, 2017

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Aug 1, 2017
@vishh vishh added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Aug 1, 2017
@RenaudWasTaken

RenaudWasTaken commented Aug 10, 2017

Progress Tracker

@fabiand

fabiand commented Aug 11, 2017

@RenaudWasTaken should the resource class design also be part of this tracker? Just asking because I don't see it in the progress tracker.

@RenaudWasTaken

RenaudWasTaken commented Aug 11, 2017

@fabiand the resource class design is a beta (1.9 and up) feature :)

My progress tracker is designed to track all the alpha (1.8) features for now.
I'll update it later (after alpha) once we agree on the features we want to see in beta.

@idvoretskyi
Member

@vishh @kubernetes/sig-node-feature-requests unfortunately, @jiayingz can't be assigned to the feature as he is not a member of the Kubernetes org. Can someone shadow him in this role?

@vishh
Contributor Author

vishh commented Aug 14, 2017

I'm shepherding this feature. So assigning to myself for now.

@vishh vishh self-assigned this Aug 14, 2017
@idvoretskyi
Member

idvoretskyi commented Aug 14, 2017 via email

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Aug 17, 2017
Automatic merge from submit-queue (batch tested with PRs 49342, 50581, 50777)

Device Plugin Protobuf API

**What this PR does / why we need it:**
This implements the Device Plugin API

- Design document: kubernetes/community#695
- PR tracking: kubernetes/enhancements#368 (comment)

**Special notes for your reviewer**:

This is the first proposal submitted to the community repo; please advise if anything is off with the format or procedure.
@vishh @derekwaynecarr

**Release note:**
```
NONE
```
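The protobuf API above defines two gRPC services: a `Registration` service that the kubelet serves on a unix socket under `/var/lib/kubelet/device-plugins/`, and a `DevicePlugin` service (`ListAndWatch`, `Allocate`) that each plugin serves. As an illustration only, here is a minimal Go sketch of the registration call, assuming the `v1beta1` package path the API later stabilized under (`k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1`) and using the since-deprecated `grpc.WithInsecure`/`grpc.WithDialer` options for brevity:

```go
package main

import (
	"context"
	"log"
	"net"
	"path/filepath"
	"time"

	"google.golang.org/grpc"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// register announces a device plugin's gRPC endpoint and resource name to the
// kubelet over its unix socket; the kubelet then calls back into the plugin's
// own DevicePlugin service (ListAndWatch, Allocate).
func register(endpoint, resourceName string) error {
	conn, err := grpc.Dial(pluginapi.KubeletSocket, grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}))
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     filepath.Base(endpoint), // socket name inside /var/lib/kubelet/device-plugins/
		ResourceName: resourceName,            // e.g. "vendor-domain/resource"
	})
	return err
}

func main() {
	if err := register("/var/lib/kubelet/device-plugins/example.sock", "example.com/widget"); err != nil {
		log.Fatal(err)
	}
}
```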
tjfontaine pushed a commit to oracle/kubernetes that referenced this issue Aug 21, 2017
Automatic merge from submit-queue (batch tested with PRs 50531, 50853, 49976, 50939, 50607)

Updated gRPC vendoring to support Keep Alive

**What this PR does / why we need it**:

This PR bumps the vendored version of gRPC from v1.0.4 to v1.5.1.
This is needed for the Device Plugin API, where we expect the client and server to use the keepalive feature to detect errors.

Unfortunately I also had to bump the versions of `golang.org/x/text` and `golang.org/x/net`.

- Design document: kubernetes/community#695
- PR tracking: kubernetes/enhancements#368 (comment)

**Special notes for your reviewer**:
@vishh @jiayingz 

**Release note**:
```
Bumped gRPC from v1.0.4 to v1.5.1
```
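For context on why keepalive matters here: with it enabled, either side of the device plugin gRPC connection notices a dead peer instead of blocking indefinitely. A minimal, hedged Go sketch of configuring client-side keepalive via `google.golang.org/grpc/keepalive` (the target address and interval values are made up for illustration):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client-side keepalive: ping the server periodically so a dead peer is
	// detected instead of the connection appearing healthy forever.
	conn, err := grpc.Dial("localhost:50051", // illustrative address
		grpc.WithInsecure(), // plaintext for the sketch; deprecated in newer gRPC
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // ping after 10s of inactivity
			Timeout:             5 * time.Second,  // fail if no ack within 5s
			PermitWithoutStream: true,             // ping even with no active RPC
		}))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	log.Println("connection created with keepalive configured")
}
```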
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Aug 24, 2017
Automatic merge from submit-queue (batch tested with PRs 51193, 51154, 42689, 51189, 51200)

Bumped gRPC version to 1.3.0

**What this PR does / why we need it**:

This PR bumps the vendored version of gRPC down from v1.5.1 to v1.3.0.
This is needed for the Device Plugin API, where we expect the client and server to use the keepalive feature to detect errors.

Unfortunately I also had to bump the versions of `golang.org/x/text` and `golang.org/x/net`.

- Design document: kubernetes/community#695
- PR tracking: kubernetes/enhancements#368 (comment)

**Which issue this PR fixes**: fixes #51099, which was caused by my previous PR updating to v1.5.1.

**Special notes for your reviewer**:
@vishh @jiayingz @shyamjvs

**Release note**:
```
Bumped gRPC to v1.3.0
```
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Sep 2, 2017
Automatic merge from submit-queue (batch tested with PRs 51590, 48217, 51209, 51575, 48627)

Deviceplugin jiayingz

**What this PR does / why we need it**:
This PR implements the kubelet Device Plugin Manager.
It includes four commits implemented by @RenaudWasTaken and a commit that supports allocation.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #
Design document: kubernetes/community#695
PR tracking: kubernetes/enhancements#368

**Special notes for your reviewer**:

**Release note**:

```release-note
Extending the Kubelet to support device plugins
```
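To complement the kubelet-side manager above, a hedged sketch of the plugin half of the protocol it consumes: the `ListAndWatch` stream that advertises device inventory and health. It again assumes the later `v1beta1` package path; `Allocate`, `GetDevicePluginOptions`, and the rest of the `DevicePluginServer` interface are omitted, so this is an illustrative fragment rather than a complete plugin.

```go
package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// exampleServer holds a static device inventory. A real plugin implements the
// full pluginapi.DevicePluginServer interface and serves it over a unix socket.
type exampleServer struct {
	devices []*pluginapi.Device
}

// ListAndWatch sends the current device list to the kubelet and is expected to
// resend it whenever device health changes; the kubelet uses this stream to
// keep the node's Capacity/Allocatable up to date.
func (s *exampleServer) ListAndWatch(_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: s.devices}); err != nil {
		return err
	}
	select {} // block; a real plugin would watch hardware health and resend here
}

func main() {
	s := &exampleServer{devices: []*pluginapi.Device{
		{ID: "widget-0", Health: pluginapi.Healthy},
		{ID: "widget-1", Health: pluginapi.Healthy},
	}}
	for _, d := range s.devices {
		fmt.Printf("advertising device %s (%s)\n", d.ID, d.Health)
	}
}
```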
@RenaudWasTaken

Pretty sure this should not be closed :D
@vishh ^

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Nov 13, 2017
Automatic merge from submit-queue (batch tested with PRs 54826, 53576, 55591, 54946, 54825). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Run nvidia-gpu device-plugin daemonset as an addon on GCE nodes that have nvidia GPUs attached

- Instead of the old `Accelerators` feature that added the `alpha.kubernetes.io/nvidia-gpu` resource, use the new `DevicePlugins` feature that adds vendor-specific resources. (In the case of NVIDIA GPUs it will add the `nvidia.com/gpu` resource.)

- Add a node label to GCE nodes with accelerators attached. This is the same label that GKE attaches to node pools with accelerators attached. (For example, for an nvidia-tesla-p100 GPU the label would be `cloud.google.com/gke-accelerator=nvidia-tesla-p100`.) This will help us target accelerator-specific daemonsets etc. to these nodes, as sketched below.

- Run the nvidia-gpu device-plugin daemonset as an addon on GCE nodes that have NVIDIA GPUs attached.

- Some minor documentation improvements in the addon manager.

**Release note**:
```release-note
GCE nodes with NVIDIA GPUs attached now expose `nvidia.com/gpu` as a resource instead of `alpha.kubernetes.io/nvidia-gpu`.
```

/sig cluster-lifecycle
/sig scheduling
/area hw-accelerators

kubernetes/enhancements#368
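
As a small illustration of the node-label targeting described above, a hedged Go sketch of a pod spec constrained to accelerator nodes via `nodeSelector`; the label value and container image are placeholders, and the actual addon manifest may target the label differently (for example by key presence rather than an exact value):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Constrain scheduling to GCE/GKE nodes carrying the accelerator label,
	// mirroring how accelerator-specific daemonsets can be targeted.
	spec := corev1.PodSpec{
		NodeSelector: map[string]string{
			"cloud.google.com/gke-accelerator": "nvidia-tesla-p100", // illustrative value
		},
		Containers: []corev1.Container{{
			Name:  "nvidia-device-plugin",
			Image: "registry.example/nvidia-device-plugin:latest", // placeholder image
		}},
	}
	fmt.Println("nodeSelector:", spec.NodeSelector)
}
```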
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Nov 14, 2017
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Extend test/e2e/scheduling/nvidia-gpus.go to track resource usage of installer and device plugin containers

To support this, export certain functions and fields in framework/resource_usage_gatherer.go so that it can be used in any e2e test to track the resource usage of any specified pods with the specified probe interval and duration.



**What this PR does / why we need it**:
We need to quantify the resource usage of the device plugin DaemonSet to make sure it can run reliably on nodes with GPUs.
We also want to measure GPU driver installer resource usage to track any unexpected resource consumption during driver installation.
For the latter part, see the related issue kubernetes/enhancements#368.

Example resource summary output:
Oct  6 12:35:07.289: INFO: Printing summary: ResourceUsageSummary
Oct  6 12:35:07.289: INFO: ResourceUsageSummary JSON
{
  "100": [
    {
      "Name": "nvidia-device-plugin-6kqxp/nvidia-device-plugin",
      "Cpu": 0.000507167,
      "Mem": 2134016
    },
    {
      "Name": "nvidia-device-plugin-6kqxp/nvidia-driver-installer",
      "Cpu": 1.915508718,
      "Mem": 663330816
    },
    {
      "Name": "nvidia-device-plugin-l28zc/nvidia-device-plugin",
      "Cpu": 0.000836256,
      "Mem": 2211840
    },
    {
      "Name": "nvidia-device-plugin-l28zc/nvidia-driver-installer",
      "Cpu": 1.916886293,
      "Mem": 691449856
    },
    {
      "Name": "nvidia-device-plugin-xb4vh/nvidia-device-plugin",
      "Cpu": 0.000515103,
      "Mem": 2265088
    },
    {
      "Name": "nvidia-device-plugin-xb4vh/nvidia-driver-installer",
      "Cpu": 1.909435982,
      "Mem": 832430080
    }
  ],
  "50": [
    {
...

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #

**Special notes for your reviewer**:

**Release note**:

```release-note
```
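The percentile-keyed summary above is plain JSON, so it can be post-processed outside the e2e framework; a hedged Go sketch that parses a small fragment of it (the `usageSample` struct is ad hoc for this sketch, not a type from the framework):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// usageSample mirrors the fields visible in the summary above; percentile keys
// ("50", "100", ...) map to per-container samples.
type usageSample struct {
	Name string  `json:"Name"`
	Cpu  float64 `json:"Cpu"` // cores
	Mem  int64   `json:"Mem"` // bytes
}

func main() {
	raw := `{"100":[{"Name":"nvidia-device-plugin-6kqxp/nvidia-device-plugin","Cpu":0.000507167,"Mem":2134016}]}`

	var summary map[string][]usageSample
	if err := json.Unmarshal([]byte(raw), &summary); err != nil {
		log.Fatal(err)
	}
	for percentile, samples := range summary {
		for _, s := range samples {
			fmt.Printf("p%s %-60s cpu=%.4f cores mem=%.1f MiB\n",
				percentile, s.Name, s.Cpu, float64(s.Mem)/(1<<20))
		}
	}
}
```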
@WanLinghao

@RenaudWasTaken @vishh Hello, I would like to pick up the e2e test if no one has done so already.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Dec 20, 2017
Automatic merge from submit-queue (batch tested with PRs 56681, 57384). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Deprecate the alpha Accelerators feature gate.

Encourage people to use DevicePlugins instead.

/kind cleanup

Related to kubernetes/enhancements#192 and kubernetes/enhancements#368

**Release note**:
```release-note
The alpha Accelerators feature gate is deprecated and will be removed in v1.11. Please use device plugins instead. They can be enabled using the DevicePlugins feature gate.
```

/sig node
/sig scheduling
/area hw-accelerators
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2018
@derekwaynecarr
Member

Device plugins graduated to Beta in 1.10.

@derekwaynecarr derekwaynecarr added stage/beta Denotes an issue tracking an enhancement targeted for Beta status and removed stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status labels Mar 27, 2018
@justaugustus
Member

@vishh @jiayingz @derekwaynecarr @kubernetes/sig-node-feature-requests
Any plans for this in 1.11?

If so, can you please ensure the feature is up-to-date with the appropriate:

  • Description
  • Milestone
  • Assignee(s)
  • Labels:
    • stage/{alpha,beta,stable}
    • sig/*
    • kind/feature

cc @idvoretskyi

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 17, 2018
@vishh
Contributor Author

vishh commented Apr 17, 2018

We now have device plugins in Beta, quota supports extended resources, and the scheduler supports them as well. Scheduler extensions have been proven to work with extended resources.
The main missing feature is exposing metrics (`kubectl top`), which would require making Heapster support extended compute resources. There is an effort in the v1.11 time frame to make the node monitoring subsystem extensible; ideally this would be tackled as part of that effort. cc @dashpole @mindprince @RenaudWasTaken

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 17, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
