GangCache Design #5

Open
wants to merge 49 commits into
base: gang
b21ea49
feat: add read only port support for koordlet (#320)
LambdaHJ Jun 29, 2022
a49ab45
add reconciler for runtime hook standalone work mode (#319)
zwzhang0107 Jun 29, 2022
a6b005f
add cpuset allocator (#324)
zwzhang0107 Jun 30, 2022
1fa6ec8
Fix wrong cgroup path for PLEG (#325)
cheimu Jun 30, 2022
37a3aec
add logs for proxy server (#329)
zwzhang0107 Jun 30, 2022
52a06f1
chore: remove useless feature-gates (#336)
saintube Jul 1, 2022
f89c582
ci: enable CGO when compiling binary (#334)
jasonliu747 Jul 4, 2022
b8dd567
rename resourceQoS to resourceQOS (#339)
zwzhang0107 Jul 5, 2022
6ac04d4
improve koordlet log verbosity (#338)
saintube Jul 5, 2022
54ed9a5
Add pod uid to pod meta when failover (#344)
cheimu Jul 6, 2022
1328009
Use the structure as the key of the map instead of string. (#349)
novahe Jul 7, 2022
171ad3e
koordlet: define GPU metric struct (#343)
jasonliu747 Jul 8, 2022
1ab5c99
koord-scheduler: support default preferredCPUBindPolicy for LSE/LSR P…
eahydra Jul 11, 2022
0d9d9d4
style: unify the command parameter style of koordlet (#348)
jasonliu747 Jul 11, 2022
7d46fad
add fine-grained device scheduling proposal (#322)
buptcozy Jul 11, 2022
f81c89c
[koord-runtime-proxy]: fix panic when no hook registered (#355)
cheimu Jul 12, 2022
b78243b
koord-scheduler: support CPU exclusive policy (#359)
eahydra Jul 12, 2022
05a8c11
add pod annotations and labels to container request and cache (#362)
cheimu Jul 12, 2022
b2fcc22
fix the loss of new updated resources from UpdateContainerResources r…
cheimu Jul 13, 2022
993fc21
add scheduling framework extender (#365)
saintube Jul 14, 2022
283c883
koordlet: refine initJiffies with default value (#367)
jasonliu747 Jul 14, 2022
42d695f
add PodMigrationJob CRD proposal (#358)
eahydra Jul 14, 2022
78a4ebb
add schedule gang md (#333)
buptcozy Jul 14, 2022
8179245
koord-scheduler: support Node CPU orchestration API (#360)
eahydra Jul 14, 2022
463c409
chore: update dockerfile for each module (#364)
jasonliu747 Jul 14, 2022
6918290
feat(deps): bump github.com/stretchr/testify from 1.7.5 to 1.8.0 (#326)
dependabot[bot] Jul 14, 2022
d763879
feat(deps): bump gorm.io/driver/sqlite from 1.3.4 to 1.3.6 (#347)
dependabot[bot] Jul 14, 2022
fa0c35c
chore: supply UT for pkg/util and pkg/util/system (#374)
ZiMengSheng Jul 18, 2022
c9cf1a4
api: add PodMigrationJob API (#375)
eahydra Jul 18, 2022
f2ed65d
docs: remove redundant field in Device CRD (#377)
jasonliu747 Jul 18, 2022
91cacc4
api: add device crd in scheduling group (#376)
jasonliu747 Jul 18, 2022
b54bb0c
fix auditor test in MacOS (#379)
hormes Jul 18, 2022
91d2a4b
koordlet: optimize auditor UT with httptest.Server (#382)
ZiMengSheng Jul 19, 2022
d161ee3
docs: add chinese version readme.md (#380)
ZiMengSheng Jul 19, 2022
0523d60
fix: consider lse/lsr when cpu suppress (#234) (#372)
ZYecho Jul 19, 2022
dab5a92
api: add device info into NodeMetric CRD (#378)
jasonliu747 Jul 20, 2022
4301cc9
feat: collect gpu metrics (#361)
LambdaHJ Jul 20, 2022
4d9f218
chore: cleanup resmanager (#383)
saintube Jul 20, 2022
74de8bd
api: update reservation api (#384)
saintube Jul 20, 2022
d1fb8c5
add descheduler framework proposal (#371)
eahydra Jul 20, 2022
f32a0ba
feat(deps): bump gorm.io/gorm from 1.23.6 to 1.23.8 (#351)
dependabot[bot] Jul 20, 2022
bf308ed
fix: remove inline tag for corev1.ResourceList to fix #390 (#391)
jasonliu747 Jul 21, 2022
57c29bb
add gang Consts
Jul 26, 2022
d2d910d
add the GangAnnotationPrefix
Jul 26, 2022
9e49662
add gang Consts
Jul 26, 2022
40c334f
add gang Plugin
Jul 26, 2022
70cca7d
add gang Cache
Jul 26, 2022
eaa3be4
adjust consts
Jul 26, 2022
b7ce489
add the GangAnnotationPrefix2
Jul 26, 2022
3 changes: 1 addition & 2 deletions .github/workflows/ci.yaml
@@ -117,8 +117,7 @@ jobs:
           platforms: linux/amd64
           push: true
           pull: true
-          build-args: |
-            MODULE=${{ matrix.target }}
+          file: docker/${{ matrix.target }}.dockerfile
           labels: |
             org.opencontainers.image.title=${{ matrix.target }}
             org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
39 changes: 18 additions & 21 deletions .goreleaser.yaml
@@ -5,13 +5,13 @@ before:
   hooks:
     - go mod tidy
 builds:
-  - env:
+  - id: koord-runtime-proxy
+    env:
       - CGO_ENABLED=0
     goos:
       - linux
     goarch:
       - amd64
-    id: koord-runtime-proxy
     main: ./cmd/koord-runtime-proxy
     binary: koord-runtime-proxy
     ldflags:
@@ -20,13 +20,13 @@ builds:
       - -X github.com/koordinator-sh/koordinator/pkg/version.buildDate={{ .Date }}
       - -X github.com/koordinator-sh/koordinator/pkg/version.gitCommit={{ .Commit }}
       - -X github.com/koordinator-sh/koordinator/pkg/version.gitTreeState=clean
-  - env:
+  - id: koord-manager
+    env:
       - CGO_ENABLED=0
     goos:
       - linux
     goarch:
       - amd64
-    id: koord-manager
     main: ./cmd/koord-manager
     binary: koord-manager
     ldflags:
@@ -35,13 +35,13 @@ builds:
       - -X github.com/koordinator-sh/koordinator/pkg/version.buildDate={{ .Date }}
       - -X github.com/koordinator-sh/koordinator/pkg/version.gitCommit={{ .Commit }}
       - -X github.com/koordinator-sh/koordinator/pkg/version.gitTreeState=clean
-  - env:
+  - id: koord-scheduler
+    env:
       - CGO_ENABLED=0
     goos:
       - linux
     goarch:
       - amd64
-    id: koord-scheduler
     main: ./cmd/koord-scheduler
     binary: koord-scheduler
     ldflags:
@@ -50,13 +50,13 @@ builds:
       - -X github.com/koordinator-sh/koordinator/pkg/version.buildDate={{ .Date }}
       - -X github.com/koordinator-sh/koordinator/pkg/version.gitCommit={{ .Commit }}
       - -X github.com/koordinator-sh/koordinator/pkg/version.gitTreeState=clean
-  - env:
-      - CGO_ENABLED=0
+  - id: koordlet
+    env:
+      - CGO_ENABLED=1
     goos:
       - linux
     goarch:
       - amd64
-    id: koordlet
     main: ./cmd/koordlet
     binary: koordlet
     ldflags:
@@ -87,10 +87,11 @@ changelog:
       - '^chore:'
       - '^feat(deps):'
 dockers:
-  - image_templates:
+  - id: koord-manager
+    image_templates:
       - "ghcr.io/{{.ProjectName}}/koord-manager:{{ .Version }}"
       - "registry.cn-beijing.aliyuncs.com/{{.ProjectName}}/koord-manager:{{ .Version }}"
-    dockerfile: .goreleaser/Dockerfile
+    dockerfile: .goreleaser/koord-manager.dockerfile
     build_flag_templates:
       - "--pull"
       - "--label=org.opencontainers.image.title=koord-manager"
@@ -99,16 +100,15 @@ dockers:
       - "--label=org.opencontainers.image.revision={{.FullCommit}}"
       - "--label=org.opencontainers.image.version={{.Version}}"
       - "--label=org.opencontainers.image.licenses=Apache-2.0"
-      - "--build-arg=MODULE=koord-manager"
-    id: koord-manager
     ids:
       - koord-manager
     goos: linux
     goarch: amd64
-  - image_templates:
+  - id: koordlet
+    image_templates:
       - "ghcr.io/{{.ProjectName}}/koordlet:{{ .Version }}"
       - "registry.cn-beijing.aliyuncs.com/{{.ProjectName}}/koordlet:{{ .Version }}"
-    dockerfile: .goreleaser/Dockerfile
+    dockerfile: .goreleaser/koordlet.dockerfile
     build_flag_templates:
       - "--pull"
       - "--label=org.opencontainers.image.title=koordlet"
@@ -117,16 +117,15 @@ dockers:
       - "--label=org.opencontainers.image.revision={{.FullCommit}}"
       - "--label=org.opencontainers.image.version={{.Version}}"
       - "--label=org.opencontainers.image.licenses=Apache-2.0"
-      - "--build-arg=MODULE=koordlet"
-    id: koordlet
     ids:
       - koordlet
     goos: linux
     goarch: amd64
-  - image_templates:
+  - id: koord-scheduler
+    image_templates:
       - "ghcr.io/{{.ProjectName}}/koord-scheduler:{{ .Version }}"
       - "registry.cn-beijing.aliyuncs.com/{{.ProjectName}}/koord-scheduler:{{ .Version }}"
-    dockerfile: .goreleaser/Dockerfile
+    dockerfile: .goreleaser/koord-scheduler.dockerfile
     build_flag_templates:
       - "--pull"
       - "--label=org.opencontainers.image.title=koord-scheduler"
@@ -135,8 +134,6 @@ dockers:
       - "--label=org.opencontainers.image.revision={{.FullCommit}}"
       - "--label=org.opencontainers.image.version={{.Version}}"
       - "--label=org.opencontainers.image.licenses=Apache-2.0"
-      - "--build-arg=MODULE=koord-scheduler"
-    id: koord-scheduler
     ids:
       - koord-scheduler
     goos: linux
7 changes: 0 additions & 7 deletions .goreleaser/Dockerfile

This file was deleted.

4 changes: 4 additions & 0 deletions .goreleaser/koord-manager.dockerfile
@@ -0,0 +1,4 @@
FROM gcr.io/distroless/static:latest
WORKDIR /
COPY koord-manager .
ENTRYPOINT ["/koord-manager"]
4 changes: 4 additions & 0 deletions .goreleaser/koord-scheduler.dockerfile
@@ -0,0 +1,4 @@
FROM gcr.io/distroless/static:latest
WORKDIR /
COPY koord-scheduler .
ENTRYPOINT ["/koord-scheduler"]
4 changes: 4 additions & 0 deletions .goreleaser/koordlet.dockerfile
@@ -0,0 +1,4 @@
FROM nvidia/cuda:11.6.1-base-ubuntu20.04
WORKDIR /
COPY koordlet .
ENTRYPOINT ["/koordlet"]
33 changes: 0 additions & 33 deletions Dockerfile

This file was deleted.

6 changes: 3 additions & 3 deletions Makefile
@@ -114,15 +114,15 @@ docker-build: test docker-build-koordlet docker-build-koord-manager docker-build

 .PHONY: docker-build-koordlet
 docker-build-koordlet: ## Build docker image with the koordlet.
-	docker build --build-arg MODULE=koordlet -t ${KOORDLET_IMG} .
+	docker build --pull -t ${KOORDLET_IMG} -f docker/koordlet.dockerfile .

 .PHONY: docker-build-koord-manager
 docker-build-koord-manager: ## Build docker image with the koord-manager.
-	docker build --build-arg MODULE=koord-manager -t ${KOORD_MANAGER_IMG} .
+	docker build --pull -t ${KOORD_MANAGER_IMG} -f docker/koord-manager.dockerfile .

 .PHONY: docker-build-koord-scheduler
 docker-build-koord-scheduler: ## Build docker image with the scheduler.
-	docker build --build-arg MODULE=koord-scheduler -t ${KOORD_SCHEDULER_IMG} .
+	docker build --pull -t ${KOORD_SCHEDULER_IMG} -f docker/koord-scheduler.dockerfile .

 .PHONY: docker-push
 docker-push: docker-push-koordlet docker-push-koord-manager docker-push-koord-scheduler
65 changes: 65 additions & 0 deletions README-zh_CN.md
@@ -0,0 +1,65 @@
<h1 align="center">
<p align="center">Koordinator</p>
<a href="https://koordinator.sh"><img src="https://github.com/koordinator-sh/koordinator/raw/main/docs/images/koordinator-logo.jpeg" alt="Koordinator"></a>
</h1>

[![License](https://img.shields.io/github/license/koordinator-sh/koordinator.svg?color=4EB1BA&style=flat-square)](https://opensource.org/licenses/Apache-2.0)
[![GitHub release](https://img.shields.io/github/v/release/koordinator-sh/koordinator.svg?style=flat-square)](https://github.com/koordinator-sh/koordinator/releases/latest)
[![CI](https://img.shields.io/github/workflow/status/koordinator-sh/koordinator/CI?label=CI&logo=github&style=flat-square)](https://github.com/koordinator-sh/koordinator/actions/workflows/ci.yaml)
[![Go Report Card](https://goreportcard.com/badge/github.com/koordinator-sh/koordinator?style=flat-square)](https://goreportcard.com/report/github.com/koordinator-sh/koordinator)
[![codecov](https://img.shields.io/codecov/c/github/koordinator-sh/koordinator?logo=codecov&style=flat-square)](https://codecov.io/github/koordinator-sh/koordinator)
[![PRs Welcome](https://badgen.net/badge/PRs/welcome/green?icon=https://api.iconify.design/octicon:git-pull-request.svg?color=white&style=flat-square)](CONTRIBUTING.md)
[![Slack](https://badgen.net/badge/slack/join/4A154B?icon=slack&style=flat-square)](https://join.slack.com/t/koordinator-sh/shared_invite/zt-1756qoub4-Cn4~esfdlfAPsD7cwO2NzA)


[English](./README.md) | 简体中文



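## Introduction

Koordinator is a QoS-based system that supports co-located scheduling of multiple kinds of workloads on Kubernetes. It aims to improve the runtime efficiency and reliability of workloads (both latency-sensitive workloads and batch jobs), simplify resource-related configuration tuning, and increase Pod deployment density to improve resource utilization.

Koordinator enhances the experience of managing workloads on Kubernetes by providing the following features:

- Well-designed Priority and QoS mechanisms that support co-locating different workloads in a cluster or on a single node.
- An application profiling mechanism that supports resource overcommitment, achieving high resource utilization while still satisfying QoS guarantees.
- Fine-grained resource orchestration and isolation mechanisms that improve the efficiency of workloads (both latency-sensitive workloads and batch jobs).
- A flexible job scheduling mechanism that supports workloads in specific domains (such as big data, AI, audio, and video).
- A set of tools for monitoring, troubleshooting, and operations.

## Quick Start

You can find the complete documentation set on the [Koordinator website](https://koordinator.sh/docs).

- Install or upgrade to the [latest version](https://koordinator.sh/docs/installation) of Koordinator.
- See the [best practices](https://koordinator.sh/docs/best-practices/colocation-of-spark-jobs), which include examples of running co-located workloads.

## Code of Conduct

The Koordinator community follows the [Code of Conduct](CODE_OF_CONDUCT.md). We encourage everyone to read it before participating.

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge that everyone who participates in our project and community will be free from harassment, regardless of age, body size, disability, ethnicity, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Contributing

Everyone in the community is very welcome to help build Koordinator. You can start with the [CONTRIBUTING.md](CONTRIBUTING.md) guide.

## Membership

We encourage all contributors to become members. We aim to grow an active, healthy community of contributors, reviewers, and code owners. Learn more about membership requirements and responsibilities on our [community membership](docs/community/community-membership.md) page.

## Community

You can reach the maintainers of this project through the following channels:

- [Slack](https://join.slack.com/t/koordinator-sh/shared_invite/zt-1756qoub4-Cn4~esfdlfAPsD7cwO2NzA)
- DingTalk (Chinese): search for group ID `33383887` or scan the QR code to join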

<div>
<img src="https://github.com/koordinator-sh/koordinator/raw/main/docs/images/dingtalk.png" width="300" alt="Dingtalk QRCode">
</div>

## License

Koordinator is licensed under the Apache License, Version 2.0. See [LICENSE](./LICENSE) for the full license text.
3 changes: 2 additions & 1 deletion README.md
@@ -11,6 +11,7 @@
 [![PRs Welcome](https://badgen.net/badge/PRs/welcome/green?icon=https://api.iconify.design/octicon:git-pull-request.svg?color=white&style=flat-square)](CONTRIBUTING.md)
 [![Slack](https://badgen.net/badge/slack/join/4A154B?icon=slack&style=flat-square)](https://join.slack.com/t/koordinator-sh/shared_invite/zt-1756qoub4-Cn4~esfdlfAPsD7cwO2NzA)

+English | [简体中文](./README-zh_CN.md)
 ## Introduction

 Koordinator is a QoS based scheduling system for hybrid orchestration workloads on Kubernetes. It aims to improve the
@@ -42,7 +43,7 @@ before participating.

 In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making
 participation in our project and our community a harassment-free experience for everyone, regardless of age, body size,
-disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status,
+disability, ethnicity, level of experience, education, socio-economic status,
 nationality, personal appearance, race, religion, or sexual identity and orientation.

 ## Contributing
31 changes: 31 additions & 0 deletions apis/extension/node.go
@@ -20,6 +20,8 @@ import (
 	"encoding/json"

 	"k8s.io/apimachinery/pkg/types"
+
+	schedulingconfig "github.com/koordinator-sh/koordinator/apis/scheduling/config"
 )

 const (
@@ -30,6 +32,22 @@ const (
 	// AnnotationNodeCPUSharedPools describes the CPU Shared Pool defined by Koordinator.
 	// The shared pool is mainly used by Koordinator LS Pods or K8s Burstable Pods.
 	AnnotationNodeCPUSharedPools = NodeDomainPrefix + "/cpu-shared-pools"
+
+	// LabelNodeCPUBindPolicy constrains how to bind CPU logical CPUs when scheduling.
+	LabelNodeCPUBindPolicy = NodeDomainPrefix + "/cpu-bind-policy"
+	// LabelNodeNUMAAllocateStrategy indicates how to choose satisfied NUMA Nodes when scheduling.
+	LabelNodeNUMAAllocateStrategy = NodeDomainPrefix + "/numa-allocate-strategy"
+)
+
+const (
+	// NodeCPUBindPolicyFullPCPUsOnly requires that the scheduler must allocate full physical cores.
+	// Equivalent to kubelet CPU manager policy option full-pcpus-only=true.
+	NodeCPUBindPolicyFullPCPUsOnly = "FullPCPUsOnly"
+)
+
+const (
+	NodeNUMAAllocateStrategyLeastAllocated = string(schedulingconfig.NUMALeastAllocated)
+	NodeNUMAAllocateStrategyMostAllocated = string(schedulingconfig.NUMAMostAllocated)
 )

 type CPUTopology struct {
@@ -77,3 +95,16 @@ func GetPodCPUAllocs(annotations map[string]string) (PodCPUAllocs, error) {
 	}
 	return allocs, nil
 }
+
+func GetNodeCPUSharePools(nodeTopoAnnotations map[string]string) ([]CPUSharedPool, error) {
+	var cpuSharePools []CPUSharedPool
+	data, ok := nodeTopoAnnotations[AnnotationNodeCPUSharedPools]
+	if !ok {
+		return cpuSharePools, nil
+	}
+	err := json.Unmarshal([]byte(data), &cpuSharePools)
+	if err != nil {
+		return nil, err
+	}
+	return cpuSharePools, nil
+}
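
For reviewers who want to try the new node helpers in isolation, the following is a minimal, self-contained sketch of how `GetNodeCPUSharePools` and the `LabelNodeCPUBindPolicy` label might be consumed. It is not the code from this PR: the value of `nodeDomainPrefix`, the fields of `cpuSharedPool`, and the `requiresFullPCPUs` helper are assumptions made for illustration, since the diff does not show them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed values for illustration; the real constants live in apis/extension
// and the prefix is not visible in this diff.
const (
	nodeDomainPrefix               = "node.koordinator.sh"
	annotationNodeCPUSharedPools   = nodeDomainPrefix + "/cpu-shared-pools"
	labelNodeCPUBindPolicy         = nodeDomainPrefix + "/cpu-bind-policy"
	nodeCPUBindPolicyFullPCPUsOnly = "FullPCPUsOnly"
)

// cpuSharedPool is a hypothetical stand-in for CPUSharedPool; its fields are assumed.
type cpuSharedPool struct {
	Socket int32  `json:"socket"`
	Node   int32  `json:"node"`
	CPUSet string `json:"cpuset,omitempty"`
}

// getNodeCPUSharePools mirrors the helper added in the diff: a missing
// annotation yields an empty result and no error, malformed JSON yields an error.
func getNodeCPUSharePools(annotations map[string]string) ([]cpuSharedPool, error) {
	var pools []cpuSharedPool
	data, ok := annotations[annotationNodeCPUSharedPools]
	if !ok {
		return pools, nil
	}
	if err := json.Unmarshal([]byte(data), &pools); err != nil {
		return nil, err
	}
	return pools, nil
}

// requiresFullPCPUs is a hypothetical helper that reports whether the node
// label asks the scheduler to allocate whole physical cores.
func requiresFullPCPUs(labels map[string]string) bool {
	return labels[labelNodeCPUBindPolicy] == nodeCPUBindPolicyFullPCPUsOnly
}

func main() {
	annotations := map[string]string{
		annotationNodeCPUSharedPools: `[{"socket":0,"node":0,"cpuset":"0-3"}]`,
	}
	labels := map[string]string{labelNodeCPUBindPolicy: nodeCPUBindPolicyFullPCPUsOnly}

	pools, err := getNodeCPUSharePools(annotations)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(len(pools), pools[0].CPUSet, requiresFullPCPUs(labels))
}
```

Note that callers cannot distinguish "annotation absent" from "annotation present but empty list" with this return shape; both come back as a zero-length slice with a nil error.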
6 changes: 3 additions & 3 deletions apis/extension/pod.go
@@ -27,7 +27,7 @@ import (
 const (
 	AnnotationPodCPUBurst = DomainPrefix + "cpuBurst"

-	AnnotationPodMemoryQoS = DomainPrefix + "memoryQoS"
+	AnnotationPodMemoryQoS = DomainPrefix + "memoryQOS"
 )

 func GetPodCPUBurstConfig(pod *corev1.Pod) (*slov1aplhpa1.CPUBurstConfig, error) {
@@ -47,15 +47,15 @@ func GetPodCPUBurstConfig(pod *corev1.Pod) (*slov1aplhpa1.CPUBurstConfig, error)
 	return &cpuBurst, nil
 }

-func GetPodMemoryQoSConfig(pod *corev1.Pod) (*slov1aplhpa1.PodMemoryQoSConfig, error) {
+func GetPodMemoryQoSConfig(pod *corev1.Pod) (*slov1aplhpa1.PodMemoryQOSConfig, error) {
 	if pod == nil || pod.Annotations == nil {
 		return nil, nil
 	}
 	value, exist := pod.Annotations[AnnotationPodMemoryQoS]
 	if !exist {
 		return nil, nil
 	}
-	cfg := slov1aplhpa1.PodMemoryQoSConfig{}
+	cfg := slov1aplhpa1.PodMemoryQOSConfig{}
 	err := json.Unmarshal([]byte(value), &cfg)
 	if err != nil {
 		return nil, err
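
The pod.go change renames the annotation key suffix from `memoryQoS` to `memoryQOS`, and `GetPodMemoryQoSConfig` returns `(nil, nil)` when the annotation is absent. The sketch below illustrates what that contract implies for callers. It is a simplified stand-in, not the PR's code: it assumes `DomainPrefix` is `koordinator.sh/`, takes a plain annotation map instead of a `*corev1.Pod`, and uses a hypothetical `Policy` field because the real `PodMemoryQOSConfig` fields are not shown in the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed key: DomainPrefix is taken to be "koordinator.sh/" for illustration.
const annotationPodMemoryQoS = "koordinator.sh/memoryQOS"

// podMemoryQOSConfig is a hypothetical stand-in; the Policy field is illustrative only.
type podMemoryQOSConfig struct {
	Policy string `json:"policy,omitempty"`
}

// getPodMemoryQoSConfig follows the pattern in the diff: no annotation means
// (nil, nil), malformed JSON means (nil, err).
func getPodMemoryQoSConfig(annotations map[string]string) (*podMemoryQOSConfig, error) {
	value, exist := annotations[annotationPodMemoryQoS]
	if !exist {
		return nil, nil
	}
	cfg := podMemoryQOSConfig{}
	if err := json.Unmarshal([]byte(value), &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

// describe shows the three cases a caller has to handle: an error, a nil
// config without an error, and a parsed config.
func describe(annotations map[string]string) string {
	cfg, err := getPodMemoryQoSConfig(annotations)
	switch {
	case err != nil:
		return "invalid"
	case cfg == nil:
		return "unset"
	default:
		return "policy=" + cfg.Policy
	}
}

func main() {
	fmt.Println(describe(nil))
	fmt.Println(describe(map[string]string{annotationPodMemoryQoS: `{"policy":"auto"}`}))
	fmt.Println(describe(map[string]string{annotationPodMemoryQoS: `{`}))
	// A pod that still carries the pre-rename key is not matched by the new lookup.
	fmt.Println(describe(map[string]string{"koordinator.sh/memoryQoS": `{"policy":"auto"}`}))
}
```

Under these assumptions, a pod annotated with the old `memoryQoS` key would be treated as having no memory QoS configuration after the rename, so existing workloads may need their annotations updated.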