feat: allow custom compute pod affinities, nodeSelector and tolerations #935

Merged on Jul 23, 2024 (22 commits)
1 change: 1 addition & 0 deletions backend/image_transfer/encoder.py
@@ -98,6 +98,7 @@ def get_manifests_and_list_of_all_blobs(
  raise RegistryPreconditionFailedException(
  f"{docker_image} is either not scanned yet or not passing the vulnerability checks."
  ) from e
+ raise e
  manifests.append(manifest)
  blobs_to_pull += blobs
  return manifests, blobs_to_pull
2 changes: 2 additions & 0 deletions backend/substrapp/clients/organization.py
@@ -184,7 +184,9 @@ def get(
  ) -> bytes:
      """Get asset data."""
      content = _http_request(_Method.GET, channel, organization_id, url).content
+
      new_checksum = compute_hash(content, key=salt)
+
      if new_checksum != checksum:
          raise IntegrityError(f"url {url}: checksum doesn't match {checksum} vs {new_checksum}")
      return content
21 changes: 4 additions & 17 deletions backend/substrapp/compute_tasks/compute_pod.py
@@ -2,6 +2,7 @@

  import kubernetes
  import structlog
+ import yaml
  from django.conf import settings

  from substrapp.kubernetes_utils import delete_pod
@@ -120,22 +121,6 @@ def create_pod(
          **container_optional_kwargs,
      )

-     pod_affinity = kubernetes.client.V1Affinity(
-         pod_affinity=kubernetes.client.V1PodAffinity(
-             required_during_scheduling_ignored_during_execution=[
-                 kubernetes.client.V1PodAffinityTerm(
-                     label_selector=kubernetes.client.V1LabelSelector(
-                         match_expressions=[
-                             kubernetes.client.V1LabelSelectorRequirement(
-                                 key="statefulset.kubernetes.io/pod-name", operator="In", values=[os.getenv("HOSTNAME")]
-                             )
-                         ]
-                     ),
-                     topology_key="kubernetes.io/hostname",
-                 )
-             ]
-         )
-     )
      image_pull_secret = os.getenv("DOCKER_CONFIG_SECRET_NAME")

      if image_pull_secret:
@@ -144,7 +129,9 @@ def create_pod(
          image_pull_secrets = None
      spec = kubernetes.client.V1PodSpec(
          restart_policy="Never",
-         affinity=pod_affinity,
+         affinity=yaml.safe_load(os.getenv("COMPUTE_POD_AFFINITY")),
+         node_selector=yaml.safe_load(os.getenv("COMPUTE_POD_NODE_SELECTOR")),
+         tolerations=yaml.safe_load(os.getenv("COMPUTE_POD_TOLERATIONS")),
          containers=[container_compute],
          volumes=volumes + gpu_volume,
          security_context=get_pod_security_context(),
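A note on the `yaml.safe_load(os.getenv(...))` calls in `compute_pod.py`: `os.getenv` returns `None` when a variable is unset, and PyYAML's `safe_load` expects a string or a stream, so the calls appear to rely on the chart always rendering the three variables. The following is a minimal sketch of a more defensive parse, not the code merged in this PR. The helper name `load_yaml_env` and the pod name `backend-worker-0` are made up for illustration.

```python
import os
from typing import Any, Optional

import yaml


def load_yaml_env(name: str, default: Optional[Any] = None) -> Any:
    """Parse a YAML-encoded environment variable, falling back to `default`.

    Hypothetical helper (not in the PR): it returns `default` when the
    variable is unset, blank, or parses to YAML null.
    """
    raw = os.getenv(name)
    if raw is None or not raw.strip():
        return default
    parsed = yaml.safe_load(raw)
    return default if parsed is None else parsed


# Values shaped like what the chart renders with `toYaml ... | quote`.
# "backend-worker-0" is an assumed pod name used only for this example.
os.environ["COMPUTE_POD_NODE_SELECTOR"] = "{}"
os.environ["COMPUTE_POD_TOLERATIONS"] = "[]"
os.environ["COMPUTE_POD_AFFINITY"] = """
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: statefulset.kubernetes.io/pod-name
            operator: In
            values:
              - backend-worker-0
      topologyKey: kubernetes.io/hostname
"""

affinity = load_yaml_env("COMPUTE_POD_AFFINITY")
node_selector = load_yaml_env("COMPUTE_POD_NODE_SELECTOR", default={})
tolerations = load_yaml_env("COMPUTE_POD_TOLERATIONS", default=[])
```

The parsed values are plain dicts and lists with the same camelCase keys as the Kubernetes API, which is the shape the diff passes to `V1PodSpec`.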
1 change: 0 additions & 1 deletion backend/substrapp/compute_tasks/image_builder.py
@@ -33,7 +33,6 @@ def push_blob_to_registry(blob: bytes, tag: str) -> None:
  def load_remote_function_image(function: orchestrator.Function, channel: str) -> None:
      # Ask the backend owner of the function if it's available
      container_image_tag = utils.container_image_tag_from_function(function)
-
      function_image_content = organization_client.get(
          channel=channel,
          organization_id=function.owner,
5 changes: 5 additions & 0 deletions charts/substra-backend/CHANGELOG.md
@@ -1,6 +1,11 @@
  # Changelog

  <!-- towncrier release notes start -->
+ ## [26.9.0] - 2024-07-22
+
+ # Added
+
+ Configuration of compute pod `affinity`, `nodeSelector` and `tolerations` in the `values.yaml` file.

  ## [26.8.3] - 2024-07-16

4 changes: 2 additions & 2 deletions charts/substra-backend/Chart.yaml
@@ -1,8 +1,8 @@
  apiVersion: v2
  name: substra-backend
  home: https://github.com/Substra
- version: 26.8.3
- appVersion: 0.47.0
+ version: "26.9.0"
+ appVersion: "0.47.0"
  kubeVersion: ">= 1.19.0-0"
  description: Main package for Substra
  type: application
117 changes: 62 additions & 55 deletions charts/substra-backend/README.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions charts/substra-backend/changes/935.changed
@@ -0,0 +1 @@
+ Compute pod `affinity`, `nodeSelector` and `tolerations` are now configured through environment variables defined in the `values.yaml` file.
12 changes: 11 additions & 1 deletion charts/substra-backend/templates/statefulset-worker.yaml
@@ -131,6 +131,16 @@ spec:
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
+ - name: POD_NAME
+   valueFrom:
+     fieldRef:
+       fieldPath: metadata.name
+ - name: COMPUTE_POD_AFFINITY
+   value: {{ toYaml .Values.worker.computePod.affinity | quote }}
+ - name: COMPUTE_POD_NODE_SELECTOR
+   value: {{ toYaml .Values.worker.computePod.nodeSelector | quote }}
+ - name: COMPUTE_POD_TOLERATIONS
+   value: {{ toYaml .Values.worker.computePod.tolerations | quote }}
  - name: COMPUTE_POD_RESOURCES
    value: {{ toYaml .Values.worker.computePod.resources | quote }}
  - name: COMPUTE_POD_MAX_STARTUP_WAIT_SECONDS
@@ -231,7 +241,7 @@ spec:
  - metadata:
      name: subtuple
    spec:
-     accessModes: [ "ReadWriteOnce" ]
+     accessModes: {{ .Values.worker.accessModes }}
      {{ include "common.storage.class" .Values.worker.persistence }}
      resources:
        requests:
27 changes: 25 additions & 2 deletions charts/substra-backend/values.yaml
@@ -283,7 +283,7 @@ server:
        ##
        honorLabels: false

- ## @section Substra worker settings
+ ## @section Substra worker settings. Note that you can access the worker pod name using $(POD_NAME) and its node using $(NODE_NAME).
  ##
  worker:
    ## @param worker.enabled Enable worker service
@@ -376,6 +376,27 @@ worker:
          memory: "1Gi"
        limits:
          memory: "64Gi"
+     ## @param worker.computePod.nodeSelector Node labels for pod assignment
+     ##
+     nodeSelector: {}
+     ## @param worker.computePod.tolerations Tolerations for pod assignment
+     ##
+     tolerations: []
+     ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].labelSelector.matchExpressions[0].key Pod affinity rule definition.
+     ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].labelSelector.matchExpressions[0].operator Pod affinity rule definition.
+     ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].labelSelector.matchExpressions[0].values Pod affinity rule definition.
+     ## @param worker.computePod.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution[0].topologyKey Pod affinity rule definition.
+     ##
+     affinity:
+       podAffinity:
+         requiredDuringSchedulingIgnoredDuringExecution:
+           - labelSelector:
+               matchExpressions:
+                 - key: statefulset.kubernetes.io/pod-name
+                   operator: In
+                   values:
+                     - $(POD_NAME)
+             topologyKey: kubernetes.io/hostname
    events:
      ## @param worker.events.enabled Enable event service
      ##
@@ -435,7 +456,9 @@ worker:
      ## If not set and create is true, a name is generated using the substra.fullname template
      ##
      name: ""
-
+   ## @param worker.accessModes Access modes for volume
+   ##
+   accessModes: ["ReadWriteOnce"]
  ## @section Substra periodic tasks worker settings
  ##
  schedulerWorker:
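The `$(POD_NAME)` placeholder in the default affinity relies on Kubernetes dependent environment variables. A `$(VAR)` reference in an env `value` is replaced using variables declared earlier in the same container's `env` list, an unresolved reference is left as-is, and `$$` escapes a literal `$`. That is presumably why `POD_NAME` is declared before `COMPUTE_POD_AFFINITY` in `statefulset-worker.yaml`. The sketch below only imitates that rule in Python for illustration. `expand` and `resolve_env` are hypothetical names, `substra-worker-0` is a made-up pod name, and the real substitution is done by the kubelet, not by backend code.

```python
import re
from typing import Dict, Iterable, Mapping, Tuple

# Matches either the escape sequence "$$" or a "$(NAME)" reference.
_REF = re.compile(r"\$\$|\$\(([^()]+)\)")


def expand(value: str, env: Mapping[str, str]) -> str:
    """Replace $(NAME) with env[NAME]; leave unknown references untouched."""

    def _sub(match: "re.Match[str]") -> str:
        if match.group(0) == "$$":
            return "$"  # "$$" is an escaped dollar sign
        return env.get(match.group(1), match.group(0))

    return _REF.sub(_sub, value)


def resolve_env(entries: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Resolve entries in order; each value only sees earlier variables."""
    resolved: Dict[str, str] = {}
    for name, value in entries:
        resolved[name] = expand(value, resolved)
    return resolved


# In the chart POD_NAME comes from a fieldRef; a literal stands in for it here.
affinity_value = "values:\n- $(POD_NAME)"

in_order = resolve_env(
    [("POD_NAME", "substra-worker-0"), ("COMPUTE_POD_AFFINITY", affinity_value)]
)
reversed_order = resolve_env(
    [("COMPUTE_POD_AFFINITY", affinity_value), ("POD_NAME", "substra-worker-0")]
)

print(in_order["COMPUTE_POD_AFFINITY"])        # reference replaced
print(reversed_order["COMPUTE_POD_AFFINITY"])  # reference left as $(POD_NAME)
```

With the reversed order the compute pod affinity would carry the literal string `$(POD_NAME)`, which matches no pod, so the ordering of the env entries in the template matters.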