Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: make image builder timeout configurable #992

Merged
merged 8 commits into from
Sep 17, 2024
2 changes: 1 addition & 1 deletion backend/backend/settings/deps/image_build.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
import os

# How long we wait before throwing errors, in seconds
IMAGE_BUILD_TIMEOUT = 3 * 60 * 60 # 3 hours
IMAGE_BUILD_TIMEOUT = int(os.getenv("IMAGE_BUILD_TIMEOUT", 60 * 60)) # 1 hour
# Delay before two check
IMAGE_BUILD_CHECK_DELAY = 5
BUILDER_KANIKO_STARTUP_MAX_ATTEMPTS = int(os.getenv("BUILDER_KANIKO_STARTUP_MAX_ATTEMPTS", 60))
Expand Down
1 change: 0 additions & 1 deletion backend/substrapp/compute_tasks/image_builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,6 @@

logger = structlog.get_logger(__name__)

IMAGE_BUILD_TIMEOUT = settings.IMAGE_BUILD_TIMEOUT
IMAGE_BUILD_CHECK_DELAY = settings.IMAGE_BUILD_CHECK_DELAY
REGISTRY = settings.REGISTRY
REGISTRY_SCHEME = settings.REGISTRY_SCHEME
Expand Down
10 changes: 10 additions & 0 deletions backend/substrapp/exceptions.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,3 +32,13 @@ class IntegrityError(Exception):

class ServerMediasNoSubdirError(Exception):
"""A supplied servermedias path didn't contain the expected subdir"""


class LockError(Exception):
"""Cannot get a lock file to write on"""

lock_file: str

def __init__(self, *args, lock_file: str, **kwargs) -> None:
self.lock_file = lock_file
super().__init__(args, kwargs)
4 changes: 3 additions & 1 deletion backend/substrapp/lock_local.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@

import structlog

from substrapp.exceptions import LockError

logger = structlog.get_logger(__name__)

LOCK_FILE_FOLDER = "/tmp" # nosec
Expand Down Expand Up @@ -47,7 +49,7 @@ def lock_resource(
did_wait = True
logger.debug("Lock: Waiting for lock to be released", lock_file=lock_file)
if time.time() - start > timeout:
raise Exception(f"Failed to acquire lock after {timeout} seconds", lock_file=lock_file)
raise LockError(f"Failed to acquire lock after {timeout} seconds", lock_file=lock_file)
time.sleep(delay)

logger.debug("Lock: Acquired", lock_file=lock_file)
Expand Down
20 changes: 20 additions & 0 deletions backend/substrapp/tests/test_lock_local.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
from pathlib import Path

import pytest

from substrapp import lock_local
from substrapp.exceptions import LockError


# Test that `lock_local.lock_resource() returns the correct excception
def test_lock_local_timeout():
resource_type = "test"
unique_identifier = "1"
# Create fake lock
slug = f"{resource_type}_{unique_identifier}"
lock_path = Path(lock_local._get_lock_file_path(slug))
lock_path.write_text("unique_id")

with pytest.raises(LockError):
with lock_local.lock_resource(resource_type, unique_identifier, timeout=1):
pass
1 change: 1 addition & 0 deletions changes/992.added
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Add configurable builder timeout through the env var `IMAGE_BUILD_TIMEOUT`
6 changes: 6 additions & 0 deletions charts/substra-backend/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,12 @@
# Changelog

<!-- towncrier release notes start -->
## [26.11.0] - 2024-09-17

# Added

New configurable value `.Values.builder.timeout` configuring the duration before a building task would fail.

## [26.10.2] - 2024-09-17

# Fixed
Expand Down
2 changes: 1 addition & 1 deletion charts/substra-backend/Chart.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
apiVersion: v2
name: substra-backend
home: https://github.com/Substra
version: 26.10.2
version: 26.11.0
appVersion: 0.48.0
kubeVersion: '>= 1.19.0-0'
description: Main package for Substra
Expand Down
1 change: 1 addition & 0 deletions charts/substra-backend/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -236,6 +236,7 @@ See [UPGRADE.md](https://github.com/Substra/substra-backend/blob/main/charts/sub
| `builder.enabled` | Enable builder service | `true` |
| `builder.replicaCount` | Replica count for the builder service | `1` |
| `builder.concurrency` | Maximum amount of tasks to process in parallel | `1` |
| `builder.timeout` | Number of seconds to wait before failing a build | `3600` |
| `builder.kanikoStartup.maxAttempts` | Number of checks done before considering a kaniko pod has failed to spawn | `60` |
| `builder.kanikoStartup.checkDelay` | Time, in seconds, to wait when a container is in pending state before checking again its status | `2` |
| `builder.image.registry` | Substra backend server image registry | `ghcr.io` |
Expand Down
2 changes: 2 additions & 0 deletions charts/substra-backend/templates/statefulset-builder.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -199,6 +199,8 @@ spec:
value: {{ toYaml .Values.builder.kanikoStartup.maxAttempts | quote }}
- name: BUILDER_KANIKO_STARTUP_PENDING_STATE_WAIT_SECONDS
value: {{ toYaml .Values.builder.kanikoStartup.checkDelay | quote }}
- name: IMAGE_BUILD_TIMEOUT
value: {{ toYaml .Values.builder.timeout | quote }}
ports:
- name: http
containerPort: 8000
Expand Down
4 changes: 4 additions & 0 deletions charts/substra-backend/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -583,6 +583,10 @@ builder:
##
concurrency: 1

## @param builder.timeout Number of seconds to wait before failing a build
##
timeout: 3600

## @param builder.kanikoStartup.maxAttempts Number of checks done before considering a kaniko pod has failed to spawn
## @param builder.kanikoStartup.checkDelay Time, in seconds, to wait when a container is in pending state before checking again its status
##
Expand Down
Loading