Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mlflow-builder: fix OOM failures during build with bigger images #74

Merged
merged 1 commit into from
Nov 16, 2021

Conversation

stefannica
Copy link
Member

If the k8s node where the MLFlow builder step is running doesn't
have a lot of memory, the builder step will fail if it has to build
larger images. For example, building the trainer image for the keras
CIFAR10 codeset example resulted in an OOM failure on a node where
only 8GB of memory were available.

This is a known kaniko issue [1] and there's a fix available [2] with
more recent (>=1.7.0) kaniko versions: disabling the compressed
caching via the --compressed-caching command line argument.

This commit models a workflow input parameter mapped to this
new command line argument. To avoid OOM errors with bigger
images, the user may set it in the workflow like so:

  - name: builder
    image: ghcr.io/stefannica/mlflow-builder:latest
    inputs:
      - name: mlflow-codeset
        codeset:
          name: '{{ inputs.mlflow-codeset }}'
          path: /project
      - name: compressed_caching
        # Disable compressed caching to avoid running into OOM errors on cluster nodes with lower memory
        value: false

[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722

If the k8s node where the MLFlow builder step is running doesn't
have a lot of memory, the builder step will fail if it has to build
larger images. For example, building the trainer image for the keras
CIFAR10 codeset example resulted in an OOM failure on a node where
only 8GB of memory were available.

This is a known kaniko issue [1] and there's a fix available [2] with
more recent (>=1.7.0) kaniko versions: disabling the compressed
caching via the `--compressed-caching` command line argument.

This commit models a workflow input parameter mapped to this
new command line argument. To avoid OOM errors with bigger
images, the user may set it in the workflow like so:

```
  - name: builder
    image: ghcr.io/stefannica/mlflow-builder:latest
    inputs:
      - name: mlflow-codeset
        codeset:
          name: '{{ inputs.mlflow-codeset }}'
          path: /project
      - name: compressed_caching
        # Disable compressed caching to avoid running into OOM errors on cluster nodes with lower memory
        value: false
```

[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722
@stefannica stefannica merged commit 7be75c8 into fuseml:main Nov 16, 2021
@stefannica stefannica deleted the fix-builder-oom branch November 16, 2021 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants