mlflow-builder: fix OOM failures during build with bigger images
If the k8s node running the MLFlow builder step is low on memory,
the step fails when it has to build larger images. For example,
building the trainer image for the keras CIFAR10 codeset example
resulted in an OOM failure on a node where only 8GB of memory
was available.

This is a known kaniko issue [1], and a fix [2] is available in
more recent (>=1.7.0) kaniko versions: compressed caching can be
disabled by passing `--compressed-caching=false` on the command line.

This commit adds a workflow input parameter that maps to this
new command line argument. To avoid OOM errors with bigger
images, the user can set it to `false` in the workflow like so:

```
  - name: builder
    image: ghcr.io/stefannica/mlflow-builder:latest
    inputs:
      - name: mlflow-codeset
        codeset:
          name: '{{ inputs.mlflow-codeset }}'
          path: /project
      - name: compressed_caching
        # Disable compressed caching to avoid running into OOM errors on cluster nodes with lower memory
        value: false
```

[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722
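
The mapping from the input value to the kaniko flag can be sketched as follows. This is only an illustration, not code from this commit: `compressed_caching_flag` is a hypothetical helper, and it assumes the `compressed_caching` input reaches the container as the `FUSEML_COMPRESSED_CACHING` environment variable. Unlike the plain substitution used in `run.sh`, the sketch falls back to `true` (the image default) when the value is empty or not recognized as false.

```shell
#!/bin/sh
# Hypothetical helper, not part of run.sh: build the kaniko flag from the
# value of FUSEML_COMPRESSED_CACHING. Anything other than an explicit
# "false" (including an empty or unset value) keeps compressed caching on.
compressed_caching_flag() {
  case "${1:-true}" in
    false|False|FALSE) echo "--compressed-caching=false" ;;
    *)                 echo "--compressed-caching=true" ;;
  esac
}

# Example: the workflow input shown above sets the variable to "false".
FUSEML_COMPRESSED_CACHING=false
compressed_caching_flag "$FUSEML_COMPRESSED_CACHING"
```

The resulting string would be appended to the `/kaniko/executor` command line in place of the direct `--compressed-caching=${FUSEML_COMPRESSED_CACHING}` substitution.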
stefannica committed Nov 15, 2021
1 parent e7fb66d commit fc4ffa8
Showing 2 changed files with 3 additions and 1 deletion.
2 changes: 2 additions & 0 deletions images/builders/mlflow/Dockerfile
@@ -47,5 +47,7 @@ ENV FUSEML_MINICONDA_VERSION ""
ENV FUSEML_INTEL_OPTIMIZED false
ENV FUSEML_BASE_IMAGE ""
ENV FUSEML_VERBOSE false
+# Set to false to reduce memory usage with larger images and avoid OOM problems
+ENV FUSEML_COMPRESSED_CACHING true

ENTRYPOINT ["run"]
2 changes: 1 addition & 1 deletion images/builders/mlflow/run.sh
@@ -87,7 +87,7 @@ else
$FUSEML_VERBOSE && cat conda.yaml
$FUSEML_VERBOSE && cat .fuseml/Dockerfile

-/kaniko/executor --insecure --dockerfile=.fuseml/Dockerfile --context=./ --destination=${registry}/${repository}:${tag} $BUILDARGS
+/kaniko/executor --insecure --dockerfile=.fuseml/Dockerfile --context=./ --destination=${registry}/${repository}:${tag} --compressed-caching=${FUSEML_COMPRESSED_CACHING} $BUILDARGS
fi

printf ${destination} > /tekton/results/${TASK_RESULT}