mlflow-builder: fix OOM failures during build with bigger images #74

stefannica · 2021-11-15T21:32:41Z

If the k8s node running the MLflow builder step is low on memory,
the builder step fails when it has to build larger images. For
example, building the trainer image for the keras CIFAR10 codeset
example resulted in an OOM failure on a node where only 8GB of
memory was available.

This is a known kaniko issue [1], and more recent kaniko versions
(>=1.7.0) provide a fix [2]: compressed caching can be disabled via
the --compressed-caching command line argument.

This commit adds a workflow input parameter that maps to this
new command line argument. To avoid OOM errors with bigger
images, the user may set it in the workflow like so:

  - name: builder
    image: ghcr.io/stefannica/mlflow-builder:latest
    inputs:
      - name: mlflow-codeset
        codeset:
          name: '{{ inputs.mlflow-codeset }}'
          path: /project
      - name: compressed_caching
        # Disable compressed caching to avoid running into OOM errors on cluster nodes with lower memory
        value: false
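
The mapping from the workflow input to the kaniko flag could look roughly like the sketch below. This is an illustration only, not the builder's actual code: the helper name `kaniko_caching_args` is hypothetical, and the assumption that the input reaches the builder as a string such as "true" or "false" is not confirmed by this PR. Only the `--compressed-caching` flag itself comes from kaniko.

```python
from typing import List


def kaniko_caching_args(compressed_caching: str = "true") -> List[str]:
    """Translate a workflow input value into a kaniko executor argument.

    Hypothetical helper: the real builder may read and forward the
    input differently. kaniko enables compressed caching by default,
    so "true" is used as the default here as well.
    """
    value = compressed_caching.strip().lower()
    if value not in ("true", "false"):
        raise ValueError(
            f"compressed_caching must be 'true' or 'false', got {compressed_caching!r}"
        )
    # kaniko (>=1.7.0) accepts the flag in --compressed-caching=<bool> form
    return [f"--compressed-caching={value}"]


if __name__ == "__main__":
    # With `value: false` in the workflow, the builder would append this
    # argument to the kaniko executor command line.
    print(kaniko_caching_args("false"))
```

Rejecting anything other than "true" or "false" keeps a typo in the workflow from silently leaving compressed caching enabled on a low-memory node.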

[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722


stefannica requested review from jsuchome and flaviodsr November 15, 2021 21:32

jsuchome approved these changes Nov 16, 2021


flaviodsr approved these changes Nov 16, 2021


stefannica merged commit 7be75c8 into fuseml:main Nov 16, 2021

stefannica deleted the fix-builder-oom branch November 16, 2021 09:23
