mlflow-builder: fix OOM failures during build with bigger images

If the k8s node where the MLflow builder step is running has little memory available, the builder step will fail when it has to build larger images. For example, building the trainer image for the keras CIFAR10 codeset example resulted in an OOM failure on a node where only 8GB of memory was available.

This is a known kaniko issue [1], and a fix is available [2] in more recent (>=1.7.0) kaniko versions: disabling compressed caching via the `--compressed-caching` command line argument.

This commit adds a workflow input parameter that is mapped to this new command line argument. To avoid OOM errors with bigger images, the user may set it in the workflow like so:

```
  - name: builder
    image: ghcr.io/stefannica/mlflow-builder:latest
    inputs:
      - name: mlflow-codeset
        codeset:
          name: '{{ inputs.mlflow-codeset }}'
          path: /project
      - name: compressed_caching
        # Disable compressed caching to avoid running into OOM errors on cluster nodes with lower memory
        value: false
```

[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722
stefannica · Nov 15, 2021 · fc4ffa8
1 parent e7fb66d
commit fc4ffa8
Showing 2 changed files with 3 additions and 1 deletion.
diff --git a/images/builders/mlflow/Dockerfile b/images/builders/mlflow/Dockerfile
@@ -47,5 +47,7 @@ ENV FUSEML_MINICONDA_VERSION ""
 ENV FUSEML_INTEL_OPTIMIZED false
 ENV FUSEML_BASE_IMAGE ""
 ENV FUSEML_VERBOSE false
+# Set to false to reduce memory usage with larger images and avoid OOM problems
+ENV FUSEML_COMPRESSED_CACHING true
 
 ENTRYPOINT ["run"]
diff --git a/images/builders/mlflow/run.sh b/images/builders/mlflow/run.sh
@@ -87,7 +87,7 @@ else
     $FUSEML_VERBOSE && cat conda.yaml
     $FUSEML_VERBOSE && cat .fuseml/Dockerfile
 
-    /kaniko/executor --insecure --dockerfile=.fuseml/Dockerfile  --context=./ --destination=${registry}/${repository}:${tag} $BUILDARGS
+    /kaniko/executor --insecure --dockerfile=.fuseml/Dockerfile  --context=./ --destination=${registry}/${repository}:${tag} --compressed-caching=${FUSEML_COMPRESSED_CACHING} $BUILDARGS
 fi
 
 printf ${destination} > /tekton/results/${TASK_RESULT}
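
The change above passes `FUSEML_COMPRESSED_CACHING` to kaniko unchecked, so a mistyped workflow input would only surface as a kaniko argument error. As a minimal sketch (not part of this commit), the value could be validated before the executor is called. The helper name `compressed_caching_flag` is hypothetical, and the sketch assumes the only accepted values are `true` and `false`:

```shell
#!/bin/sh
# Hypothetical helper, not part of the commit: build the kaniko
# --compressed-caching flag from a workflow input value and reject
# anything other than "true" or "false".
compressed_caching_flag() {
    # Fall back to "true" when the value is unset or empty,
    # mirroring the ENV default set in the Dockerfile.
    value="${1:-true}"
    case "$value" in
        true|false)
            printf '%s' "--compressed-caching=${value}"
            ;;
        *)
            echo "invalid compressed_caching value: ${value}" >&2
            return 1
            ;;
    esac
}
```

In `run.sh` this could be used as `flag=$(compressed_caching_flag "$FUSEML_COMPRESSED_CACHING") || exit 1`, with `$flag` then passed to `/kaniko/executor` in place of the inline `--compressed-caching=${FUSEML_COMPRESSED_CACHING}` argument.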