kaniko build using too much memory #909

Open
jyipks opened this issue Dec 11, 2019 · 35 comments
Labels
area/performance, categorized, differs-from-docker, issue/oom, kind/question, priority/p1, regression/v1.2.0, regression, works-with-docker

Comments

@jyipks

jyipks commented Dec 11, 2019

I am building a rather large Docker image; the end size is ~8GB. It builds fine in DinD, but we would like to use kaniko. The kaniko pod running the Dockerfile balloons in memory usage and gets killed by Kubernetes. How can I make kaniko work for me, or am I stuck with DinD?

Please help, thank you

@cvgw added the kind/question and area/performance labels Dec 21, 2019
@tejal29
Member

tejal29 commented Jan 10, 2020

/cc @priyawadhwa Can we provide users anything to measure the memory usage?

@jyipks Can you tell us whether you have set resource limits in the kaniko pod spec? Also, please tell us your cluster specification.

@priyawadhwa
Collaborator

@tejal29 @jyipks the only thing I can think of is upping the resource limits on the pod as well
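
For anyone unsure where those limits go, here is a minimal sketch of a kaniko pod spec with explicit memory requests and limits (the pod name, destination registry, and the actual values are illustrative assumptions, not recommendations from this thread):

```
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build                     # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --dockerfile=Dockerfile
        - --context=dir:///workspace
        - --destination=registry.example.com/my-image:latest   # placeholder registry/image
      volumeMounts:
        - name: workspace
          mountPath: /workspace
      resources:
        requests:
          memory: "4Gi"                  # illustrative values; size them to your build
          cpu: "1"
        limits:
          memory: "8Gi"                  # raise this if the pod is OOM-killed during snapshotting
          cpu: "2"
  volumes:
    - name: workspace
      emptyDir: {}
```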

@jyipks
Author

jyipks commented Jan 14, 2020

I had no resource limits on the kaniko pods. This was on a 3-node cluster with 4 cores and 16GB each. From Grafana, I believe the pod attempted to use more than 15GB.
I was building a custom jupyter-notebook image that normally comes out to ~8GB when built via docker build.

@jyipks
Author

jyipks commented Jan 21, 2020

Does kaniko keep everything in memory as it's building the image, or does it write to a temp directory? If it writes to a temp directory, can you please tell me where it is?

thanks

@mamoit

mamoit commented Jul 28, 2020

This sounds like #862.
@jyipks do you remember if you were using the --reproducible flag?

@jyipks
Author

jyipks commented Jul 28, 2020

No, I've never used that flag before.

@rvaidya

rvaidya commented Mar 11, 2021

This also happens when trying to do an npm install - I also have never used that flag before.

@max107

max107 commented Mar 12, 2021

Same problem.

@tarakanof

Same problem on a GitLab runner: latest Debian with latest Docker. Building a 12MB Docker image uses 15GB-35GB of memory.

@fknop

fknop commented Mar 26, 2021

We're facing the same issue with a GitLab CI custom runner.
We're building a Docker image for Node; the build started hanging on webpack every time, and the machine ends up running out of memory and crashing.
It used to work fine without any issue.
Our Docker image is a little less than 300MB and our machine has 8GB of RAM.

@meseta

meseta commented Mar 30, 2021

Similar issue on GitLab CI on GKE. We're building a Python image based on the official python base image; it consumes about 12GB of RAM.

@jamil-s

jamil-s commented Apr 19, 2021

We're seeing similar issues with Gradle builds as well.

@nichoio

nichoio commented Apr 20, 2021

Would also like to learn more about this. Kaniko doesn't have a feature equivalent to docker build --memory, does it?

@suprememoocow

We're seeing similar issues too. For example, this job failed with OOM: https://gitlab.com/gitlab-com/gl-infra/tamland/-/jobs/1405946307

The job includes some stacktrace information, which may help in diagnosing the problem.

The parameters that we were using, including --snapshotMode=redo, are here: https://gitlab.com/gitlab-com/gl-infra/tamland/-/commit/0b399381d30655059ec78461640674af7562c708#587d266bb27a4dc3022bbed44dfa19849df3044c_116_125
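
For context, a GitLab CI job wiring kaniko up with the redo snapshot mode generally looks something like the sketch below. This is an illustrative reconstruction rather than the actual job from the linked repository, and it omits registry credential setup (the /kaniko/.docker/config.json step); the job name and destination tag are assumptions:

```
build-image:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug   # debug tag, so GitLab CI has a shell to run the script
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
      --snapshotMode=redo
```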

@mikesir87

I'm having the same problem as well. But in my case, it's a Java-based build and the Maven cache repo is included as an ignore-path. The number of changes that should occur outside of that is fairly minimal, yet I'm easily seeing 5+ GB of RAM being used where the build before that was using at most 1.2GB. We'd love to be able to use smaller instances for our builds.

@trallnag

I rolled back to 1.3.0 from 1.6.0 and now it seems to work again

@Phylu
Contributor

Phylu commented Oct 19, 2021

This should be closed in the 1.7.0 release as of #1722.

@s3f4

s3f4 commented Oct 23, 2021

I rolled back to 1.3.0 from 1.6.0 and now it seems to work again

1.7 has a gcloud credentials problem, rolling back to 1.3.0 worked.

@Exagone313

Do you know when the tag gcr.io/kaniko-project/executor:debug (as well as :latest) gets updated? It still points to the v1.6.0 version: https://console.cloud.google.com/gcr/images/kaniko-project/GLOBAL/executor

stefannica added a commit to stefannica/extensions that referenced this issue Nov 15, 2021
If the k8s node where the MLFlow builder step is running doesn't
have a lot of memory, the builder step will fail if it has to build
larger images. For example, building the trainer image for the keras
CIFAR10 codeset example resulted in an OOM failure on a node where
only 8GB of memory were available.

This is a known kaniko issue [1] and there's a fix available [2] with
more recent (>=1.7.0) kaniko versions: disabling the compressed
caching via the `--compressed-caching` command line argument.

This commit models a workflow input parameter mapped to this
new command line argument. To avoid OOM errors with bigger
images, the user may set it in the workflow like so:

```
  - name: builder
    image: ghcr.io/stefannica/mlflow-builder:latest
    inputs:
      - name: mlflow-codeset
        codeset:
          name: '{{ inputs.mlflow-codeset }}'
          path: /project
      - name: compressed_caching
        # Disable compressed caching to avoid running into OOM errors on cluster nodes with lower memory
        value: false
```

[1] GoogleContainerTools/kaniko#909
[2] GoogleContainerTools/kaniko#1722
@Zachu

Zachu commented Jan 20, 2022

I was also experiencing memory issues in the last part of the image build with v1.7.0.

INFO[0380] Taking snapshot of full filesystem...        
Killed

I tried all kinds of combinations: --compressed-caching=false, removing the --reproducible flag, downgrading to v1.3.0, and so on. I finally got the build to pass by using the --use-new-run flag.

--use-new-run

Use the experimental run implementation for detecting changes without requiring file system snapshots. In some cases, this may improve build performance by 75%.

So I guess you should put that into your toolbox while banging your head against the wall :)
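
Expressed as executor arguments, that workaround amounts to the fragment below (a sketch only; the context and destination values are placeholders):

```
# fragment of a kaniko container spec -- only the last two flags are the point here
args:
  - --context=dir:///workspace                      # placeholder build context
  - --destination=registry.example.com/app:latest   # placeholder destination
  - --compressed-caching=false   # available from v1.7.0; disables compressed caching to reduce memory use
  - --use-new-run                # experimental run change detection, avoids full filesystem snapshots
```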

@Idok-viber

Idok-viber commented Oct 13, 2022

Also got this issue when building with v1.9.1.

INFO[0133] Taking snapshot of full filesystem...        
Killed

Reverted back to v1.3.0 and it works.

@cforce

cforce commented Nov 7, 2022

I am using 1.9.0 and it seems to eat quite a lot of memory. With or without --compressed-caching=false and --use-new-run, the same issue occurs sporadically:
"The node was low on resource: memory. Container build was using 5384444Ki, which exceeds its request of 0. Container helper was using 24720Ki, which exceeds its request of 0.
"The node was low on resource: memory. Container helper was using 9704Ki, which exceeds its request of 0. Container build was using 6871272Ki, which exceeds its request of 0."

7GB to build a simple image? The memory consumption is ridiculous. Why does the same build with standard Docker just work with 40x less memory requested?

@gaatjeniksaan

gaatjeniksaan commented Mar 23, 2023

Reiterating what I stated in #2275 as well:

We're having this issue as well with 1.9.1-debug. The end size of the image should be ~9GB, but the kaniko build (on GKE) fails due to hitting the memory limit. See the attached screenshot to share in my agony.
[attached screenshot: memory usage during the build]

@tamer-hassan

tamer-hassan commented Apr 3, 2023

Had this issue with kaniko v1.8.0-debug, and also tried v1.3.0-debug with the same result: the pod was killed or evicted due to memory pressure on the (previously idle) node. This was the case when building an image nearly 2.5GB large, with the --cache=true flag.

Solution for me was to use v1.9.2-debug with the following options:
--cache=true --compressed-caching=false --use-new-run --cleanup

Further advice (from research of other previous issues):
DO NOT use the flags --single-snapshot or --cache-copy-layers
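
Put together, the combination reported above would look roughly like this container fragment (a hedged sketch; the context, Dockerfile path, and destination are placeholder assumptions):

```
containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:v1.9.2-debug
    args:
      - --context=dir:///workspace                      # placeholder build context
      - --dockerfile=/workspace/Dockerfile              # placeholder Dockerfile path
      - --destination=registry.example.com/app:latest   # placeholder destination
      - --cache=true                 # enable layer caching
      - --compressed-caching=false   # disable compressed caching to lower peak memory
      - --use-new-run                # experimental run change detection
      - --cleanup                    # clean the filesystem at the end of the build
      # per the advice above, do not add --single-snapshot or --cache-copy-layers
```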

@codezart

codezart commented May 3, 2023

I've got the same issue. In my case, I'm using a git context, and the clone itself takes 10Gi+ and gets killed before the build even starts on the latest versions. I tried a node with more than 16Gi and it worked 1 out of 3 times.

@cforce

cforce commented May 6, 2023

Kaniko feels dead; I propose switching to Podman.

@jonaskello

We have the same problem; we get this in GitLab CI:

INFO[0172] Taking snapshot of full filesystem...        
Killed
Cleaning up project directory and file based variables
ERROR: Job failed: command terminated with exit code 137

@aaron-prindle added the priority/p1 label Jun 8, 2023
@starkmatt

starkmatt commented Jun 13, 2023

Solution for me was to use v1.9.2-debug with the following options:
--cache=true --compressed-caching=false --use-new-run --cleanup

This worked for me, thank you very much.

FYI for anyone else running into this.

@zzzinho

zzzinho commented Jul 10, 2023

I have the same problem in v1.12.1-debug

INFO[0206] Taking snapshot of full filesystem...        
Killed

@ricardojdsilva87

ricardojdsilva87 commented Jul 12, 2023

Hello everyone, just to give my input: here are some CPU/RAM metrics with different kaniko versions.

Just to clarify, the container where the build runs uses GitHub Actions hosted runners with 2 cores and 4GB of RAM.

Picture 1 - kaniko 1.9.2-debug with cache enabled --> Push failed with message "Killed"

Picture 2 - kaniko 1.9.2-debug with cache enabled and these settings: --compressed-caching=false --use-new-run --cleanup --> Push failed with message "Killed"

Picture 3 - kaniko 1.12.1-debug with cache enabled and these settings: --compressed-caching=false --use-new-run --cleanup --> Push failed with message "Killed"

Picture 4 - kaniko 1.3.0-debug with cache enabled (the flag --compressed-caching is not supported in this version) --> Push WORKS

The resulting image is around 500MB, and the container uses around 1 core and less than the memory limit of the container (4GB). This build works if we increase the memory limit to 16GB, which is overkill and a waste of resources. The jobs that are killed are in fact using almost half the memory (~2GB) of the job that was successful (3GB).

I would say that something broke in kaniko after version 1.3.0; even with all the flags set, the builds do not work, although the memory usage is way less than with v1.3.0. (Update: the builds started to fail from version v1.9.1.)

Thanks for your help

UPDATE
Also tested other, older kaniko versions (metrics screenshots omitted):

with kaniko 1.5.2-debug with cache enabled
with kaniko 1.5.2-debug with cache enabled
with kaniko 1.6.0-debug with cache enabled
with kaniko 1.8.1-debug with cache enabled
with kaniko 1.9.0-debug with cache enabled

Starting with kaniko v1.9.1, the builds started to fail.

@droslean

Same here. My build process takes around 1.5-1.8GB of memory to build, but when I run the Dockerfile via kaniko it needs 5GB, which is absurd!

Is there any solution here?

@cforce

cforce commented Aug 29, 2023

I encourage you to use Podman.

@droslean

@aaron-prindle Any ideas?

@timwsuqld

I can confirm that 1.3.0 works for us (with --force, as we have v2 cgroups), while 1.14.0 fails. I've not tested every version in between. The final image size is 980MB, and the build machine has 4GB of RAM.

@ensc

ensc commented Jun 13, 2024

Same here with kaniko version v1.23.0.

Snapshotting itself works, but when sending results to the registry (GitLab), the executor gets killed:

kernel: Out of memory: Killed process 9503 (executor) total-vm:53191912kB, anon-rss:31445632kB, file-rss:128kB, shmem-rss:0kB, UID:0 pgtables:63580kB oom_score_adj:0

The results on the registry are around 6GB; the extracted filesystem takes around 14GB. The last words are:

$ . /opt/sdk/environment-setup-cortexa9t2hf-neon-oe-linux-gnueabi
INFO[0342] Taking snapshot of full filesystem...        
INFO[0561] USER	build-user:build-user                   
INFO[0561] Cmd: USER                                    

RSS immediately after this output is around 700MB and then quickly increases.
