
High memory consumption with v1.25.2 #3443

Open · smartaquarius10 opened this issue Jan 31, 2023 · 152 comments

smartaquarius10 commented Jan 31, 2023

Team,

Since the day I updated AKS to v1.25.2, I have seen huge memory spikes and node memory pressure issues.

Pods are being evicted and nodes are constantly at 135 to 140% memory usage. While I was on 1.24.9, everything worked fine.

Just now I saw that portal.azure.com has removed v1.25.2 from the Create new --> Azure Kubernetes cluster section. Does this version of AKS have a known problem? Should we switch to v1.25.4 immediately to resolve the memory issue?

I have also observed that AKS 1.24.x used Ubuntu 18.04 but AKS 1.25.x uses Ubuntu 22.04. Is this the reason behind the high memory consumption?

Kindly suggest.

Regards,
Tanul


My AKS configuration: 8 nodes of Standard B2s size, as it's a non-prod environment.
Pod structure: below are the pods running inside the cluster and their memory consumption, excluding the default Microsoft pods (which take 4705 Mi of memory in total); a sketch of how such figures can be gathered follows the list.

  • Daemon set of AAD pod identity: 191 Mi of memory in total
  • 2 pods of Kong: 914 Mi of memory in total
  • Daemon set of Twistlock vulnerability scanner: 1276 Mi of memory in total
  • 10 pods of our .NET microservices: 820 Mi of memory in total
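(For reference, a minimal sketch of how per-pod figures like these can be gathered; it assumes the metrics-server that AKS ships by default and is not part of the original report.)

# List pod memory usage across all namespaces, heaviest first
kubectl top pods --all-namespaces --sort-by=memory
# Node-level view for comparison
kubectl top nodes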
xuanra commented Feb 1, 2023

Hello,
We have the same problem with version 1.25.4 in our company's AKS.

We are trying to upgrade an app to OpenJDK 17 to check whether this new LTS Java version mitigates the problem.

Edit: In our case, the .NET apps needed a change of the NuGet package for Application Insights.

Regards,

smartaquarius10 (Author)

@xuanra , my major pain point is these 2 pods out of the 9:

  • ama-logs
  • ama-logs-rs

They always take more than 400 Mi of memory. It's very difficult to accommodate them on B2s nodes.

My other pain point is these 16 pods (8 of each):

  • csi-azuredisk-node
  • csi-azurefile-node

They take 910 Mi of memory. I even raised a support ticket, but customer support was unable to figure out whether we are using them or not, and could not advise whether or why we should keep them.

Still looking for a better solution to handle the non-prod environment...

@lsavini-orienteed

Hello,
We are facing the same problem of memory spikes after moving from v1.23.5 to v1.25.4.
We had to increase the memory limits of most of our containers.

smartaquarius10 (Author) commented Feb 2, 2023

@miwithro @ritazh @Karishma-Tiwari-MSFT @CocoWang-wql @jackfrancis @mainred

Hello,

Extremely sorry for tagging you, but our whole non-prod environment is not working. We haven't upgraded our prod environment yet, but engineers are unable to work on their applications.

A few days back we approached customer support about the node performance issues but did not get a useful response.

Would be really grateful for help and support on this, as it seems to be a global problem.

smartaquarius10 (Author) commented Feb 2, 2023

I need to share one finding. I have just created 2 separate AKS clusters, one with v1.24.9 and one with v1.25.4, each with 1 node of Standard B2s.

These are the metrics. In the case of v1.25.4 there is a huge spike after enabling monitoring.

[screenshot]

@cedricfortin

We've got the same memory problem after upgrading AKS from version 1.24.6 to 1.25.4.

In the memory monitoring for the last month of one of our deployments, we can clearly see the memory usage increase after the update (01/23):
[screenshot]

xuanra commented Feb 3, 2023

Hello,
Our cluster has D4s_v3 machines.
Across all our Java and .NET pods, we still haven't found any pattern distinguishing the apps whose memory demand increased from those that didn't.
As an alternative to upgrading Java from 8 to 17, one of our providers suggested upgrading our VMs from D4s_v3 to D4s_v5, and we are studying the impact of this change.

Regards,

smartaquarius10 (Author) commented Feb 6, 2023

@xuanra , I think in that case B2s machines are totally out of the picture for this upgrade. The most they are capable of supporting is AKS 1.24.x.

@ganga1980

(Quoting @smartaquarius10's earlier comment above about the ama-logs, ama-logs-rs, and CSI node pods.)

Hi, @smartaquarius10 , thanks for the feedback. We have work planned to reduce the ama-logs agent memory footprint, and we will share the exact timelines and additional details of the improvements in early March. cc: @pfrcks

smartaquarius10 (Author) commented Feb 13, 2023

@ganga1980 @pfrcks

Thank you so much, Ganga. We are heavily impacted by this. Up to AKS 1.24.x we were running 3 environments within our cluster, but after upgrading to 1.25.x we are unable to manage even 1 environment.

Each environment has 11 pods.

Would be grateful for your support on this. I have already disabled the CSI pods as we are not using any storage. For now, should we disable the AMA monitoring pods as well?

If yes, then once your team resolves these issues, should we upgrade our AKS again to some specific version, or will Microsoft fix it from the backend in every AKS version?

Thank you

Kind Regards,
Tanul

smartaquarius10 (Author) commented Feb 24, 2023

Hello @ganga1980 @pfrcks ,

Hope you are doing well. By any chance, is it possible to speed up the process a little? Two of our environments (22 microservices in total) are down because of this.

Appreciate your help and support in this matter. Thank you. Have a great day.

Hello @xuanra @cedricfortin @lsavini-orienteed,
Did you find any workaround for this? Thanks :)

Kind Regards,
Tanul

@gonpinho

Hi @smartaquarius10, we updated the k8s version of AKS to 1.25.5 this week and started suffering from the same issue.

In our case, we identified a problem with the JRE version when dealing with cgroups v2. Here I share my findings:

Kubernetes cgroup v2 support reached GA in version 1.25, and with this change AKS moved the node OS from Ubuntu 18.04 to Ubuntu 22.04, which uses cgroups v2 by default.

The problem with our containerized apps was related to a limitation in JRE 11.0.14: that JRE had no cgroups v2 container awareness, which means the JVM could not respect the memory quotas defined in the deployment descriptor.

Oracle and OpenJDK addressed this by supporting cgroups v2 natively in JRE 17 and backporting the fix to JRE 15 and JRE 11.0.16+.

I've updated the base image to use a fixed JRE version (11.0.18) and the memory exhaustion was resolved.
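(Not from the original post: a minimal sketch for confirming the cgroup version on a node and the JVM's container awareness, under the assumptions described above.)

# On an Ubuntu 22.04 node: cgroups v2 mount /sys/fs/cgroup as cgroup2fs; v1 shows tmpfs
stat -fc %T /sys/fs/cgroup/
# Inside a Java container: print the JVM's container-detection log
# (needs a cgroups-v2-aware JRE, i.e. 17, or 11.0.16+/15 with the backported fix)
java -Xlog:os+container=info -version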

Regarding the AMA pods, I've compared the pods running on k8s 1.25.x with those running on 1.24.x, and in my opinion they look stable, as the memory footprint is practically the same.

Hope this helps!

smartaquarius10 (Author) commented Feb 24, 2023

@gonpinho , thanks a lot for sharing the details. But the problem is that our containerized apps are not taking extra memory; they still occupy the same amount as they did on 1.24.x.

What I realized is that when I create fresh 1.24.x and 1.25.x clusters, the default memory occupancy is approx. 30% higher on 1.25.x.

One of my environments, consisting of 11 pods, takes only 1 GB of memory. With AKS 1.24.x I was running 3 environments in total. The moment I shifted to 1.25.x, I had to disable 2 environments, along with the Microsoft CSI add-ons, just to accommodate the 11 custom pods, because node memory consumption is already high.

smartaquarius10 (Author)

@gonpinho , if I could downgrade the OS back to Ubuntu 18.04, that would be my first preference. I know the Ubuntu OS upgrade is what's killing these machines; no idea how to handle this.

pintmil commented Mar 2, 2023

Hi, we are facing the same problem after upgrading our dev AKS cluster from 1.23.12 to 1.25.5. Our company develops C/C++ and C# services, so we don't suffer from the JRE cgroup v2 issues. We see that memory usage is increasing over time, even though nothing but kube-system pods is running on the cluster.
The symptom is that kubectl top node shows much more memory consumption than free does on the host OS (Ubuntu 22.04). If we force the host OS to drop cached memory with the command sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches', the used memory doesn't change, but some of the buff/cache memory moves to free, and after that kubectl top node shows a memory usage drop on that node.
We came to the conclusion that k8s counts buff/cache memory as used memory, which is misleading, because Linux will use free memory to buffer IO and other things; that is completely normal operation.

kubectl top node before cache drop: [screenshot]

free before / after cache drop: [screenshot]

kubectl top node after cache drop: [screenshot]
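(A minimal sketch of the comparison described above, assuming shell access to the node, for example via kubectl debug; node and image names are placeholders.)

# From a workstation: record the reported usage before the test
kubectl top node <node-name>
# On the node (e.g. kubectl debug node/<node-name> -it --image=ubuntu, then chroot /host):
free -m                                            # note "used" vs "buff/cache"
sync && sh -c 'echo 1 > /proc/sys/vm/drop_caches'  # drop the page cache
free -m                                            # "buff/cache" shrinks, "used" barely moves
# From the workstation again: the working-set figure reported by metrics-server drops too
kubectl top node <node-name>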

@shiva-appani-hash

Team, we are seeing the same behaviour after upgrading the cluster from 1.23.12 to 1.25.5. All the microservices running in our clusters are .NET 3.1. On raising a support request, we got to know that the cgroup version has been changed to v2; does anyone have a similar scenario?
How do we identify whether cgroup v1 is used by .NET 3.1, and can it be the cause of the high memory consumption?
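(Not from the original posts: one way to check, from inside a running container, which cgroup version the node exposes. Pod and namespace names are placeholders; whether .NET Core 3.1 honours cgroup v2 limits should be verified against the .NET runtime release notes.)

# Exec into one of the .NET pods
kubectl exec -it <pod-name> -n <namespace> -- sh
# cgroup v1 exposes the memory limit here:
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null
# cgroup v2 exposes it here instead, and /sys/fs/cgroup is of type cgroup2fs:
cat /sys/fs/cgroup/memory.max 2>/dev/null
stat -fc %T /sys/fs/cgroup/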

smartaquarius10 (Author)

Hello @ganga1980, any update on this, please? Thank you.

@ganga1980

Hello @ganga1980, any update on this, please? Thank you.
@smartaquarius10 , we are working on rolling out our March agent release, which should bring down the memory usage of the ama-logs daemonset (Linux) by 80 to 100 MB. I don't have your cluster name or cluster resource ID to investigate, and we can't repro the issue you have reported. Please create a support ticket with the clusterResourceId details so that we can investigate.
As a workaround, you can try applying the default configmap with: kubectl apply -f https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml
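(A hedged sketch of applying that workaround; the configmap name and namespace are the documented Container insights defaults, so double-check them against the linked YAML.)

# See whether an agent configmap is already present
kubectl get configmap container-azm-ms-agentconfig -n kube-system
# If not, apply the default one referenced above
kubectl apply -f https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml
# Restart the agents so the new settings are picked up
kubectl rollout restart daemonset ama-logs -n kube-system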

smartaquarius10 (Author) commented Mar 7, 2023

@ganga1980 , thank you for the reply. Just a quick question: after raising the support ticket, should I send you a mail on your Microsoft ID with the support ticket details? Otherwise it will be assigned to L1 support, which will take a lot of time to reach a resolution.

Or, if you allow, I can send you my cluster details on MS Teams.

Whatever you prefer 😃

Currently, the ama pods are taking approx. 326 Mi of memory per node.

smartaquarius10 (Author) commented Mar 7, 2023

@ganga1980, we already have this configmap.

andyzhangx (Contributor)

@ganga1980 regarding the CSI driver resource usage: if you don't need the CSI drivers, you can disable them by following https://learn.microsoft.com/en-us/azure/aks/csi-storage-drivers#disable-csi-storage-drivers-on-a-new-or-existing-cluster
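(For reference, the CLI form of that doc's steps; the flags are paraphrased from the linked page, so verify them there before running.)

# Disable the Azure Disk and Azure File CSI drivers and the snapshot controller
az aks update -n <cluster-name> -g <resource-group> --disable-disk-driver --disable-file-driver --disable-snapshot-controller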

@Marchelune

Hi! It seems we are facing the same issue in 1.25.5. We upgraded a few weeks ago (24.02) and the memory usage (container working set memory) jumped from the moment of the upgrade, according to the metrics tab:
[screenshot]

We are using Standard_B2s VMs, as this is an internal development cluster; CSI drivers are not enabled.
Has the issue been identified, or is it still under investigation?

@codigoespagueti

Same issue here after upgrading to 1.25.5.
We are using FS2_v2 nodes and we were not able to get the working set memory below 100%, no matter how many nodes we added to the cluster.

Very disappointing that all the memory on the node is used and reserved by Azure pods.

We had to disable Azure Insights in the cluster.

[screenshot]

ghost commented Mar 10, 2023

@vishiy, @saaror would you be able to assist?

Issue Details: (automated copy of the original issue description above)

Author: smartaquarius10
Labels: bug, azure/oms, addon/container-insights

Contributor

Issue needing attention of @Azure/aks-leads

@Oceanswave

We're seeing this as well, and the ama-metrics and ama-logs pods are hitting their AKS-configured memory limits, getting terminated, and restarting.

We've got 4,800+ entries of ama-metrics-operator-* terminations in the past week. Any advice or recommendations here would be useful.

Contributor

Issue needing attention of @Azure/aks-leads

smartaquarius10 (Author)

Closing this issue.

@Marchelune

I'm sorry, I may have missed a development in this issue, but is the high memory consumption reporting problem resolved now?

mhkolk commented Apr 23, 2024

I'm sorry, I may have missed a development in this issue, but is the high memory consumption reporting problem resolved now?

The problem is not resolved; we are seeing this issue (high memory consumption from the kube-system ama-metrics pods on our nodes) as we speak, even after disabling metrics the way @marekr described.

[screenshot]

@marcindulak

Please reopen @tanulbh - people piggybacked on your issue report.

@gyorireka

@smartaquarius10 could you please update us?

smartaquarius10 (Author)

@marcindulak @gyorireka Sure, reopening the issue.

deyanp commented Jun 5, 2024

I am seeing this on a 3-node AKS cluster:

NAME                                            CPU(cores)   MEMORY(bytes)   
ama-logs-4vmcz                                  4m           185Mi           
ama-logs-9f4r9                                  3m           199Mi           
ama-logs-jc7cr                                  3m           198Mi           
ama-logs-rs-794b9b5b76-k5nr7                    7m           250Mi           
ama-metrics-5bf4d7dcc8-sg6cq                    14m          215Mi           
ama-metrics-ksm-d9c6f475b-bf94k                 2m           40Mi            
ama-metrics-node-kcph9                          9m           269Mi           
ama-metrics-node-r6c4v                          12m          212Mi           
ama-metrics-node-s8j8l                          12m          204Mi           
ama-metrics-operator-targets-7c4bf58f46-7c64j   1m           38Mi   

and 200-300 Mi multiplied across all these pods is too much overall, just for pushing logs or metrics out...

Contributor

Issue needing attention of @Azure/aks-leads

smartaquarius10 (Author)

I think now we cannot use B series machines with AKS.

Contributor

Issue needing attention of @Azure/aks-leads

@EvertonSA

In case it helps, we are on 1.30.0 running mostly CBL-Mariner nodes.

This is a dev cluster; not much is happening here, as we are using another log solution (Grafana Loki).

It seems the image used is mcr.microsoft.com/azuremonitor/containerinsights/ciprod:3.1.22.

OpenTelemetry is not enabled.

[screenshot]

Contributor

Issue needing attention of @Azure/aks-leads

monotek commented Aug 12, 2024

Seems back to normal with the 1.29 update.

Contributor

Issue needing attention of @Azure/aks-leads

Contributor

@ganga1980, @saaror would you be able to assist?

brgrz commented Sep 25, 2024

(Quoting @EvertonSA's comment above about AMA pod memory usage on their 1.30.0 CBL-Mariner dev cluster.)

I can 100% confirm these numbers and they are insane. Dozens of AMA pods consuming literally GBs of memory (on 8 and 16 GB node VMs), because of which we've had constant memory pressure on our nodes.

No solution from MS, not even from our dedicated support, so I had enough and did this:

Disable Managed Prometheus:

az aks update --disable-azure-monitor-metrics -n <<cluster-name>> -g <<resource-group>>

Disable Container insights:

az aks disable-addons -a monitoring -n <<cluster-name>> -g <<resource-group>>

Memory consumption on all nodes went down about 20%, and I just cut our Azure Log Analytics costs by a couple hundred euros per month. We'll deploy standalone Grafana and Loki instead of the managed solution.
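(Not from the original comment: a quick way to verify the agents are gone and to see the effect on the nodes.)

# The AMA pods should disappear from kube-system after disabling both features
kubectl get pods -n kube-system | grep -E 'ama-logs|ama-metrics'
# Compare node memory before and after
kubectl top nodes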

deyanp commented Sep 26, 2024

The ama-logs pods are clearly too memory- and CPU-hungry for no good reason (these are low-volume clusters). Probably written in .NET, and probably not well written. Compare this with other infrastructure pods written in Go that consume single-digit CPU and double-digit memory... saying this as a .NET developer...

beasteers commented Dec 11, 2024

Yeah, this is absurd. I thought we'd lighten the load on our cluster by moving Prometheus externally, but now it's impossible to schedule pods.

Is there any way to set lower requests/limits on the ama pods?

It's so funny how monitoring systems always break things. I don't get why an exporter would need so many resources...

Are there any ways to monitor without costing an arm and a leg in resource usage?
