High memory consumption with v1.25.2 #3443
Hello, we are trying to upgrade an app to OpenJDK 17 to check whether this new LTS Java version mitigates the problem. Edit: in our case, the .NET apps needed to change the NuGet package for Application Insights. Greets,
@xuanra, my major pain point is these 2 pods out of the 9 of them.
My other pain point is these 16 pods (8 of each).
They take 910 Mi of memory. I even raised a support ticket, but customer support was unable to figure out whether we are using them or not, and could not suggest when or why we should keep them. Still looking for a better solution to handle the non-prod environment...
@miwithro @ritazh @Karishma-Tiwari-MSFT @CocoWang-wql @jackfrancis @mainred Hello, extremely sorry for tagging you, but our whole non-prod environment is not working. We haven't upgraded our prod environment yet, but engineers are unable to work on their applications. A few days back we approached customer support about the node performance issues but did not get a good response. Would be really grateful for help and support on this, as it seems to be a global problem.
@xuanra, I think in that case B2s are totally out of the picture for this upgrade. The max they are capable of supporting is up to the 1.24.x version of AKS.
Thank you so much, Ganga. We are heavily impacted by this. Up to the 1.24.x version of AKS we were running 3 environments within our cluster, but after upgrading to 1.25.x we are unable to manage even 1 environment.
Would be grateful for your support on this. I have already disabled the CSI pods, as we are not using any storage. For now, should we disable these AMA monitoring pods as well? If yes, then once your team resolves these issues, should we upgrade our AKS again to some specific version, or will Microsoft resolve it from the backend in every version of the AKS infra? Thank you. Kind Regards,
Hello @ganga1980 @pfrcks, hope you are doing well. By any chance, is it possible to speed up the process a little? Actually, 2 of our environments (which is 22 microservices) are down because of this. Appreciate your help and support in this matter. Thank you. Have a great day. Hello @xuanra @cedricfortin @lsavini-orienteed, Kind Regards,
Hi @smartaquarius10, we updated the k8s version of AKS to 1.25.5 this week and started suffering from the same issue. In our case, we identified a problem with the JRE version when dealing with cgroups v2. Here I share my findings: Kubernetes cgroups v2 support reached GA in version 1.25.x, and with this change AKS moved the node OS from Ubuntu 18.04 to Ubuntu 22.04, which uses cgroups v2 by default. The problem with our containerized apps was caused by a bug in JRE 11.0.14: that JRE didn't have cgroups v2 container awareness, which means the containers were not able to respect the memory quotas imposed in the deployment descriptor. Oracle and OpenJDK addressed this by supporting cgroups v2 natively in JRE 17 and backporting the fix to JRE 15 and JRE 11.0.16+. I've updated the base image to use a fixed JRE version (11.0.18) and the memory exhaustion was solved. Regarding the AMA pods, I've compared the pods running on k8s 1.25.x with the pods running on 1.24.x, and in my opinion they seem stable, as the memory footprint is literally the same. Hope this helps!
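If it helps anyone else triage this, here is a minimal sketch of the two checks I would run inside an affected container (assumes a Linux image with JDK 11+; nothing here is specific to any one cluster):

```bash
# Which cgroup filesystem does the container see?
# "cgroup2fs" means cgroups v2 (Ubuntu 22.04 nodes), "tmpfs" means v1.
stat -fc %T /sys/fs/cgroup/

# What limits did the JVM actually detect? (Linux, JDK 11+.)
# On a JRE without cgroups v2 awareness, the reported Memory Limit is
# the node's total RAM instead of the pod's limit, so the default heap
# is sized far too large.
java -XshowSettings:system -version
```

On a patched JRE (11.0.16+, 15, or 17) the reported memory limit should match the pod's `resources.limits.memory`.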
@gonpinho, thanks a lot for sharing the details. But the problem is that our containerized apps are not taking extra memory; they are still occupying the same as they were before with 1.24.x. What I realized is that I created fresh 1.24.x and 1.25.x clusters, and by default memory occupancy is approx. 30% higher in 1.25.x. My one environment takes only 1 GB of memory across 11 pods. With AKS 1.24.x I was running 3 environments in total. The moment I shifted to 1.25.x, I had to disable 2 environments, along with the Microsoft CSI add-ons, just to accommodate the 11 custom pods, because the node memory consumption is already high.
@gonpinho, if by any chance I could downgrade the OS back to Ubuntu 18.04, that would be my first preference. I know the Ubuntu OS upgrade is what is killing the machines. No idea how to handle this.
Team, we are seeing the same behaviour after upgrading the cluster from 1.23.12 to 1.25.5. All the microservices running in the clusters are .NET Core 3.1. On raising a support request, we got to know that the cgroup version has been changed to v2. Does anyone have a similar scenario?
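.NET Core 3.1 predates full cgroups v2 awareness, so it can hit the same class of problem as the JRE bug described above (my interpretation, not confirmed in this thread). A quick way to see which cgroup layout a pod is running under is to check where the memory limit file lives:

```bash
# cgroups v1 (Ubuntu 18.04 nodes): the pod's limit is exposed here
cat /sys/fs/cgroup/memory/memory.limit_in_bytes

# cgroups v2 (Ubuntu 22.04 nodes): the same limit moved here; a runtime
# that only knows the v1 path falls back to sizing against node RAM
cat /sys/fs/cgroup/memory.max
```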
Hello @ganga1980, any update on this, please? Thank you
@ganga1980, thank you for the reply. Just a quick question: after raising the support ticket, should I send a mail to your Microsoft ID with the details of the support ticket? Otherwise it will be assigned to L1 support, which will take a lot of time to reach a resolution. Or else, if you allow, I can ping you my cluster details on MS Teams. Whichever way you like 😃
@ganga1980, We already have this config map
@ganga1980 regarding the CSI driver resource usage: if you don't need the CSI drivers, you can disable them by following https://learn.microsoft.com/en-us/azure/aks/csi-storage-drivers#disable-csi-storage-drivers-on-a-new-or-existing-cluster
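For reference, the linked page boils down to a single `az aks update` call, roughly like this (cluster and resource group names are placeholders; only do this if no workloads use Azure Disk or Azure Files persistent volumes):

```bash
# Disable the CSI storage drivers and the snapshot controller
# on an existing cluster, per the linked AKS doc
az aks update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --disable-disk-driver \
  --disable-file-driver \
  --disable-snapshot-controller
```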
Same issue here after upgrading to 1.25.5. Very disappointing that all the memory in the node is used and reserved by Azure pods. We had to disable Azure Insights in the cluster.
@vishiy, @saaror would you be able to assist?
Issue needing attention of @Azure/aks-leads
We're seeing this as well: the ama-metrics and ama-logs pods are hitting their AKS-configured memory limits, getting terminated, and restarting. We've got 4,800+ entries of ama-metrics-operator-* terminations in the past week. Any advice or recommendations here would be useful.
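A quick sketch for confirming that the restarts are OOM kills rather than ordinary crashes (the pod name is a placeholder):

```bash
# Restart counts for the Azure Monitor agent pods
kubectl get pods -n kube-system | grep ama-

# "Last State: Terminated, Reason: OOMKilled" confirms the pod hit its
# memory limit rather than failing for some other reason
kubectl describe pod -n kube-system ama-metrics-operator-xxxxx | grep -A 5 "Last State"
```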
Issue needing attention of @Azure/aks-leads
Closing this issue.
I'm sorry, I may have missed a development in this issue, but is the high memory consumption problem resolved now?
The problem is not resolved. We are seeing this issue, high memory consumption from the ama-metrics pods in kube-system, as we speak, even after disabling metrics the way @marekr described.
Please reopen @tanulbh - people piggybacked on your issue report.
@smartaquarius10 could you please update us?
@marcindulak @gyorireka Sure. Reopening the issue.
I am seeing this on a 3-node AKS cluster:
and 200-300Mi multiplied across all the pods is too much as a whole just for pushing logs or metrics out...
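For anyone who wants to pull the same per-pod numbers on their own cluster, a sketch (assumes metrics-server, which AKS ships by default):

```bash
# Per-pod memory usage of the monitoring and storage agents, highest first
kubectl top pods -n kube-system --sort-by=memory | grep -E "ama-|csi-"
```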
Issue needing attention of @Azure/aks-leads
I think we cannot use B-series machines with AKS now.
Issue needing attention of @Azure/aks-leads
Issue needing attention of @Azure/aks-leads
Seems back to normal with the 1.29 update.
Issue needing attention of @Azure/aks-leads
@ganga1980, @saaror would you be able to assist?
100% can confirm these numbers, and they are insane. Dozens of AMA pods consuming literally GBs of memory (on 8 and 16 GB node VMs), because of which we've had constant memory pressure on our nodes. No solution from MS, not even from our dedicated support, so I had enough and did this: disabled Managed Prometheus (sketch below).
Memory consumption on all nodes went down 20%, and I just cut our Azure Log Analytics costs by a couple hundred euros per month. We'll deploy standalone Grafana and Loki instead of the managed solution.
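Assuming the disable step was done through the Azure CLI, the call looks roughly like this (cluster and resource group names are placeholders):

```bash
# Turn off the managed Prometheus (Azure Monitor metrics) add-on,
# which removes the ama-metrics pods from kube-system
az aks update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --disable-azure-monitor-metrics
```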
The ama-logs pods are clearly too memory- and CPU-hungry for absolutely no reason (low-volume clusters). Probably written in .NET, and probably not well written. Compare this with other infrastructure pods written in Go, consuming single-digit CPU and double-digit memory... saying this as a .NET developer...
Yeah, this is absurd. I thought we'd lighten the load on our cluster by moving Prometheus externally, but now it's impossible to schedule pods. Is there any way to set lower requests/limits on the ama pods? It's funny how monitoring systems always break things. I don't get why an exporter would need so many resources... Are there any ways to monitor without it costing an arm and a leg in resource usage?
Team,
Since the day I updated AKS to v1.25.2, I can see huge spikes and node memory pressure issues.
Pods are going into an evicted state, and the nodes are always consuming 135 to 140% of memory. While I was on 1.24.9, everything was working fine.
Just now, I saw that portal.azure.com has removed the v1.25.2 version from the Create new --> Azure Kubernetes cluster section. Does this version of AKS have a problem? Should we immediately switch to v1.25.4 to resolve the memory issue?
I have also observed that AKS 1.24.x used Ubuntu 18, but AKS 1.25.x uses Ubuntu 22. Is this the reason behind the high memory consumption? Kindly suggest.
Regards,
Tanul