
3.6.2 M4: stats DB RAM use grows over the course of a few hours #185

Closed
michaelklishin opened this issue Apr 20, 2016 · 5 comments

@michaelklishin
Member

Moved from rabbitmq/rabbitmq-server#761:

I have encountered an issue where a two-node PCF RabbitMQ (3.6.1.904, Erlang 18.1) cluster in AWS under load has its queues blocked due to high memory usage on the node hosting the stats DB. The screenshot below illustrates this.

(screenshot: memory usage on the stats node, 2016-04-19)

Configuration

There are 2 servers with all queues mirrored and the following modifications to stats collection applied:

[
  {rabbit, [{collect_statistics, none},
            {collect_statistics_interval, 60000}]},
  {rabbitmq_management, [{rates_mode, none}]}
].
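As a minimal check (a sketch, not part of the report), the effective values can be confirmed at runtime from rabbitmqctl eval or an Erlang remote shell; the application and key names are the ones used in the config above:

%% each call returns {ok, Value} if the key is set, otherwise undefined
application:get_env(rabbit, collect_statistics).
application:get_env(rabbit, collect_statistics_interval).
application:get_env(rabbitmq_management, rates_mode).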

I used PerfTest to place the cluster under load using the following: ./runjava.sh -Xms2G com.rabbitmq.examples.PerfTest -h amqp://user:password@192.168.0.10/vhost_name -r 1 -R 1000 -x 10 -y 500

The intention of the above was to weight in favor of the consumers to promote throughput of messages.

Issue Description

After about an hour of running under this load, memory use on the node hosting the stats DB steadily increases until it hits the 6 GB high memory watermark and starts to block the producers. This is an issue because while the node with the stats DB is using over 6 GB of memory, the other node still has plenty of headroom and is only consuming 3 GB.

The producers remain blocked until the stats DB (my assumption) processes the accumulated stats; the memory is then freed and publishing continues as expected.
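One rough way to see which processes hold the memory on the stats node is to sort live processes by memory from a remote shell (a best-effort sketch; recon:proc_count(memory, 10) does the same if the recon library is available):

%% top 10 processes by memory; processes that exit mid-scan are skipped
lists:sublist(
    lists:reverse(
        lists:keysort(2,
            [{P, M} || P <- erlang:processes(),
                       {memory, M} <- [erlang:process_info(P, memory)]])),
    10).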

@dcorbacho
Contributor

I can reproduce it from the stable branch. The problem is not in the data store but in the event collectors.

The message queues of the event collectors appear empty when queried with erlang:process_info, yet the memory consumption of rabbit_mgmt_channel_stats_collector alone reaches 2.4 GB. Forcing garbage collection does not reclaim it.
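For reference, the inspection described above amounts to something like the following from a remote shell (a sketch; it assumes the collector is registered under the name mentioned above):

Pid = whereis(rabbit_mgmt_channel_stats_collector),
erlang:process_info(Pid, [memory, message_queue_len, heap_size, total_heap_size]),
%% force a GC on the collector; it did not reclaim the memory in this case
erlang:garbage_collect(Pid).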

I crashed the node by requesting the process state with sys:get_state; the call seems to have reached the process at a moment when a large burst of events was in its queue and the VM could not allocate enough memory. Investigation ongoing.
(screenshots: process memory inspection, 2016-04-21)

@dcorbacho
Contributor

Processing the stats causes a massive number of reductions, mainly from the two functions called here: https://github.com/rabbitmq/rabbitmq-management/blob/master/src/rabbit_mgmt_event_collector_utils.erl#L276, whether they are implemented as recursive functions or as list comprehensions (the previous, 3.6.0, implementation).
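A simple way to observe the reduction cost of a collector is to sample its reduction counter twice (an illustrative sketch, not part of the fix):

Pid = whereis(rabbit_mgmt_channel_stats_collector),
{reductions, R1} = erlang:process_info(Pid, reductions),
timer:sleep(1000),
{reductions, R2} = erlang:process_info(Pid, reductions),
%% reductions performed by the collector over roughly one second
R2 - R1.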

The new event collectors introduced in #41 were no longer set to high priority, so scheduling switches happened very often and caused the memory build-up. With the priority set back to high, memory usage stays stable.
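The kind of change described here looks roughly like the following in a gen_server-based collector (a fragment sketched under that assumption, not the actual diff):

init(Args) ->
    %% run the collector at high scheduling priority so it keeps up with
    %% the event emitters instead of accumulating a backlog
    process_flag(priority, high),
    {ok, Args}.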

I also reduced the buffer size for channel and queue stats before they start being dropped, since having three collectors can potentially result in a much larger combined message queue.
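The drop policy being described is roughly: once the collector's backlog exceeds a configured limit, incoming fine-grained stats are dropped rather than processed. A sketch (names are illustrative, not the actual implementation; handle_event/1 is a hypothetical handler):

maybe_handle_event(Event, MaxBacklog) ->
    {message_queue_len, Len} =
        erlang:process_info(self(), message_queue_len),
    case Len > MaxBacklog of
        true  -> dropped;              %% shed load under pressure
        false -> handle_event(Event)   %% hypothetical handler
    end.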

@y123456yz

y123456yz commented Sep 6, 2016

I have the same problem: when I run a short-lived connection stress test, some nodes exit.

@michaelklishin
Member Author

Please post questions to rabbitmq-users or Stack Overflow. RabbitMQ uses GitHub issues for specific actionable items engineers can work on, not questions. Thank you.

@rabbitmq locked and limited conversation to collaborators Sep 6, 2016
@michaelklishin
Member Author

This issue has been fundamentally addressed in #236 (and shipped in 3.6.7). The guide on Memory Usage has been significantly expanded since April 2016. Please upgrade to at least 3.6.15 and use the tools described in the guide to collect relevant data.
