Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise how rdq files are scanned at CQ shared message store recovery startup #11072

Closed
gomoripeti opened this issue Apr 24, 2024 · 21 comments
Closed
Milestone

Comments

@gomoripeti
Copy link
Contributor

Is your feature request related to a problem? Please describe.

The symptom is that after an unclean shutdown when RabbitMQ starts up and runs recovery steps, its memory usage spikes to a seemingly unreasonable way. On smaller servers with let's say only 2GB of memory it causes an OOM (cyclic restarts and the node cannot start up). We have good reason to believe that it is the recovery of the classic queue shared storage, when it reads in all the rdq files, that is responsible for this behaviour. Although there are large rdq files, the total message size of "ready" messages in queues is 1 or 2 orders less than the memory used.

This issue was observer since years from time to time including on 3.12.x

Describe the solution you'd like

I would like to get early feedback from the Core Team if it is worth investigating (we definitely see this issue often enough) and if contributions would be accepted.

The function rabbit_msg_store:scan_file_for_valid_messages is used to get the MsgIds and their offsets in the file, but at the same time also the message bodies are read in which are thrown away. This might lead to extra large binaries which are not used and extra garbage.

  • First it needs to be investigated that indeed this function is the source of excessive memory usage.
  • If proven implement a scanning which does not concatenate binaries, and does not create large, unused binaries (instead skipping message bodies)

Part of the problem could be the presence of unreferenced messages in rdq files which are not yet removed from the file. When a server manages to start up after a successful recovery, we've seen the size of the rdq files shrunk. The GC of unreferenced entries changed considerably in 3.13 where the msg store uses a compaction mechanism instead of combining two files. But still there were seen issues (#10681), so I believe this change could be still relevant. The scan_file_for_valid_messages function is not just used at recovery time, but also during compaction and rdq deletion.

Describe alternatives you've considered

No response

Additional context

No response

@lhoguin
Copy link
Contributor

lhoguin commented Apr 24, 2024

I would expect the unused message to be a sub-binary, but perhaps it isn't. The function mirrors what was done before 3.13. We have not seen extra memory during testing as a result of compaction in any case.

Either way I am currently fixing crashes related to this so I will investigate at the same time.

@lhoguin
Copy link
Contributor

lhoguin commented Apr 24, 2024

Let me rephrase: we have no seen extra memory in 3.13.1. We did fix a bug in 3.13.0 that lead to compaction being very slow, using lots more memory than normal. But for 3.13.1 I wouldn't expect much extra memory.

@gomoripeti
Copy link
Contributor Author

thanks for the prompt response!

It is possible that the issue is only visible during startup/recovery, when all the rdq files are scanned. (So for combine and delete of one file at a time it would be an unnoticeable improvement)

In case of messages larger than 4MB, let's say for a 16MB message, the file is read in 4MB chunks and 4 of those chunks are concatenated to reach the end_marker. In this case it is not just sub-binaries. So maybe the issue manifests only when there are some large enough messages in the rdq files. (I know it is not good practice to have megabyte of message bodies but unfortunately we see them from time to time even on smaller servers)

I will gather some example rdq files and an exact reproduction with how large a memory spike really is.

@lhoguin
Copy link
Contributor

lhoguin commented Apr 24, 2024

We currently do not test with above 1MB messages so I would not be surprised if there were issues there. Looking forward to the data!

@gomoripeti
Copy link
Contributor Author

I have only some initial information

One incident about a week ago happened on a 3-node cluster on RabbitMQ 3.12.12 with 2 schedulers and ~1.8GB RAM. No info about rdq files unfortunately. (Only relevant because it is relatively new version, but not 3.13)
Another incident about 3 weeks ago where a 1-node cluster on RabbitMQ 3.11.8 with 2 schedulers and 1GB RAM went OOM. In its message store it had only 2 rdq files: a 57MB and 60MB. Both contained a couple of small messages and a roughly 55MB one.

I tried some reproductions locally on a recent main @ 57f9aec
(8 schedulers and 8 worker_pool_workers)

Steps

  • make -C <repo> run-broker
  • <repo>/sbin/rabbitmqctl stop_app
  • cd <rabbit_db:dir()>/msg_stores/vhosts/<vhost>/msg_store_persistent/
  • rm -f * to remove artefacts of a clean shutdown
  • for i in $(seq 1 8); do cp <large>.rdq $i.rdq; done I duplicated the 60MB file a few times
  • <repo>/sbin/rabbitmqctl start_app
  • I inspected in observer UI "binary_alloc Max Carrier size (kB)" (as the spike is so short sometimes it is not even visible on the memory graph)

Test cases

  • just copied the two files from the incident: Max Carrier size ~120MB - equivalent with the total size of the rdq files
  • 8 times the 60MB rdq file (total 476MB): Max Carrier size 956_448 kB - about double the total rdq size
  • 16 times the 60MB rdq file (total 953MB): Max Carrier size 1_735_600 kB - about double the total rdq size

So to sum up, on latest main I don't see that huge "exponential" or orders of magnitude higher memory consumption. (There is room for some improvement but not as huge as I expected)
Maybe the use of file handle cache that was removed in 3.13 made the situation worse. Maybe the very small size of the worker_pool matters (2 instead of 8).

I will continue investigating.

@gomoripeti
Copy link
Contributor Author

Now I have a reproduction, at least a definition of high memory consumption

Reproduction

  • Start RabbitMQ 3.13.2 on an instance with ~0.8GB RAM
  • Publish 1 large message 110 MiB
    java -jar perf-test.jar--producers 1 --consumers 0  --queue cq3 -ad false -f persistent true -h  $CAMQP -C 1  --size 110000000
    
    This creates a 0.rdq file with size 110004411 B
  • Force stop RabbitMQ
    sudo kill -9 $(pidof beam.smp)
    
  • Restart RabbitMQ - it runs out of memory during startup

After obtaining the rdq file a simpler reproductions is to just call the scanning function directly

# cd /var/lib/rabbitmq/mnesia/.../msg_store_persistent
# erl -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.13.2/plugins/rabbit-3.13.2/ebin/ -eval 'io:format("res ~p~n", [rabbit_msg_store:scan_file_for_valid_messages("0.rdq")]).' -noshell
Killed

In syslog I see that the RSS memory of the beam process is above 600MB

Out of memory: Killed process 511918 (beam.smp) total-vm:3194180kB, anon-rss:699512kB, file-rss:0kB, shmem-rss:256kB, UID:0 pgtables:2268kB oom_score_adj:0

I think what happens is that the 105MB message body is appended from 4MB pieces, and very often there is a reallocation, which leaves a lot of fragmentation behind. (I created a script that terminates the beam process when it goes above 500MB RSS and create a crash_dump. I don't see in that crash dump where that 600MB is allocated by the memory carriers.
erl_crash.dump.tar.gz
)

Some inspection details
  • inspected the byte size and referenced size of the appended binary during scanning with the following printf debugging

    modified   deps/rabbit/src/rabbit_msg_store.erl
    @@ -1421,6 +1421,11 @@ scan(Buffer, Fd, Offset, FileSize, MsgIdsFound, Acc) ->
                   <<>> -> Data0;
                   _ -> <<Buffer/binary, Data0/binary>>
               end,
    +          erlang:garbage_collect(self()),
    +          io:format("~.1fMB ~.1fMB~n",
    +                    [(byte_size(Data) div 1024) / 1024,
    +                     (binary:referenced_byte_size(Data) div 1024) / 1024
    +                    ]),
               scan_data(Data, Fd, Offset, FileSize, MsgIdsFound, Acc)
       end.
    

    The referenced binaries are not much larger than the useful data:

    > spawn(fun() ->  rabbit_msg_store:scan_file_for_valid_messages("0.rdq") end).
    <0.302.0>
    4.0MB 4.0MB
    8.0MB 16.0MB
    12.0MB 24.0MB
    16.0MB 32.0MB
    20.0MB 24.0MB
    24.0MB 28.8MB
    28.0MB 33.6MB
    32.0MB 38.4MB
    36.0MB 43.2MB
    40.0MB 48.0MB
    44.0MB 52.8MB
    48.0MB 57.6MB
    52.0MB 62.4MB
    56.0MB 67.2MB
    60.0MB 72.0MB
    64.0MB 76.8MB
    68.0MB 81.6MB
    72.0MB 86.4MB
    76.0MB 91.2MB
    80.0MB 96.0MB
    84.0MB 100.8MB
    88.0MB 105.6MB
    92.0MB 110.4MB
    96.0MB 115.2MB
    100.0MB 120.0MB
    104.0MB 124.8MB
    104.9MB 125.9MB
    
  • inspected what binaries are referenced during scanning

    modified   deps/rabbit/src/rabbit_msg_store.erl
    @@ -1421,6 +1421,7 @@ scan(Buffer, Fd, Offset, FileSize, MsgIdsFound, Acc) ->
                     <<>> -> Data0;
                     _ -> <<Buffer/binary, Data0/binary>>
                 end,
    +            io:format("~p~n~n", [process_info(self(), binary)]),
                 scan_data(Data, Fd, Offset, FileSize, MsgIdsFound, Acc)
         end.
    

    It is visible that each time there is a new binary reference so reallocation happens

    > spawn(fun() ->  rabbit_msg_store:scan_file_for_valid_messages("0.rdq") end).
    <0.189.0>
    {binary,[{5761253416,4194304,1}]}
    
    {binary,[{5924782120,16777216,1},
             {5761253416,4194304,1},
             {5905580072,4194304,1}]}
    
    {binary,[{5924782120,8388608,1},
             {5933187112,25165824,1},
             {5761253416,4194304,1}]}
    
    {binary,[{5933187112,12582912,1},
             {5945786408,33554432,1},
             {5761253416,4194304,1}]}
    
    {binary,[{5945786408,16777216,1},
             {5962580008,25165824,1},
             {5761253416,4194304,1}]}
    
    {binary,[{5962580008,20971520,1},
             {5983567912,30198988,1},
             {5761253416,4194304,1}]}
    
    {binary,[{5983567912,25165824,1},
             {6114050088,35232153,1},
             {5761253416,4194304,1}]}
    
    {binary,[{6114050088,29360128,1},
             {11005853736,40265318,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11005853736,33554432,1},
             {11039424552,45298483,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11039424552,37748736,1},
             {11077189672,50331648,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11077189672,41943040,1},
             {11119149096,55364812,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11119149096,46137344,1},
             {11165302824,60397977,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11165302824,50331648,1},
             {11215650856,65431142,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11215650856,54525952,1},
             {11270193192,70464307,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11270193192,58720256,1},
             {11328929832,75497472,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11328929832,62914560,1},
             {11391860776,80530636,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11391860776,67108864,1},
             {11458986024,85563801,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11458986024,71303168,1},
             {5924782120,90596966,1},
             {5761253416,4194304,1}]}
    
    {binary,[{5924782120,75497472,1},
             {11530305576,95630131,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11530305576,79691776,1},
             {11610013736,100663296,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11610013736,83886080,1},
             {11693916200,105696460,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11693916200,88080384,1},
             {11005853736,110729625,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11005853736,92274688,1},
             {11782012968,115762790,1},
             {5761253416,4194304,1}]}
    
    {binary,[{11878498344,120795955,1},
             {5905580072,4194304,1},
             {11782012968,96468992,1}]}
    
    {binary,[{11878498344,100663296,1},
             {11098144808,125829120,1},
             {5905580072,4194304,1}]}
    
    {binary,[{11098144808,104857600,1},
             {11979178024,130862284,1},
             {5905580072,4194304,1}]}
    
    {binary,[{11979178024,109051904,1},
             {11203018792,132005293,1},
             {5905580072,952507,1}]}
    

Patches

I created an initial patch which uses file:position to skip the message body if it is large enough (cloudamqp@d5b8e19). With that patch the OOM does not happen.

I was also curious if the same issue affects the code which actually reads a message from disk. But that code (rabbit_msg_store:reader_pread/2) reads the message in one go without chunking and appending.

Indeed if the scanner would also just read the large message in one go it does not OOM on the same small-memory instance either, as I tested it with the following patch cloudamqp@6c216aa

Is this all reasonable? Is the approach using file:position the right one, so I can polish that one into a PR?
(In case of an invalid message when the file is scanned skipping 1 byte at a time, jumping ahead and back many times can be inefficient, but I dont know if we should optimise for that case)

@lhoguin
Copy link
Contributor

lhoguin commented May 7, 2024

Thanks. I need to think about it.

@michaelklishin
Copy link
Member

110 MiB messages belong to a blob store, not to a RabbitMQ queues.

@gomoripeti
Copy link
Contributor Author

110 MiB messages belong to a blob store, not to a RabbitMQ queues.

absolutely, I also wrote earlier:

(I know it is not good practice to have megabyte of message bodies but unfortunately we see them from time to time even on smaller servers)

The default message size limit is 128MB (https://github.com/rabbitmq/rabbitmq-server/blob/v3.13.2/deps/rabbit/Makefile#L117) and if something is not blocked explicitly someone will make use of it.

The test case with a very very small server and a single, very very large message is just an artificial edge case where I could analyze the problem easier. But I believe the optimisation can still have value for many but much smaller messages

@lhoguin
Copy link
Contributor

lhoguin commented May 7, 2024

Large messages (above the file size limit of 16MB by default) have interesting properties:

  • they are guaranteed to be the last in the file (since the file will exceed the limit and the store will roll over to a new file after writing the message)
  • they can't be moved during compaction because they are necessarily bigger than whatever data exists before them
  • there can't be data after them in the file

So to know whether a large message is real or not, we can just check whether Offset + Size + 9 =:= FileSize and then check the final byte of the file to be 255. We don't need to do any further reading, and we know we are done with this message.

If the first check is correct but the second isn't, the file is also necessarily corrupt because we are blanking data starting in 3.13.2 during compaction. So the probability of seeing false positives is very low. Since it shouldn't happen much, if at all, if we get the final byte to be something other than 255, we can afford the extra file:position to go back to where we left off.

I don't know if the file size limit is often configured to a different value than 16MB, and I doubt handling files in the 4MB-16MB range specially would bring much. But perhaps you have a more real world example that would prove this wrong.

Perhaps it would be better to simply write messages > 16MB in their own files, as well, since they can't be compacted anyway.

@lhoguin
Copy link
Contributor

lhoguin commented May 7, 2024

Thinking about it some more I think the solution to write large messages in their own file makes everything much simpler. If we get a large message, we close the current file, write the message in the next file, then roll over to a third file for more normal messages. Then when scanning if the first message is the size of the file we branch to a simpler scan function to verify the end of the file.

That's assuming there's no problem in the 4MB-16MB range and that people don't have fancy file_size_limit configurations (but those that do I would expect understand what they are doing).

@michaelklishin
Copy link
Member

We can and should lower the limit. The idea was to gradually lower it from its original 512 MiB default but we have missed the opportunity to go to, say, 64 MiB in 3.13.x. Oh well.

I guess 4.0 can use a 64 MiB or a 50 MiB limit.

@gomoripeti
Copy link
Contributor Author

good point that "very large" messages (above file size limit) are always the last and good idea to store them alone.

I was also thinking about "fake" size prefixes. Can these cause very inefficient scanning in some patological or malicious file content? If the size would be stored on 4 bytes only a file full of 1 bytes (<<1, 1, 1, 1, 1, ...>>) would represent about 16MB and and an rdq file of size 20MB would have the first 4MB of bytes skippped one by one.... But the size is stored on 64 bits, so it must contain some leading zeros so it cannot contain consecutive valid looking sizes that are still within the file size boundary. I did not consider how compaction works and these file contents are not even possible.

I will do some more testing with 4-16MB messages. But I think the second patch that reads more than ?SCAN_BLOCK_SIZE can be beneficial and it's a minor change (to avoid some (just a few) unnecessary appends)

we use default rdq file_size limit. When experimenting the maximum we adjust is the embed threshold (for example to store more messages in the per queue index and reduce load on the shared store) so I don't have experience with larger file sizes with not just one big message.

@gomoripeti
Copy link
Contributor Author

(I realize I'm not very consistent. In this issue I support optimisations for messages of size 4-16MB and even >16MB, and in #11187 I propose to reduce default max message size to 4MB or bellow. I will investigate some real use cases for large messages)

@luos
Copy link
Contributor

luos commented May 7, 2024

Hi! I definitely do think there are some use cases for large messages. Mostly just people who already have RabbitMQ and need to send sometimes a large message and don't want to stand up an object store just for 3 objects / day.

It is a very common use case to send emails or pdfs through RabbitMQ - which obviously grow in size in production. 😄

Any optimisation to large messages is useful, but at least they should not blow up memory usage, but it's fine if it's not the fastest use case.

Thank you for thinking about it @gomoripeti .

I think that the original reason for lowering the message size (#1812) is not that valid any more as heartbeat issue got fixed in Erlang, though we know the clustering link is still a weak point of RabbitMQ.

@lhoguin
Copy link
Contributor

lhoguin commented May 13, 2024

I don't want to "break" the file format too much and unexpectedly run into further issues. I have to focus on some other areas for a while. So having a fake size is something I would go against. Having a better format would be a good idea though. Currently the index is fully in memory which leads to recovery after crash taking forever, as well as the memory usage increasing with the number of messages in the store. A better format may fix that.

It would be good to have reproduction steps for messages that do not go above 16MB, to understand exactly what is going on. Messages above 16MB can be handled very easily.

Alternatively, comparing the current code with 3.12 code might tell us more. The scan block size is the same as the old file_handle_cache code I think.

I have tried an optimisation to skip zeroes faster but it turns out that Erlang scans binaries really fast already so did not see any gains. On the other hand I did another optimisation which more than halves the dirty recovery time in fbf11f5 and likely reduces the memory footprint a little as well.

@lhoguin
Copy link
Contributor

lhoguin commented May 13, 2024

A better format may fix that.

And it will also let us avoid scanning, therefore avoid memory issues that may arise from doing that...

@lhoguin
Copy link
Contributor

lhoguin commented May 13, 2024

I think both the 1-file per huge message, and the skip_body function for large-but-not-huge messages, would be good to have.

@lhoguin
Copy link
Contributor

lhoguin commented May 13, 2024

Honestly it probably makes sense to write not only 16MiB+ messages in their own files, but 8MiB+ messages as well, or even 4MiB+...

  • It won't create that many more files
  • We know we will rarely benefit from compaction for messages above 8MiB, or larger messages in general (the probability we can move a message to the beginning of the file during compaction is inversely proportional to its size)
  • We will be able to delete the file directly when the message is removed, instead of waiting for compaction
  • We will be less likely to run into issues with larger messages since they'll be handled specially during crash recovery and won't be compacted

So I'm leaning toward making all 4MiB+ encoded data sit into its own file.

@gomoripeti
Copy link
Contributor Author

thank you very much, Loic, for the time you dedicate to this topic, thinking it through and the code changes.
I see you already have a draft for 4.0 for the "large messages go to separate file" strategy.

Would it make sense to apply the "scanner reads large message in one go" patch cloudamqp@6c216aa as a short term improvement for 3.x versions? (The current appending would result in the same binary that is read in one go, just avoiding some append and binary reallocations, so this change would have minimal difference in behaviour with a bit better memory characteristics) At this point the "seeking" patch does not make sense any more in light of your upcoming change.

I'm travelling this week but as soon as I can I will run some tests regarding binary memory characteristics of both #11112 and my suggested patch.

@lhoguin
Copy link
Contributor

lhoguin commented May 15, 2024

Yes that patch seems like a good one for 3.13 if it helps with the memory issues, please open a PR against that branch and I will review it. For 4.0 it will probably be unnecessary (messages should fit within 2 reads at most, like this patch does).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants