Optimise how rdq files are scanned at CQ shared message store recovery startup #11072
I would expect the unused message to be a sub-binary, but perhaps it isn't. The function mirrors what was done before 3.13. We have not seen extra memory during testing as a result of compaction in any case. Either way, I am currently fixing crashes related to this, so I will investigate at the same time.
Let me rephrase: we have not seen extra memory in 3.13.1. We did fix a bug in 3.13.0 that led to compaction being very slow, using lots more memory than normal. But for 3.13.1 I wouldn't expect much extra memory.
Thanks for the prompt response! It is possible that the issue is only visible during startup/recovery, when all the rdq files are scanned. (So for combine and delete of one file at a time it would be an unnoticeable improvement.)

For messages larger than 4MB, let's say a 16MB message, the file is read in 4MB chunks and 4 of those chunks are concatenated to reach the end_marker. In this case the result is not just sub-binaries. So maybe the issue only manifests when there are some large enough messages in the rdq files. (I know it is not good practice to have megabytes of message bodies, but unfortunately we see them from time to time even on smaller servers.)

I will gather some example rdq files and an exact reproduction showing how large the memory spike really is.
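A minimal sketch of the concatenation I am describing (illustrative only: the function names and the block size constant are made up, not the actual rabbit_msg_store scanner): once an entry is larger than one scan block, the accumulated binary keeps growing on every append, so the result can no longer be a sub-binary of a single read.

```erlang
%% Illustrative only; not the actual rabbit_msg_store scanner code.
-define(SCAN_BLOCK_SIZE, 4194304). %% 4MB blocks, as described above

%% Keep appending 4MB blocks until we have the whole entry.
read_entry(Fd, Offset, NeededSize) ->
    read_entry(Fd, Offset, NeededSize, <<>>).

read_entry(_Fd, _Offset, NeededSize, Acc) when byte_size(Acc) >= NeededSize ->
    Acc;
read_entry(Fd, Offset, NeededSize, Acc) ->
    {ok, Block} = file:pread(Fd, Offset, ?SCAN_BLOCK_SIZE),
    %% This append may force a reallocation and copy of Acc as it grows.
    read_entry(Fd, Offset + byte_size(Block), NeededSize,
               <<Acc/binary, Block/binary>>).
```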
We currently do not test with messages above 1MB, so I would not be surprised if there were issues there. Looking forward to the data!
I only have some initial information. One incident about a week ago happened on a 3-node cluster on RabbitMQ 3.12.12 with 2 schedulers and ~1.8GB RAM. No info about the rdq files, unfortunately. (Only relevant because it is a relatively recent version, though not 3.13.) I tried some reproductions locally on a recent main @ 57f9aec.

Steps
Test cases
So to sum up, on latest main I don't see that huge, "exponential", orders-of-magnitude-higher memory consumption. (There is room for some improvement, but not as much as I expected.) I will continue investigating.
Now I have a reproduction, or at least a definition of high memory consumption.

Reproduction
After obtaining the rdq file, a simpler reproduction is to just call the scanning function directly.
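For example, from an Erlang shell attached to the broker node (the path below is a placeholder, and the exact arity of scan_file_for_valid_messages differs between releases, some take the directory and file name as separate arguments, so treat this as a sketch):

```erlang
%% Run in a shell attached to the node; adjust the path and arity as needed.
Path = "/path/to/msg_store_persistent/0.rdq".
Before = erlang:memory(binary).
_ = rabbit_msg_store:scan_file_for_valid_messages(Path).
io:format("binary memory delta: ~p bytes~n", [erlang:memory(binary) - Before]).
```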
In syslog I see that the RSS memory of the beam process is above 600MB
I think what happens is that the 105MB message body is appended from 4MB pieces, and very often there is a reallocation, which leaves a lot of fragmentation behind. (I created a script that terminates the beam process when it goes above 500MB RSS and creates a crash dump.) I don't see in that crash dump where that 600MB is allocated by the memory carriers.

Some inspection details
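These are not the inspection details from that run, but a rough sketch of the kind of live inspection that can be done, assuming the recon library (bundled with recent RabbitMQ releases) is available: it compares what the allocators still hold in carriers with what is actually in use, which is where fragmentation would show up.

```erlang
%% Rough live-inspection sketch; requires the recon application.
io:format("binary memory: ~p bytes~n", [erlang:memory(binary)]),
io:format("allocated: ~p bytes, used: ~p bytes, usage ratio: ~.2f~n",
          [recon_alloc:memory(allocated),
           recon_alloc:memory(used),
           recon_alloc:memory(usage)]).
```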
Patches I created an initial patch which uses I was also curious if the same issue affects the code which actually reads a message from disk. But that code (rabbit_msg_store:reader_pread/2) reads the message in one go without chunking and appending. Indeed if the scanner would also just read the large message in one go it does not OOM on the same small-memory instance either, as I tested it with the following patch cloudamqp@6c216aa Is this all reasonable? Is the approach using |
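For reference, the idea behind the second patch, sketched rather than quoted (the helper name is made up, and the entry layout of a size prefix followed by the data and an end marker is an assumption): once the size prefix shows the entry extends beyond the current scan block, issue one pread for the exact number of missing bytes instead of appending 4MB blocks until the end marker is reached.

```erlang
%% Sketch of "read the rest of a large entry in one go"; illustrative names.
read_rest_of_entry(Fd, Offset, EntrySize, AlreadyRead) ->
    Missing = EntrySize - byte_size(AlreadyRead),
    {ok, Rest} = file:pread(Fd, Offset, Missing),
    %% A single append instead of one per 4MB block.
    <<AlreadyRead/binary, Rest/binary>>.
```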
Thanks. I need to think about it.
110 MiB messages belong in a blob store, not in RabbitMQ queues.
absolutely, I also wrote earlier:
The default message size limit is 128MB (https://github.com/rabbitmq/rabbitmq-server/blob/v3.13.2/deps/rabbit/Makefile#L117), and if something is not blocked explicitly, someone will make use of it. The test case with a very, very small server and a single, very, very large message is just an artificial edge case where I could analyze the problem more easily. But I believe the optimisation can still have value for many, much smaller messages.
Large messages (above the file size limit of 16MB by default) have interesting properties:
So to know whether a large message is real or not, we can just check whether the size prefix is consistent with the remaining size of the file and whether the entry ends with the expected end marker. If the first check is correct but the second isn't, the file is also necessarily corrupt, because we are blanking data during compaction starting in 3.13.2. So the probability of seeing false positives is very low, and since it shouldn't happen much, if at all, a slower fallback is acceptable if the final byte turns out to be something other than the end marker.

I don't know if the file size limit is often configured to a different value than 16MB, and I doubt handling files in the 4MB-16MB range specially would bring much. But perhaps you have a more real-world example that would prove this wrong.

Perhaps it would be better to simply write messages > 16MB in their own files as well, since they can't be compacted anyway.
Thinking about it some more, I think the solution of writing large messages in their own file makes everything much simpler. If we get a large message, we close the current file, write the message in the next file, then roll over to a third file for more normal messages. Then when scanning, if the first message is the size of the file, we branch to a simpler scan function to verify the end of the file. That's assuming there's no problem in the 4MB-16MB range and that people don't have fancy file_size_limit configurations (but those that do, I would expect, understand what they are doing).
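A minimal sketch of that simpler scan branch, under the assumption that an entry is laid out as an 8-byte size prefix, the message id plus body, and a 1-byte end marker (the function name, the layout details, and the EndMarker parameter are illustrative, not the actual store code):

```erlang
%% A file holding a single huge message is valid if its only entry
%% spans the whole file and ends with the expected marker byte.
scan_single_message_file(Path, EndMarker) ->
    FileSize = filelib:file_size(Path),
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        {ok, <<Size:64>>} = file:pread(Fd, 0, 8),
        %% 8-byte prefix + Size bytes of id/body + 1-byte end marker.
        case FileSize =:= Size + 9 of
            true ->
                {ok, <<Marker>>} = file:pread(Fd, FileSize - 1, 1),
                Marker =:= EndMarker;
            false ->
                false
        end
    after
        file:close(Fd)
    end.
```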
We can and should lower the limit. The idea was to gradually lower it from its original 512 MiB default, but we have missed the opportunity to go to, say, 64 MiB in 3.13.x. Oh well. I guess 4.0 can use a 64 MiB or a 50 MiB limit.
Good point that "very large" messages (above the file size limit) are always the last in a file, and good idea to store them alone.

I was also thinking about "fake" size prefixes: can these cause very inefficient scanning for some pathological or malicious file content, for example if the size were stored in only 4 bytes?

I will do some more testing with 4-16MB messages. But I think the second patch, which reads more than ?SCAN_BLOCK_SIZE at once, can still be beneficial, and it's a minor change (it avoids a few unnecessary appends).

We use the default rdq file_size limit. When experimenting, the most we adjust is the embed threshold (for example to store more messages in the per-queue index and reduce load on the shared store), so I don't have experience with larger file sizes holding more than one big message.
(I realize I'm not very consistent. In this issue I support optimisations for messages of size 4-16MB and even >16MB, and in #11187 I propose to reduce the default max message size to 4MB or below. I will investigate some real use cases for large messages.)
Hi! I definitely do think there are some use cases for large messages. Mostly just people who already have RabbitMQ, sometimes need to send a large message, and don't want to stand up an object store just for 3 objects a day. It is a very common use case to send emails or PDFs through RabbitMQ, which obviously grow in size in production. 😄 Any optimisation for large messages is useful; at the very least they should not blow up memory usage, but it's fine if it's not the fastest use case. Thank you for thinking about it @gomoripeti. I think that the original reason for lowering the message size (#1812) is not that valid any more as the heartbeat issue got fixed in Erlang, though we know the clustering link is still a weak point of RabbitMQ.
I don't want to "break" the file format too much and unexpectedly run into further issues, and I have to focus on some other areas for a while. So having a fake size is something I would go against. Having a better format would be a good idea though. Currently the index is fully in memory, which leads to recovery after a crash taking forever, as well as the memory usage increasing with the number of messages in the store. A better format may fix that.

It would be good to have reproduction steps for messages that do not go above 16MB, to understand exactly what is going on. Messages above 16MB can be handled very easily. Alternatively, comparing the current code with the 3.12 code might tell us more. The scan block size is the same as in the old file_handle_cache code, I think.

I have tried an optimisation to skip zeroes faster, but it turns out that Erlang scans binaries really fast already, so I did not see any gains. On the other hand, I did another optimisation which more than halves the dirty recovery time in fbf11f5 and likely reduces the memory footprint a little as well.
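For context, this is roughly the kind of zero-skipping that was tried (a guess at the shape, not the actual experiment): match several zero bytes per iteration instead of one before looking for the next size prefix.

```erlang
%% Skip a run of zero bytes, eight at a time when possible.
skip_zeroes(<<0,0,0,0,0,0,0,0, Rest/binary>>, Offset) ->
    skip_zeroes(Rest, Offset + 8);
skip_zeroes(<<0, Rest/binary>>, Offset) ->
    skip_zeroes(Rest, Offset + 1);
skip_zeroes(Bin, Offset) ->
    {Bin, Offset}.
```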
And it will also let us avoid scanning, therefore avoid memory issues that may arise from doing that...
I think both the one-file-per-huge-message approach and the skip_body function for large-but-not-huge messages would be good to have.
Honestly it probably makes sense to write not only 16MiB+ messages in their own files, but 8MiB+ messages as well, or even 4MiB+...
So I'm leaning toward making all 4MiB+ encoded data sit in its own file.
Thank you very much, Loic, for the time you dedicate to this topic, thinking it through, and the code changes. Would it make sense to apply the "scanner reads large message in one go" patch (cloudamqp@6c216aa) as a short-term improvement for 3.x versions? (The current appending would result in the same binary that is read in one go; the patch just avoids some appends and binary reallocations, so this change would have minimal difference in behaviour with somewhat better memory characteristics.) At this point the "seeking" patch does not make sense any more in light of your upcoming change. I'm travelling this week, but as soon as I can I will run some tests regarding the binary memory characteristics of both #11112 and my suggested patch.
Yes, that patch seems like a good one for 3.13 if it helps with the memory issues; please open a PR against that branch and I will review it. For 4.0 it will probably be unnecessary (messages should fit within 2 reads at most, like this patch does).
Is your feature request related to a problem? Please describe.
The symptom is that after an unclean shutdown, when RabbitMQ starts up and runs its recovery steps, its memory usage spikes to a seemingly unreasonable level. On smaller servers with, let's say, only 2GB of memory, it causes an OOM (cyclic restarts, and the node cannot start up). We have good reason to believe that it is the recovery of the classic queue shared message store, when it reads in all the rdq files, that is responsible for this behaviour. Although there are large rdq files, the total message size of "ready" messages in queues is 1 or 2 orders of magnitude less than the memory used.
This issue has been observed from time to time over the years, including on 3.12.x.
Describe the solution you'd like
I would like to get early feedback from the Core Team on whether this is worth investigating (we definitely see this issue often enough) and whether contributions would be accepted.
The function rabbit_msg_store:scan_file_for_valid_messages is used to get the MsgIds and their offsets in the file, but at the same time the message bodies are also read in, only to be thrown away. This can lead to extra large binaries which are never used and extra garbage.
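As a rough illustration of the direction proposed here (the entry layout of an 8-byte size prefix, a 16-byte message id, the body, and a 1-byte end marker, as well as all the names, are assumptions, not the actual store format): read only the prefix and the message id for each entry and seek past the body instead of pulling it into memory.

```erlang
%% Collect {MsgId, Size, Offset} without reading message bodies.
scan_headers(_Fd, Offset, FileSize, Acc) when Offset >= FileSize ->
    lists:reverse(Acc);
scan_headers(Fd, Offset, FileSize, Acc) ->
    case file:pread(Fd, Offset, 8 + 16) of
        {ok, <<Size:64, MsgId:16/binary>>} when Size >= 16 ->
            %% Jump over the body: prefix + entry + end marker.
            NextOffset = Offset + 8 + Size + 1,
            scan_headers(Fd, NextOffset, FileSize,
                         [{MsgId, Size, Offset} | Acc]);
        _ ->
            %% Not a valid entry at this offset; a real scanner would
            %% resynchronise or verify the end marker here.
            lists:reverse(Acc)
    end.
```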
Part of the problem could be the presence of unreferenced messages in rdq files which have not yet been removed from the file. When a server manages to start up after a successful recovery, we've seen the size of the rdq files shrink. The GC of unreferenced entries changed considerably in 3.13, where the msg store uses a compaction mechanism instead of combining two files. But issues were still seen (#10681), so I believe this change could still be relevant. The
scan_file_for_valid_messages
function is not just used at recovery time, but also during compaction and rdq deletion.

Describe alternatives you've considered
No response
Additional context
No response