Excessive Memory Usage for Large Json File #1516
Thanks for reporting! Indeed this library is not as memory-efficient as it could be. This is because parsed values are, by default, stored in a DOM-like hierarchy, ready to be read and changed later on. For this, we use STL containers such as std::vector, std::map, and std::string. I fear there is little we can do at the moment - the file has some 59 million values to store.

If you just want to process the file (e.g., summing up some values or looking up a particular object) without storing the whole value in memory, you can define a parser callback or a dedicated SAX parser and let its logic decide while parsing which values to store.

I created a (pretty-printed) JSON file with your program above and opened it in some tools. It seems they also have some issues with memory...
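As a sketch of the parser-callback route (the kept key "id" and the sample document are made up for illustration, and a real use case would pass a stream over the large file instead of a string literal), object members can be skipped while parsing so they never end up in the resulting DOM:

```cpp
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main()
{
    // Keep only object members whose key is "id"; every other member is
    // parsed but returned-false-on, so it is never stored in the result.
    json::parser_callback_t keep_only_ids =
        [](int /*depth*/, json::parse_event_t event, json& parsed)
    {
        if (event == json::parse_event_t::key)
        {
            return parsed == json("id");  // false discards this key and its value
        }
        return true;                      // keep all other events
    };

    // Hypothetical input; note the "nested" member (key != "id") is dropped
    // entirely, including everything inside it.
    const char* text =
        R"({"id": 1, "payload": "a large string we do not need",
            "nested": {"id": 2, "blob": [1, 2, 3]}})";

    json filtered = json::parse(text, keep_only_ids);
    std::cout << filtered.dump(2) << '\n';  // prints {"id": 1}
}
```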
A quick look at the code shows heavy array usage. It is probably worthwhile writing a program that runs through the whole DOM and adds up the std::vector::capacity() values for all the vectors. Since vectors grow via a capacity-doubling strategy, this alone could produce ~2x bloat. If that's the case, some judicious use of reserve/resize might improve things.

Is there a reserve call on json::array()? It might be worth adding one to avoid all the intermediate allocations that can appear to bloat the memory footprint (depending on how you're measuring) as well.

Other than that, though, Niels is right that there may not be a lot that can be done, due to the use of a friendly DOM-like data structure built from the standard library. A 10x expansion is not uncommon in other languages as well (e.g. Python). The SAX approach can be used to build your own (presumably more efficient) data structure, or to just process data on the fly.
There currently is no reserve call for arrays, objects, or strings. This, however, would not help during parsing, where the number of elements still to come is unknown. The situation is different with formats like CBOR or MessagePack, where the sizes of arrays, objects, and strings are given before the actual values. Then, however, it would be unsafe to just take these values and call reserve without checks. One way to reduce the overhead would be to add a shrink_to_fit step once a value has been parsed completely.
It is possible to use resize/reserve to implement a different growth strategy for vectors that trades more frequent allocations for less wasted space. But yes, it's not ideal when parsing incoming JSON. I do suggest measuring before doing anything. It ought to be easy to traverse the DOM and use capacity() to evaluate how much extra allocation has been done in a test case like the one in the OP.
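A minimal sketch of such a measurement, using the library's public get_ref() accessor for the underlying array_t (std::vector) and string_t (std::string); the recursion, the byte accounting, and the struct name are illustrative, not part of the library, and per-node std::map overhead for objects is ignored for simplicity:

```cpp
#include <cstddef>
#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

struct CapacityStats
{
    std::size_t used_bytes = 0;      // bytes actually needed (size-based)
    std::size_t reserved_bytes = 0;  // bytes actually allocated (capacity-based)
};

// Walk the DOM and compare size() against capacity() for arrays and strings.
void audit(const json& j, CapacityStats& stats)
{
    if (j.is_array())
    {
        const auto& vec = j.get_ref<const json::array_t&>();
        stats.used_bytes     += vec.size()     * sizeof(json);
        stats.reserved_bytes += vec.capacity() * sizeof(json);
        for (const auto& element : vec)
        {
            audit(element, stats);
        }
    }
    else if (j.is_object())
    {
        for (const auto& item : j.items())
        {
            audit(item.value(), stats);  // object keys are not counted here
        }
    }
    else if (j.is_string())
    {
        const auto& str = j.get_ref<const json::string_t&>();
        stats.used_bytes     += str.size();
        stats.reserved_bytes += str.capacity();
    }
}

int main()
{
    json j = json::parse(R"({"values": [1, 2, 3], "name": "a fairly long string value"})");

    CapacityStats stats;
    audit(j, stats);
    std::cout << "used:     " << stats.used_bytes << " bytes\n"
              << "reserved: " << stats.reserved_bytes << " bytes\n";
}
```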
Are you worried about it allocating too much memory? Otherwise, safety shouldn't be an issue.
I added the reserve call once, but OSS-Fuzz then took only a few hours to generate an example with a CBOR file that announced an array with billions of elements, which crashed the library.
I know. But I am curious :)
Okay, that could be solved by putting an upper bound on the initial reserve. If we said that we'd support up to 1024 or 1000 or something like that, then that should be plenty. By that point, it's a fair way up the allocation curve.
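A rough sketch of that idea, outside the library: clamp the size announced by a binary format before passing it to reserve. The constant and the helper function below are hypothetical, not nlohmann/json API:

```cpp
#include <algorithm>
#include <cstddef>
#include <nlohmann/json.hpp>

// Hypothetical cap: never pre-reserve more than this many elements up front,
// no matter how large the size announced in the CBOR/MessagePack header is.
constexpr std::size_t kMaxInitialReserve = 1024;

// Reserve at most kMaxInitialReserve slots; anything beyond that is left to
// the vector's normal growth strategy, so a hostile header claiming billions
// of elements cannot trigger a huge allocation before any element is read.
void bounded_reserve(nlohmann::json::array_t& array, std::size_t announced_size)
{
    array.reserve(std::min(announced_size, kMaxInitialReserve));
}
```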
I really like the idea here, as I've had large-ish objects that I really want to compact because they'll be around a while. I also think a shrink-while-parsing option would be nice, because it would allow you to read in larger objects than you might otherwise have been able to and (I think?) would be kinder to heap fragmentation. By the way, I'm not concerned about whether shrink_to_fit has a guarantee. We should trust the STL to do the right thing, and if it doesn't, I don't think we should try to do better.
In fact, shrinking would be easy during parsing, because the parser has dedicated events for the end of each array and object. Intuitively, I would shrink bottom-up, so in a vector of strings, I would first shrink the strings and then the vector - this would also be the order if shrinking were integrated into the parser. But we should definitely take measurements.
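For illustration, here is a bottom-up shrink pass that could be run after (rather than during) parsing, again via the library's public get_ref() accessor; the function itself is hypothetical, not library API:

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Shrink bottom-up: children first, then the container that holds them,
// mirroring the order in which a parser finishes nested values.
void shrink_recursively(json& j)
{
    if (j.is_array())
    {
        auto& vec = j.get_ref<json::array_t&>();
        for (auto& element : vec)
        {
            shrink_recursively(element);      // shrink nested values first
        }
        vec.shrink_to_fit();                  // then the vector itself
    }
    else if (j.is_object())
    {
        for (auto& item : j.items())
        {
            shrink_recursively(item.value()); // std::map has no capacity to trim
        }
    }
    else if (j.is_string())
    {
        j.get_ref<json::string_t&>().shrink_to_fit();
    }
}
```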
Trusting the STL is usually smart because a lot of effort has gone into writing and debugging it, but the fact remains that it is generic, i.e. by definition not specific to a particular use case. If you know more about your particular situation and needs, you can often improve upon its generic behaviour. Vector allocation lengths are one place where I've seen improvements over and over again -- that's why the reserve and capacity methods are there. But to do this you absolutely need to "know more" about the specific situation.

I think an interesting case would be shrink_to_fit during parsing. The parser ought to be able to trim excess allocations as it progresses. This should be an option, as you don't always want it, but if you're loading a massive file it will likely be a large win in terms of memory footprint (although it could easily be slower because of the reallocation/copy that might happen if the memory allocator being used doesn't allow existing blocks to be shrunk).
C++ memory allocators don't allow blocks to be shrunk in place. That's part of why shrink_to_fit is only a non-binding request in the standard.
Looking at the libc++ implementation, std::vector::shrink_to_fit does do something. The source for string was a little harder to follow, but I did a quick manual test and its std::string does reallocate (or, in my test case, moves the data back into the string object's small-string buffer and releases the heap allocation).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'd like to re-open this issue. I had it parse a 700 MB file and it took 9 GB; that's over 10x the file size in memory. My understanding is that a string's default capacity is implementation-defined, but generally starts at 10-20 bytes. So if your JSON contains many short strings, a 10x increase in memory is easy to hit once you also take the rest of the DOM data structure into account. I'd guess that most entries in a parsed JSON rarely need to change, so shrink_to_fit on the strings and vectors would likely also come with potential speed improvements (more data could fit into the L1 and L2 caches).
For me, std::string is 32 bytes on a 64-bit system. Its initial capacity is 14-15 characters, which are embedded into the std::string structure itself via the small-string optimization. The other STL structures also take up similar space without allocating anything on the heap. This is why it takes up so much memory. You can implement your own JSON type to reduce memory as much as possible, and build it using the SAX parser. Or, since basic_json is just a template, you can plug your own types into it, though I imagine they would then need to match the STL interface. https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/input/json_sax.hpp#L145
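A minimal SAX handler along those lines, assuming a reasonably recent nlohmann/json release (the binary() event was added in 3.8.0); the handler name and the statistics it gathers are made up for illustration, and a real application would build its own compact data structure in these callbacks instead:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// A SAX handler that never builds a DOM: it only counts values and string
// bytes while the input is streamed through the parser.
struct CountingHandler : nlohmann::json_sax<json>
{
    std::size_t value_count = 0;
    std::size_t string_bytes = 0;

    bool null() override                                         { ++value_count; return true; }
    bool boolean(bool) override                                  { ++value_count; return true; }
    bool number_integer(number_integer_t) override               { ++value_count; return true; }
    bool number_unsigned(number_unsigned_t) override             { ++value_count; return true; }
    bool number_float(number_float_t, const string_t&) override  { ++value_count; return true; }
    bool string(string_t& val) override   { ++value_count; string_bytes += val.size(); return true; }
    bool binary(binary_t& val) override   { ++value_count; string_bytes += val.size(); return true; }
    bool start_object(std::size_t) override { return true; }
    bool key(string_t&) override            { return true; }
    bool end_object() override              { return true; }
    bool start_array(std::size_t) override  { return true; }
    bool end_array() override               { return true; }
    bool parse_error(std::size_t position, const std::string&,
                     const nlohmann::detail::exception& ex) override
    {
        std::cerr << "parse error at byte " << position << ": " << ex.what() << '\n';
        return false;
    }
};

int main()
{
    CountingHandler handler;
    json::sax_parse(R"({"name": "example", "values": [1, 2.5, true, null]})", &handler);
    std::cout << handler.value_count << " values, "
              << handler.string_bytes << " bytes of string/binary data\n";
}
```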
I just ran into this same issue: a 1.6 GB JSON file wouldn't load (even with 10 GB of memory).
Parsing a 1.6 GB file takes up more than 10 GB of memory. I've also seen it go upwards of close to 20 GB of memory for a similarly sized file.
You can generate such a file with the code below; I don't have anywhere to upload a file that large.