
CLI crashes on JSON files > ~256MB #17

Closed
vsivsi opened this issue Mar 5, 2014 · 8 comments

@vsivsi
Contributor

vsivsi commented Mar 5, 2014

Hi, great tool, but I've run into what seems like an arbitrary limitation/bug. I found two OSM Overpass output JSON files that differ by a single way with 4 nodes. The smaller one runs without failure. The larger one crashes consistently with:

FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory
Abort trap: 6

When the smaller file runs, it requires less than 2GB of memory (out of ~15GB available). The 1 way / 4 node difference is unremarkable and is not the cause of the problem.

I know this because if I add space characters to the beginning of the smaller "good" file until it is the size of the larger file, it also fails with the above error. So this seems to be entirely about file size and not content. I have distilled it down to two files which contain the exact same JSON content (the good, smaller file from above) but differ by the amount of leading whitespace added (about 240 bytes).

The file that is 268435577 bytes works, and the one that is 268435578 bytes (with one additional leading space character) fails. 256MB is 268435456 bytes, so these files are both slightly larger than that and maybe it's just a coincidence.

I'm running this on a Mac Mini with OS X 10.9.2 and 16GB of RAM. I'm using node.js version 0.10.26 installed from MacPorts. The osmtogeojson version is 2.0.4, installed from npm.

The "good" file can be obtained from Overpass with this command (assuming the OSM database doesn't change between now and when you try it):

curl -X POST -d '[out:json]; ( node(47.517358482056004,-122.44160651794401,47.728472517944,-122.240223482056); way(bn); rel(bw)["type"="multipolygon"];);( ._; way(r););( ._; node(r)->.x; node(w););out body qt;' http://overpass-api.de/api/interpreter > file.json

This command currently returns a file of 268435336 bytes. If you get more than that, try pulling in the bounding box a tiny bit (6th decimal place) until it's less than this amount. That file should convert. Now add about 300 spaces to the beginning of the file until the size is about 268435600 bytes; that one should fail.
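For reference, the padded copy can be produced with something like the following quick sketch (the file names and the exact target size are just the values from this report):

var fs = require('fs');

var TARGET = 268435600;                          // just above 256MB (268435456 bytes)
var body = fs.readFileSync('file.json');         // the "good" Overpass result from above
var padding = Math.max(0, TARGET - body.length);

// Write the leading spaces first, then append the unchanged JSON:
// only the file size changes, not the content.
fs.writeFileSync('file-padded.json', new Array(padding + 1).join(' '));
fs.appendFileSync('file-padded.json', body);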

I've tried adding these v8 parameters to the node command to increase the memory it will allocate, but it doesn't help (I regularly run node scripts using these params that consume 10+ GB of memory without any problems):

--max_new_space_size (max size of the new generation (in kBytes))
--max_old_space_size (max size of the old generation (in Mbytes))

Let me know if you need any more info.

@tyrasd
Owner

tyrasd commented Mar 5, 2014

Thanks for the very detailed bug report. I can confirm this issue on Linux, too. I'll have a look.

@vsivsi
Contributor Author

vsivsi commented Mar 6, 2014

The problem is in the concat-stream package's concat write stream. Node.js buffers are limited to 1GB, so using that package gives that as a hard limit. However, something else must be happening internally to make the practical limit ~1/4 of that.

I verified this by instead using the JSONStream npm package to read and parse the input stream into a JS object (also skipping the JSON.parse()). This works perfectly for JSON data, but you'll obviously need to find another solution for XML input. I don't work much with XML these days, but there must be a decent streaming XML parser on npm that can invoke callbacks for each of the possible children of the <osm> tag with either DOM or JSONified versions of the XML.

Anyway, JSONStream is great; I've used it for several other projects, and it totally solves this issue. It's slightly slower than the native JSON.parse(), but with the huge advantage that you never need to have all of the JSON document text in memory at once.

@vsivsi
Contributor Author

vsivsi commented Mar 6, 2014

If you want to try out JSONStream, here's the code I added:

if (format === 'json') {
    // assumes fs, JSONStream, concat and convert are already in scope elsewhere in the CLI
    (filename ? fs.createReadStream(filename) : process.stdin)
        .pipe(JSONStream.parse())
        .on('root', function (data) {
            convert(data);  // data is already an object, not a Buffer, so convert needs to test for that
        })
        .on('error', function (err) {
            console.error("ERROR: JSON input stream could not be opened or parsed.");
            process.exit(1);
        });
} else {
    (filename ? fs.createReadStream(filename) : process.stdin).pipe(concat(convert));
}

@tyrasd
Owner

tyrasd commented Mar 6, 2014

Actually, the bottleneck here was the .toString(), because the maximum string length in v8 seems to be limited to something around 2^28. But I see that the concatenated buffer would also be an unnecessary limiting factor.
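For illustration, the pre-fix code path boils down to something like this (a simplified sketch, not the actual source): the entire input is buffered, turned into one string, and only then parsed, so the ceiling sits on the string, not on the heap.

var concat = require('concat-stream');

// Everything is collected into a single Buffer first ...
process.stdin.pipe(concat(function (buf) {
    // ... and .toString() then has to produce one giant string
    // (roughly 2^28 characters max in v8) before JSON.parse() ever runs.
    var data = JSON.parse(buf.toString());
    console.log('parsed', Object.keys(data).length, 'top-level keys');
}));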

JSONStream

Yes, this seems like the way to go. I've also tried it and it works very well (the only minor flaw being some rounding errors).

decent streaming XML parser

In fact, I do already use such an XML parser (htmlparser2), so the direct streaming approach should be fairly easy to adopt there, too.
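For the record, the kind of streaming handler meant here looks roughly like this (a rough sketch only, not the osmtogeojson code; the element handling is heavily simplified and the file name is made up):

var fs = require('fs');
var htmlparser = require('htmlparser2');

var elements = [];   // collected <node>/<way>/<relation> children of <osm>
var parser = new htmlparser.Parser({
    onopentag: function (name, attribs) {
        // real code would also handle nested <nd>, <tag> and <member> elements
        if (name === 'node' || name === 'way' || name === 'relation')
            elements.push({type: name, attribs: attribs});
    },
    onend: function () {
        console.log('parsed', elements.length, 'OSM elements');
    }
}, {xmlMode: true});

// Feed the file chunk by chunk: the parsed elements still live in memory,
// but the raw XML text never has to exist as one huge string.
fs.createReadStream('file.osm', {encoding: 'utf8'})
    .on('data', function (chunk) { parser.write(chunk); })
    .on('end', function () { parser.end(); });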

tyrasd closed this as completed in 84dbbcb Mar 6, 2014
@tyrasd
Owner

tyrasd commented Mar 6, 2014

Fixed now in 84dbbcb. Can you give it a try? I tested it with JSON and XML files up to 1.5 GB; they worked well, with memory consumption up to ~7GB. One needs to run the script as node --max_old_space_size=… <path-to-osmtogeojson-script> with a reasonably large value, though; otherwise it gets stuck in some kind of garbage-collection loop. Isn't there a better way to tell an npm-installed script to use more memory?

@vsivsi
Contributor Author

vsivsi commented Mar 7, 2014

@tyrasd, Thanks for the update. I've tested against my "good" and "bad" JSON files and everything works. I've also tested with a ~500MB JSON file and that works as well. I don't currently work with XML data, but I've modified a couple of my Overpass queries to return XML and they work fine.

Isn't there a better way to tell a npm-installed script to use more memory?

One of my collaborators wrote this nodewrap script for our project; it runs a .js file (with parameters) while automatically setting node's --max_old_space_size value based upon the actual system memory. It currently has code paths for Linux and Darwin (no Windows support, sorry). You are welcome to take a look at it and adapt it or mine it for ideas. That project is GPLed, but we'd be happy to release that file separately under MIT if you would like to use it (with attribution).

@vsivsi
Contributor Author

vsivsi commented Mar 7, 2014

Slightly more explanation of how to use nodewrap: create a file with the executable name you want (e.g. osmtogeojson) with this in it:

#!/bin/bash
# JavaScript filename to run with node
js=osmtogeojson.js
# Run node script with custom memory settings
nodewrap "$js" "$@"
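Internally, all nodewrap really does is pick a heap size from the machine's RAM and re-exec node. In Node itself that amounts to something like the following sketch (not the actual nodewrap script; the 75%-of-RAM heuristic is just an example):

#!/usr/bin/env node
// nodewrap-style launcher: derive --max_old_space_size from the system memory.
var os = require('os');
var spawn = require('child_process').spawn;

var script = process.argv[2];                    // e.g. osmtogeojson.js
var args = process.argv.slice(3);                // remaining arguments, passed through
var heapMB = Math.floor(os.totalmem() / 1024 / 1024 * 0.75);   // example heuristic

spawn('node', ['--max_old_space_size=' + heapMB, script].concat(args), {stdio: 'inherit'})
    .on('exit', function (code) {
        process.exit(code || 0);
    });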

@tyrasd
Owner

tyrasd commented Mar 17, 2014

2.0.5 is now out, which includes the fixes from this issue. For now I've not included the nodewrap script (but thanks for the tip anyway!); instead, one should manually run the script with appropriate memory settings, as explained in the readme.
