
Memory leak when trying to import large data set via WriteStream #298

Closed
danburzo opened this issue Jan 9, 2015 · 7 comments

@danburzo

danburzo commented Jan 9, 2015

I am getting out of memory errors when I'm trying to import about 500MB worth of nodes from an OpenStreetMap JSON file. Here's my code, in case I am approaching this the wrong way:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    through2 = require('through2'),
    levelup = require('level');

var db = levelup('db');

var write_stream = db.createWriteStream();

fs.createReadStream('output/nodes.json',{ encoding: 'utf8' })
    .pipe(JSONStream.parse('*.*'))
    .pipe(through2.obj(function(item, enc, next){
        /* json looks like [ [ [id,lat,lon], [id,lat,lon] ... ] ] */
        this.push({
            key: item[0] + "",  // id
            value: item[1] + "," + item[2] // lat,lon
        });
        next();
    }))
    .pipe(write_stream);

This happens with level@0.18.0 (levelup@0.18.6 + leveldown@0.10.2), on Node v0.10.35 on OS X.

@mcollina
Member

mcollina commented Jan 9, 2015

Quite possible; we even want to deprecate WriteStream (#199). However, it should respect backpressure and not have this kind of issue.

My educated guess is that JSONStream puts everything into 'old mode' and then we lose backpressure somehow. Or worse, the JSON structure causes a massive stack in JSONStream. @dominictarr, your help is needed here :).

Can you please check whether the same happens with newline-delimited JSON (you can parse it with split2)?

In any case, I would recommend against a single 500MB JSON file. Use something simpler, like newline-delimited records, and everything will be easier. JSON is not built for big files.
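
For reference, here is a minimal sketch (not from the thread; the file names and the *.* JSONStream pattern are taken from the snippet above) of converting the large JSON file into the suggested newline-delimited id,lat,lon format, so it can later be read line by line with split2:

var fs = require('fs'),
    JSONStream = require('JSONStream'),
    through2 = require('through2');

fs.createReadStream('output/nodes.json', { encoding: 'utf8' })
    .pipe(JSONStream.parse('*.*'))          // emits each [id, lat, lon] array
    .pipe(through2.obj(function(item, enc, next){
        // one record per line: "id,lat,lon\n"
        next(null, item.join(',') + '\n');
    }))
    .pipe(fs.createWriteStream('output/nodes.txt'));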

@danburzo
Author

danburzo commented Jan 9, 2015

Thank you for the recommendations. Unfortunately, memory usage still goes through the roof (I am getting FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory) with this version:

var fs = require('fs'),
    through2 = require('through2'),
    split2 = require('split2'),
    levelup = require('level');

var db = levelup('db');

var write_stream = db.createWriteStream();

fs.createReadStream('output/nodes.txt', { encoding: 'utf8' })
    .pipe(split2())
    .pipe(through2.obj(function(line, enc, next){
        var parts = line.split(',');
        this.push({ 
            key: parts[0] + "", 
            value: parts[1] + "," + parts[2]
        });
        next();
    }))
    .pipe(write_stream);

nodes.txt is a 600MB newline-delimited set of nodes, one id,lat,lon record per line.

Indeed, it sounds like the backpressure is not working as expected, but I'm not sure whether levelup is trying to write too frequently or whether split2/JSONStream is simply too fast and levelup's buffer spills over by the time it gets to write. [N.B. I'm still a n00b with streams, so my intuition might be horribly off :-)]

@brycebaril

I used to have this problem as well with old versions of levelup, but I hadn't seen it in quite some time. I wonder if object-mode backpressure signalling is the issue, with the size of your keys being large. Maybe try manually setting lower highWaterMarks to test, as in the sketch below...
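
A rough sketch of what lowering the highWaterMarks could look like (not from the thread; whether levelup's WriteStream of that era forwards stream options is an assumption worth verifying):

var through2 = require('through2'),
    levelup = require('level');

var db = levelup('db');

// Smaller buffers mean backpressure kicks in earlier; 16 objects is an arbitrary choice.
var transform = through2.obj({ highWaterMark: 16 }, function(line, enc, next){
    var parts = line.split(',');
    next(null, { key: parts[0], value: parts[1] + ',' + parts[2] });
});

// Assumption: the option is passed through to the underlying writable stream.
var write_stream = db.createWriteStream({ highWaterMark: 16 });

// fs.createReadStream('output/nodes.txt', { encoding: 'utf8' })
//     .pipe(split2()).pipe(transform).pipe(write_stream);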

Also, you may want to check out https://github.com/maxogden/level-bulk-load, which attempts to auto-batch the writes (roughly the idea sketched below); in my experience that is a significant improvement.
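
For illustration, a minimal sketch (not from the thread) of the batching idea using levelup's own batch(); the helper name batchWriter and the batch size of 1000 are arbitrary:

var through2 = require('through2');

// Collect puts in memory and flush them with db.batch() instead of writing one key at a time.
function batchWriter(db, size) {
    var ops = [];
    return through2.obj(function(item, enc, next){
        ops.push({ type: 'put', key: item.key, value: item.value });
        if (ops.length >= size) {
            var chunk = ops;
            ops = [];
            db.batch(chunk, next); // wait for the batch to be written before pulling more data
        } else {
            next();
        }
    }, function(done){
        // flush whatever is left when the input ends
        ops.length ? db.batch(ops, done) : done();
    });
}

// usage, with parse_stream being the through2 transform from the snippets above:
// read_stream.pipe(split2()).pipe(parse_stream).pipe(batchWriter(db, 1000));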

Another optimization you can do is to reverse-sort your data by key before loading, though I wouldn't expect that to be necessary.

@maxogden has also spent a lot of time working on bulk-loading leveldb.

@mcollina
Member

mcollina commented Jan 9, 2015

Let me suggest one trick I learned in the past:

Replace:

        next();

with

        setImmediate(next);

Also, you should wait for your level instance to be open before starting the import.

@danburzo
Author

danburzo commented Jan 9, 2015

Wow, with setImmediate(next) it's writing like a champ! Although it seems to be slower, the memory is consistently at ~50MB and I don't expect it to crash any time soon.

Going with the levelup(db, callback) pattern does not help the memory problem in itself, but it does sound like a good idea since there can be quite a lot of writes queued by the time the database opens.

Keys are 64 bit integers and replacing them with 32 bit integers for the purpose of this exercise did help, but not enough to prevent the memory issue.

I will also check out level-bulk-load to see if it helps, but I'm glad there's a workable solution in the meantime.

@mcollina
Member

mcollina commented Jan 9, 2015

If you want it to be faster, you can call setImmediate only once every 100 (or more) samples. It's ugly, but it works.

@danburzo
Author

danburzo commented Jan 9, 2015

Awesome 👍
I'm using this for scripts in a controlled environment, so a bit of ugliness doesn't matter if it runs faster.
I'm also closing this issue in light of the decision to deprecate WriteStream in levelup.
Thank you, guys!

For future reference, this is the final code that works well:

var fs = require('fs'),
    through2 = require('through2'),
    split2 = require('split2'),
    levelup = require('level');

var i = 0;
levelup('db', function(err, db) {
    var write_stream = db.createWriteStream();

    fs.createReadStream('output/nodes.txt', { encoding: 'utf8' })
        .pipe(split2())
        .pipe(through2.obj(function(line, enc, next){
            var parts = line.split(',');
            this.push({ 
                key: parts[0], 
                value: parts[1] + "," + parts[2]
            });

            // Prevent memory leak
            // See: https://github.com/rvagg/node-levelup/issues/298
            if (i++ > 999) {
                setImmediate(next);
                i = 0;
            } else {
                next();
            }
        }))
        .pipe(write_stream);
});

danburzo closed this as completed Jan 9, 2015