
Performance issues / memory core dumps with big files #65

Open
melvinroest opened this issue Jun 9, 2020 · 5 comments
@melvinroest

melvinroest commented Jun 9, 2020

We're seeing the jsonld streaming parser run out of memory on large files.

To reproduce, store the following file as test.js in the project directory.

const fs = require("fs");
const { JsonLdParser } = require("./");
const zlib = require("zlib");
fs.createReadStream("./test.jsonld.gz")
  .pipe(zlib.createGunzip())
  .pipe(
    new JsonLdParser({
      baseIRI: "http://base"
    })
  )
  .on('data', () => {})

To run this file:

curl https://test.triply.cc/laurensrietveld/iconclass/assets/5eda510c6300450368fbd900 -L > test.jsonld.gz;
node test

Tested on version 2.0.2 and node 12.18.0 / 14.4.0.

@rubensworks
Owner

rubensworks commented Jun 10, 2020

Thanks for reporting!

This may be caused by the underlying jsonparse dependency.
I suspect that parser keeps too much in memory during parsing.

I'll look into whether this can be avoided; otherwise I may have to switch to a different streaming JSON parser (e.g. https://www.npmjs.com/package/stream-json), or write my own...

In the meantime, increasing Node's memory limit is the only easy workaround I can think of.
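For example, via Node's `--max-old-space-size` flag (the 4096 MB value below is illustrative; pick a limit that fits your machine):

```shell
# Raise V8's old-space heap limit to ~4 GB for this run only,
# then run the repro script from above.
node --max-old-space-size=4096 test
```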

@wouterbeek

@rubensworks Is it theoretically possible to parse JSON-LD with a streaming JSON parser? IIUC the format does not guarantee that all transformations are defined up-front. IOW, at the end of the stream a transformation may be defined that has to be applied to an element at the beginning of the stream.

I'm not sure whether this is a good example, but I'm thinking along the following lines:

{
  "abc": "def",
  "key": "123",
  "@context": {
    "@vocab": "https://example.com/",
    "key": "@id"
  }
}
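(For comparison, a streaming-friendly ordering of the same document would put the `@context` first, so a one-pass parser knows how to interpret each key as it arrives; the key order is the only change:)

```json
{
  "@context": {
    "@vocab": "https://example.com/",
    "key": "@id"
  },
  "abc": "def",
  "key": "123"
}
```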

@rubensworks
Owner

@wouterbeek Great question, the full answer is available in this spec.

TL;DR, under certain assumptions, JSON-LD can be parsed in a streaming manner.
When these assumptions are not met, then the parser can either fail, or fall back to a less efficient mode.
For this, the streamingProfile config option of this parser is important: enabling it allows the parser to make use of the streaming mode, but parsing will fail when a non-streamable JSON-LD document is detected.

@wouterbeek

Too bad the order is not enforced in the JSON-LD specification :-( An alternative would be to parse JSON-LD files twice: once to extract the schema and once to apply it.

@rubensworks
Owner

> Too bad the order is not enforced in the JSON-LD specification

True, the only way to enforce this is via the streaming profile.

> An alternative would be to parse JSON-LD files twice: once to extract the schema and once to apply it.

This is essentially what this parser will do internally when streaming mode is disabled (with some optimizations where possible).
