
Performance issues / memory core dumps with big files #65

Open
melvinroest opened this issue Jun 9, 2020 · 5 comments
@melvinroest

melvinroest commented Jun 9, 2020

We're seeing the jsonld streaming parser run out of memory on large files.

To reproduce, store the following file as test.js in the project directory.

const fs = require("fs");
const { JsonLdParser } = require("./");
const zlib = require("zlib");
fs.createReadStream("./test.jsonld.gz")
  .pipe(zlib.createGunzip())
  .pipe(
    new JsonLdParser({
      baseIRI: "http://base"
    })
  )
  .on('data', () => {})

To run this file:

curl https://test.triply.cc/laurensrietveld/iconclass/assets/5eda510c6300450368fbd900 -L > test.jsonld.gz;
node test

Tested on version 2.0.2 and node 12.18.0 / 14.4.0.

@rubensworks
Owner

rubensworks commented Jun 10, 2020

Thanks for reporting!

This may be caused by the underlying jsonparse dependency.
I suspect that parser keeps too much in memory during parsing.

I'll look into whether this can be avoided; otherwise I may have to switch to a different streaming JSON parser (e.g. https://www.npmjs.com/package/stream-json), or write my own...

In the meantime, increasing Node's memory limit is the only easy workaround I can think of.
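For example, via Node's `--max-old-space-size` flag (the 4096 MB value below is illustrative; pick a limit that fits your machine):

```shell
# Raise V8's old-space heap limit to ~4 GB for this run only,
# then run the repro script from above.
node --max-old-space-size=4096 test
```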

@wouterbeek

@rubensworks Is it theoretically possible to parse JSON-LD with a streaming JSON parser? IIUC the format does not guarantee that all transformations are defined up-front. IOW, at the end of the stream a transformation may be defined that has to be applied to an element at the beginning of the stream.

I'm not sure whether this is a good example, but I'm thinking along the following lines:

{
  "abc": "def",
  "key": "123",
  "@context": {
    "@vocab": "https://example.com/",
    "key": "@id"
  }
}
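(For comparison, a streaming-friendly ordering of the same document would put the `@context` first, so a one-pass parser knows how to interpret each key as it arrives; the key order is the only change:)

```json
{
  "@context": {
    "@vocab": "https://example.com/",
    "key": "@id"
  },
  "abc": "def",
  "key": "123"
}
```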

@rubensworks
Owner

@wouterbeek Great question, the full answer is available in this spec.

TL;DR, under certain assumptions, JSON-LD can be parsed in a streaming manner.
When these assumptions are not met, then the parser can either fail, or fall back to a less efficient mode.
For this, the streamingProfile config option of this parser is important: enabling it allows the parser to make use of the streaming mode, but parsing will fail when a non-streamable JSON-LD document is detected.

@wouterbeek

Too bad the order is not enforced in the JSON-LD specification :-( An alternative would be to parse JSON-LD files twice: once to extract the schema and once to apply it.

@rubensworks
Owner

> Too bad the order is not enforced in the JSON-LD specification

True, the only way to enforce this is via the streaming profile.

> An alternative would be to parse JSON-LD files twice: once to extract the schema and once to apply it.

This is essentially what this parser will do internally when streaming mode is disabled (with some optimizations where possible).
