
🐛 Scraper Memory usage #926

Closed
ulfgebhardt opened this issue Oct 25, 2019 · 2 comments

Comments


ulfgebhardt commented Oct 25, 2019

🐛 Bug report

When using an AsyncIterator in a for-await-of loop, I see a substantial memory leak.

I need this when scraping an HTML page that includes the information about the next HTML page to be scraped:

  1. Scrape data
  2. Evaluate data
  3. Scrape next data

The async part is needed since axios is used to obtain the HTML.

Here is a repro that shows the memory rising from ~4 MB to ~25 MB by the end of the script. The memory is not freed until the program terminates.

const scraper = async (): Promise<void> => {
    const browser = new BrowserTest();
    const parser = new ParserTest();

    for await (const data of browser) {
        console.log(await parser.parse(data));
    }
};

class BrowserTest {
    private i: number = 0;

    // Each call returns a progressively larger string
    // ('peter ' repeated i times), for 1000 iterations in total.
    public async next(): Promise<IteratorResult<string>> {
        this.i += 1;
        return {
            done: this.i > 1000,
            value: 'peter '.repeat(this.i)
        };
    }

    [Symbol.asyncIterator](): AsyncIterator<string> {
        return this;
    }
}

class ParserTest {
    public async parse(data: string): Promise<string[]> {
        return data.split(' ');
    }
}

scraper();

It looks like the data of the for-await-of loop is dangling in memory. The call stack gets huge as well.

In the repro the problem is still manageable, but in my actual code a whole HTML page (~250 kB per call) stays in memory.

This screenshot compares the heap memory on the first iteration with the heap memory after the last iteration.

(Cannot post inline screenshots yet.)

The expected workflow would be the following:

  • Obtain data
  • Process data
  • Extract info for the next "obtain data" step
  • Free all memory from the last "obtain data" step
  • Use the extracted information to restart the loop with the newly obtained data
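The workflow above can be sketched as an async generator. This is only a sketch under assumptions: fetchPage and the three-page "site" below are hypothetical stand-ins for the real axios request and the real page graph.

```typescript
interface Page { html: string; next?: string }

// Stand-ins: a tiny "site" of three linked pages instead of real axios calls.
const site: Record<string, Page> = {
    '/1': { html: '<p>one</p>', next: '/2' },
    '/2': { html: '<p>two</p>', next: '/3' },
    '/3': { html: '<p>three</p>' },
};

async function fetchPage(url: string): Promise<Page> {
    return site[url]; // hypothetical; a real scraper would await axios.get(url)
}

// Each iteration: obtain data, yield it for processing, extract the next URL.
// The previous page becomes unreachable as soon as the loop advances, so the
// GC can reclaim it before the next request starts.
async function* pages(startUrl: string): AsyncGenerator<string> {
    let url: string | undefined = startUrl;
    while (url !== undefined) {
        const page = await fetchPage(url); // obtain data
        yield page.html;                   // process data in the loop body
        url = page.next;                   // extract info for the next request
    }
}

async function crawl(startUrl: string): Promise<string[]> {
    const processed: string[] = [];
    for await (const html of pages(startUrl)) {
        processed.push(html); // stand-in for the real parser
    }
    return processed;
}
```

Because only one `page` binding exists per iteration, nothing keeps a reference to earlier pages once the loop moves on.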

I am unsure whether an AsyncIterator is the right choice here to achieve what is needed.

Any help/hint would be appreciated!

See: https://stackoverflow.com/questions/58454833/for-await-x-of-y-using-an-asynciterator-causes-memory-leak
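For reference, the repro's hand-rolled iterator can also be written as an async generator function, which is the more idiomatic form. This is a sketch assuming the same behaviour as the repro (1000 values, each 'peter ' repeated i times); it does not by itself change the memory profile.

```typescript
// The repro's BrowserTest, rewritten as an async generator function.
async function* browserTest(): AsyncGenerator<string> {
    for (let i = 1; i <= 1000; i++) {
        yield 'peter '.repeat(i);
    }
}

// Consume the generator; counting words stands in for ParserTest.parse.
async function scrape(): Promise<number> {
    let words = 0;
    for await (const data of browserTest()) {
        words += data.split(' ').length;
    }
    return words;
}
```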

ulfgebhardt (Member, Author) commented:

Contrary to the original post, I now believe that the memory is freed once the loop completes.


ulfgebhardt commented Oct 25, 2019

Requirements for our scrapers:

Data that depends on the previous dataset:

a references b references c
The request of the ScraperBrowser must complete before the next dataset can be obtained.
To reduce HTTP requests it would be nice to return the HTML from the browser and hand it to the parser.

Data that has an index:

index references n URLs
The request of the ScraperBrowser returns n datasets, which might be paginated. The datasets are not needed to obtain the next ones.
To enable parallelization it would be nice to return Promises for the HTML requests from the browser and hand them to the parser.
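The parallelized variant could be sketched like this. It is a hypothetical sketch: fetchHtml stands in for the real axios call, and scrapeIndex for whatever consumes the index.

```typescript
// Hypothetical stand-in for the real axios request.
async function fetchHtml(url: string): Promise<string> {
    return `<html>${url}</html>`;
}

// The index references n independent URLs, so all requests can be started
// immediately and awaited together; the parser then consumes the promises
// (or the resolved HTML) without serializing the HTTP round trips.
async function scrapeIndex(urls: string[]): Promise<string[]> {
    const inFlight: Promise<string>[] = urls.map(url => fetchHtml(url)); // all start here
    return Promise.all(inFlight); // resolves once every page has arrived
}
```

Unlike the sequential case, nothing here waits for one dataset before requesting the next.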
