
🐛 Scraper Memory usage #926

Closed
ulfgebhardt opened this issue Oct 25, 2019 · 2 comments

Comments


ulfgebhardt commented Oct 25, 2019

🐛 Bug report

When using an AsyncIterator in a for-await-of loop, I see a substantial memory leak.

I need this when scraping an HTML page that includes the information about the next HTML page to be scraped:

  1. Scrape data
  2. Evaluate data
  3. Scrape next data

The async part is needed since axios is used to obtain the HTML.

Here is a repro that shows the memory rising from ~4 MB to ~25 MB by the end of the script. The memory is not freed until the program terminates.

const scraper = async (): Promise<void> => {
    const browser = new BrowserTest();
    const parser = new ParserTest();

    for await (const data of browser) {
        console.log(await parser.parse(data));
    }
};

class BrowserTest {
    private i: number = 0;

    // Each call returns a progressively larger string
    // ('peter ' repeated i times), for 1000 iterations in total.
    public async next(): Promise<IteratorResult<string>> {
        this.i += 1;
        return {
            done: this.i > 1000,
            value: 'peter '.repeat(this.i)
        };
    }

    [Symbol.asyncIterator](): AsyncIterator<string> {
        return this;
    }
}

class ParserTest {
    public async parse(data: string): Promise<string[]> {
        return data.split(' ');
    }
}

scraper();

It looks like the data of the for-await-of loop is dangling in memory. The call stack gets huge as well.

In the repro the problem is still manageable, but in my actual code a whole HTML page (~250 kB per call) stays in memory.

This screenshot compares the heap memory on the first iteration with the heap memory after the last iteration.

(Cannot post inline screenshots yet.)

The expected workflow would be the following:

  • Obtain data
  • Process data
  • Extract info for the next "obtain data" step
  • Free all memory from the last "obtain data" step
  • Use the extracted information to restart the loop with the newly obtained data
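The workflow above can be sketched as an async generator. This is only a sketch under assumptions: fetchPage and the three-page "site" below are hypothetical stand-ins for the real axios request and the real page graph.

```typescript
interface Page { html: string; next?: string }

// Stand-ins: a tiny "site" of three linked pages instead of real axios calls.
const site: Record<string, Page> = {
    '/1': { html: '<p>one</p>', next: '/2' },
    '/2': { html: '<p>two</p>', next: '/3' },
    '/3': { html: '<p>three</p>' },
};

async function fetchPage(url: string): Promise<Page> {
    return site[url]; // hypothetical; a real scraper would await axios.get(url)
}

// Each iteration: obtain data, yield it for processing, extract the next URL.
// The previous page becomes unreachable as soon as the loop advances, so the
// GC can reclaim it before the next request starts.
async function* pages(startUrl: string): AsyncGenerator<string> {
    let url: string | undefined = startUrl;
    while (url !== undefined) {
        const page = await fetchPage(url); // obtain data
        yield page.html;                   // process data in the loop body
        url = page.next;                   // extract info for the next request
    }
}

async function crawl(startUrl: string): Promise<string[]> {
    const processed: string[] = [];
    for await (const html of pages(startUrl)) {
        processed.push(html); // stand-in for the real parser
    }
    return processed;
}
```

Because only one `page` binding exists per iteration, nothing keeps a reference to earlier pages once the loop moves on.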

I am unsure whether an AsyncIterator is the right choice here to achieve what is needed.

Any help/hint would be appreciated!

See: https://stackoverflow.com/questions/58454833/for-await-x-of-y-using-an-asynciterator-causes-memory-leak
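For reference, the repro's hand-rolled iterator can also be written as an async generator function, which is the more idiomatic form. This is a sketch assuming the same behaviour as the repro (1000 values, each 'peter ' repeated i times); it does not by itself change the memory profile.

```typescript
// The repro's BrowserTest, rewritten as an async generator function.
async function* browserTest(): AsyncGenerator<string> {
    for (let i = 1; i <= 1000; i++) {
        yield 'peter '.repeat(i);
    }
}

// Consume the generator; counting words stands in for ParserTest.parse.
async function scrape(): Promise<number> {
    let words = 0;
    for await (const data of browserTest()) {
        words += data.split(' ').length;
    }
    return words;
}
```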

ulfgebhardt (Member, Author) commented:

Contrary to the original post, I now believe that the memory is freed once the loop completes.


ulfgebhardt commented Oct 25, 2019

Requirements for our scrapers:

Data that depends on the previous dataset:

a references b references c
The request of the ScraperBrowser must complete before the next dataset can be obtained.
To reduce HTTP requests it would be nice to return the HTML from the browser and hand it to the parser.

Data that has an index:

index references n URLs
The request of the ScraperBrowser returns n datasets, which might be paginated. The datasets are not needed to obtain the next ones.
To enable parallelization it would be nice to return Promises for the HTML requests from the browser and hand them to the parser.
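The parallelized variant could be sketched like this. It is a hypothetical sketch: fetchHtml stands in for the real axios call, and scrapeIndex for whatever consumes the index.

```typescript
// Hypothetical stand-in for the real axios request.
async function fetchHtml(url: string): Promise<string> {
    return `<html>${url}</html>`;
}

// The index references n independent URLs, so all requests can be started
// immediately and awaited together; the parser then consumes the promises
// (or the resolved HTML) without serializing the HTTP round trips.
async function scrapeIndex(urls: string[]): Promise<string[]> {
    const inFlight: Promise<string>[] = urls.map(url => fetchHtml(url)); // all start here
    return Promise.all(inFlight); // resolves once every page has arrived
}
```

Unlike the sequential case, nothing here waits for one dataset before requesting the next.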
