
CheerioCrawler - add got-scraping headers persistence per session. #1008

Closed
petrpatek opened this issue Apr 29, 2021 · 17 comments
Labels
feature Issues that represent new features or improvements to existing features.

Comments

@petrpatek
Contributor

Describe the feature
If cookie persistence is enabled on the session, the generated headers should be persisted in the session's userData object as well. If no headers are found there, got-scraping works with its default settings; if headers are found, they are reused so the session stays consistent.

Inspiration

const { headers } = session.userData;
const response = await gotScraping({
    url: request.url,
    proxyUrl,
    useHeaderGenerator: !headers,
    headers: {
        ...headers,
        cookie: session.getCookieString(request.url),
    },
});

session.setCookiesFromResponse(response);
session.userData.headers = response.request.options.headers;

Motivation
got-scraping automatically generates headers for each request, which means that even if the SessionPool cookie persistence is enabled, it generates new headers over and over, which can cause serious blocking issues.

Constraints
I don't know about any.

@petrpatek petrpatek added the feature Issues that represent new features or improvements to existing features. label Apr 29, 2021
@szmarczak
Contributor

I think we can fix this in got-scraping? We could use a WeakMap with Symbols as keys, but Symbols as WeakMap keys aren't supported yet. Alternatively, we could expose something like this:

gotScraping.prepareHeaders('unique-session-id')

await gotScraping({
	...,
	headersSessionId: 'unique-session-id'
})

// perform all the requests

gotScraping.removeHeaders('unique-session-id')

What if a request fails? When to change the generated headers?

@mnmkng
Member

mnmkng commented Aug 9, 2021

I'm not sure if it makes sense to add this to the got-scraping API, but I don't oppose it either.

The SessionPool works in a way that if a request fails due to blocking (timeouts, 403s...), it increments the session's error score. Once the score crosses a certain threshold, the session is discarded. So technically the Pool should take care of the request-error problem, because it will no longer use the associated session ID.

So technically, yeah. The ability to save the headers and then remove them would be compatible with how SessionPool works. We could do:

sessionPool.on('sessionRetired', (s) => gotScraping.removeHeaders(s.id))

@petrpatek
Contributor Author

I think of this more as an SDK feature than a got-scraping feature, because of the error handling on the session (proxy IP included), as @mnmkng mentioned.

@mnmkng
Copy link
Member

mnmkng commented Aug 9, 2021

On the other hand, even if we forget about the whole SessionPool, got-scraping probably should have some API to control the generated headers somehow. Now it just randomly generates headers, but there's no way to "lock" the working ones or discard the crappy ones, when you're using got-scraping only. So maybe some generic API in got-scraping would make sense.

@szmarczak
Contributor

szmarczak commented Aug 9, 2021

Example:
const http2 = require('http2-wrapper');
// Works without .default as well, but with .default IDEs give type hints, since it's a TypeScript module
const gotScraping = require('.').default;

const { context } = gotScraping.defaults.options;
const instance = gotScraping.extend({
    context: {
        headers: {
            1: context.headerGenerator.getHeaders({
                httpVersion: '1',
                ...context.headerGeneratorOptions,
            }),
            2: context.headerGenerator.getHeaders({
                httpVersion: '2',
                ...context.headerGeneratorOptions,
            }),
        },
        useHeaderGenerator: false,
    },
    hooks: {
        beforeRequest: [
            async (options) => {
                const { url } = options;

                const protocol = await http2.auto.resolveProtocol({
                    host: url.hostname,
                    port: url.port || 443,
                    rejectUnauthorized: false,
                    ALPNProtocols: ['h2', 'http/1.1'],
                    servername: url.hostname,
                });

                let headers;
                if (protocol === 'h2') {
                    headers = options.context.headers[2]; // eslint-disable-line prefer-destructuring
                    options.ALPNProtocols = ['h2'];
                } else {
                    headers = options.context.headers[1]; // eslint-disable-line prefer-destructuring
                    options.ALPNProtocols = ['http/1.1'];
                }

                // TODO: proper merge
                Object.assign(options.headers, headers);
            },
        ],
    },
});

console.log(instance.defaults.options.context);

(async () => {
    // httpbin.org always normalizes the headers to Pascal-Case
    const response = await instance('https://httpbin.org/anything', {
        responseType: 'json',
    }).on('request', (request) => {
        request.once('response', () => {
            console.log(request.socket.alpnProtocol);
        });
    });

    console.log(response.body.headers['User-Agent']);
})();

@mnmkng
Member

mnmkng commented Aug 9, 2021

This is a good showcase of the implementation, but more importantly we need to see what the user interface would look like.

@szmarczak
Contributor

It would also be good for the two header sets to be related to each other. E.g. if the HTTP/1 headers contain dnt, then the HTTP/2 headers should have it as well.
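For illustration, keeping the two sets consistent could look like this. A minimal sketch: `toHttp2Headers` is a hypothetical helper, not part of header-generator, and it derives the HTTP/2 set from the HTTP/1 one by lowercasing names (HTTP/2 requires lowercase header names) and dropping connection-specific headers that are forbidden in HTTP/2, so flags like DNT carry over automatically:

```javascript
// Hypothetical sketch: derive HTTP/2 headers from a generated HTTP/1 set,
// so browser-specific flags like DNT stay consistent across protocols.

// HTTP/1-only headers that must not appear in an HTTP/2 request.
const H2_FORBIDDEN = new Set([
    'connection', 'upgrade', 'keep-alive', 'proxy-connection', 'transfer-encoding',
]);

function toHttp2Headers(h1Headers) {
    const h2 = {};
    for (const [name, value] of Object.entries(h1Headers)) {
        const lower = name.toLowerCase(); // HTTP/2 header names are lowercase
        if (!H2_FORBIDDEN.has(lower)) h2[lower] = value;
    }
    return h2;
}

const h1 = {
    'User-Agent': 'Mozilla/5.0 (example)',
    DNT: '1',
    Connection: 'keep-alive',
};

const h2 = toHttp2Headers(h1);
console.log(h2.dnt);             // '1' — the DNT flag carried over
console.log('connection' in h2); // false — HTTP/1-only header dropped
```

This only covers one direction (HTTP/1 to HTTP/2); generating both sets from the same browser fingerprint, as header-generator does, would also work.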

@mnmkng
Member

mnmkng commented Aug 20, 2021

I don't think that's needed. I might be wrong, but I assume a website either accepts H2 or doesn't. So once we use H2 we will always use H2 for that website. No need to have matching H1 headers then.

@szmarczak
Contributor

I think that header-generator could be used in the Session constructor. It would generate headers for HTTP/1 and HTTP/2 and pass them in requestOptions.context.headers. Then got-scraping would merge those. Wdyt?

@mnmkng
Member

mnmkng commented Aug 20, 2021

I think it's a bit overcomplicated. We can do it two ways IMO.

  1. Let got-scraping handle it completely. Accept an ID on input and either generate or retrieve generated headers from cache.

  2. Let sessions handle it by saving the headers generated for the got-scraping request object and then always using those, which should override any newly generated headers.

Personally I would go with 1. because we want to replace SessionPool with UserPool in the future so who knows how it will look.

@szmarczak
Contributor

If the ID is an object then we can store the headers in a WeakMap and not care about cleaning up. Something like:

// in Session
this.sessionToken = new Object();

// in `requestAsBrowser` or somewhere else appropriate
gotScraping(..., { context: { sessionToken } })

@mnmkng
Member

mnmkng commented Aug 20, 2021

Technically, we could use the whole Session as a key. But I'm not sure what the suggested usage would be for people who use only got-scraping. I don't think I've ever seen objects being created just to serve as IDs.

@szmarczak
Contributor

Technically, we could use the whole Session as a key.

Yeah, that would work as well! 👍🏼

But not sure what the suggested usage would be for people who use only got-scraping.

The best way would be to use a symbol as the key, but that's not possible; they can either use new SomeClass() or just {}.

const sessionToken = {}; // it doesn't matter what the object is, it just has to be non-primitive

gotScraping(..., { context: { sessionToken } })

I don't think I've ever seen creating objects as IDs.

Because it's rare. But that's the point of a WeakMap - we want to store the value (headers) only as long as the key (token) exists :)
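A minimal, self-contained sketch of the WeakMap idea: `generateHeaders` stands in for got-scraping's header-generator (the real API differs), and the cached headers live exactly as long as the token object does, so no explicit cleanup call is needed:

```javascript
// Sketch: per-session header persistence keyed on a token object.
// `generateHeaders` is a stand-in for the real header-generator.
const headersByToken = new WeakMap();

function generateHeaders() {
    // Fake "generated" headers; a new random set on every call.
    return { 'user-agent': `UA-${Math.random().toString(36).slice(2)}` };
}

function headersFor(sessionToken) {
    let headers = headersByToken.get(sessionToken);
    if (!headers) {
        headers = generateHeaders();       // first request of the session
        headersByToken.set(sessionToken, headers);
    }
    return headers;                        // every later request reuses them
}

const sessionToken = {}; // any non-primitive value works as a key

const first = headersFor(sessionToken);
const second = headersFor(sessionToken);
console.log(first === second); // true — same headers for the whole session
// Once sessionToken becomes unreachable, the WeakMap entry is
// garbage-collected automatically; no removeHeaders() call is needed.
```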

@mnmkng
Member

mnmkng commented Aug 20, 2021

Yeah, I know WeakMaps. I'm only concerned that we're trying a bit too hard to use them. When designing APIs we should put the user first. And I think we can safely say that most users expect string IDs. Anything else would be counter-intuitive and would need to be explained in docs. For us, WeakMaps are the "easy way out". But I'm not convinced (yet) it's the best approach for the users.

@szmarczak
Contributor

szmarczak commented Aug 20, 2021

I don't think strings are an easy way either. We'd need to explain how to generate them so that they're unique. See https://nodejs.org/api/crypto.html#crypto_crypto_randomuuid_options

@szmarczak
Contributor

I prefer WeakMap, since with it newcomers don't have to think about cleanup 🤔
See apify/got-scraping#46 for the proposed solution :)

@Dulanjala007

What if you can't keep the object around to reuse, e.g. inside a stateless server environment?
I'm looking for an alternative precisely because of this problem...
