
CheerioCrawler - add got-scraping headers persistence per session. #1008

Closed
petrpatek opened this issue Apr 29, 2021 · 17 comments
Labels
feature Issues that represent new features or improvements to existing features.

Comments

@petrpatek
Contributor

Describe the feature
If cookie persistence is enabled on the session, the generated headers should be persisted in the session's userData object as well. If no headers are found there, got-scraping works with its default settings; if headers are found, they are reused so the session stays consistent.

Inspiration

const { headers } = session.userData;
const response = await gotScraping({
    url: request.url,
    proxyUrl,
    useHeaderGenerator: !headers,
    headers: {
        ...headers,
        cookie: session.getCookieString(request.url),
    },
});

session.setCookiesFromResponse(response);
session.userData.headers = response.request.options.headers;

Motivation
got-scraping automatically generates headers for each request, which means that even if the SessionPool cookie persistence is enabled, it generates new headers over and over, which can cause serious blocking issues.

Constraints
I don't know about any.

@petrpatek petrpatek added the feature Issues that represent new features or improvements to existing features. label Apr 29, 2021
@szmarczak
Contributor

I think we can fix this in got-scraping? We could use a WeakMap with Symbols as keys, but Symbols as WeakMap keys aren't supported yet. Alternatively, we could expose something like this:

gotScraping.prepareHeaders('unique-session-id')

await gotScraping({
	...,
	headersSessionId: 'unique-session-id'
})

// perform all the requests

gotScraping.removeHeaders('unique-session-id')

What if a request fails? When to change the generated headers?

@mnmkng
Member

mnmkng commented Aug 9, 2021

I'm not sure if it makes sense to add this to the got-scraping API, but I don't oppose it either.

The SessionPool works in a way that if a request fails due to blocking (timeouts, 403s...), it increments the session's error score. Once the score crosses a certain threshold, the session is discarded. So technically the Pool should take care of the request-error problem, because it will no longer use the associated session ID.

So technically, yeah. The ability to save the headers and then remove them would be compatible with how SessionPool works. We could do:

sessionPool.on('sessionRetired', (s) => gotScraping.removeHeaders(s.id))

@petrpatek
Contributor Author

I think of this more as an SDK feature than a got-scraping feature, because of the error handling on the session (proxy IP included), as @mnmkng mentioned.

@mnmkng
Copy link
Member

mnmkng commented Aug 9, 2021

On the other hand, even if we forget about the whole SessionPool, got-scraping probably should have some API to control the generated headers somehow. Now it just randomly generates headers, but there's no way to "lock" the working ones or discard the crappy ones, when you're using got-scraping only. So maybe some generic API in got-scraping would make sense.

@szmarczak
Contributor

szmarczak commented Aug 9, 2021

Example:
const http2 = require('http2-wrapper');
// Works without .default as well, but with .default IDEs give type hints, since it's a TypeScript module
const gotScraping = require('.').default;

const { context } = gotScraping.defaults.options;
const instance = gotScraping.extend({
    context: {
        headers: {
            1: context.headerGenerator.getHeaders({
                httpVersion: '1',
                ...context.headerGeneratorOptions,
            }),
            2: context.headerGenerator.getHeaders({
                httpVersion: '2',
                ...context.headerGeneratorOptions,
            }),
        },
        useHeaderGenerator: false,
    },
    hooks: {
        beforeRequest: [
            async (options) => {
                const { url } = options;

                const protocol = await http2.auto.resolveProtocol({
                    host: url.hostname,
                    port: url.port || 443,
                    rejectUnauthorized: false,
                    ALPNProtocols: ['h2', 'http/1.1'],
                    servername: url.hostname,
                });

                let headers;
                if (protocol === 'h2') {
                    headers = options.context.headers[2]; // eslint-disable-line prefer-destructuring
                    options.ALPNProtocols = ['h2'];
                } else {
                    headers = options.context.headers[1]; // eslint-disable-line prefer-destructuring
                    options.ALPNProtocols = ['http/1.1'];
                }

                // TODO: proper merge
                Object.assign(options.headers, headers);
            },
        ],
    },
});

console.log(instance.defaults.options.context);

(async () => {
    // httpbin.org always normalizes the headers to Pascal-Case
    const response = await instance('https://httpbin.org/anything', {
        responseType: 'json',
    }).on('request', (request) => {
        request.once('response', () => {
            console.log(request.socket.alpnProtocol);
        });
    });

    console.log(response.body.headers['User-Agent']);
})();

@mnmkng
Member

mnmkng commented Aug 9, 2021

This is a good showcase of the implementation, but more importantly we need to see what the user interface would look like.

@szmarczak
Contributor

It would also be good for the two header sets to be related to each other. E.g. if the HTTP/1 headers contain dnt, then the HTTP/2 headers should have it as well.
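For illustration, keeping the two sets consistent could look like this. A minimal sketch: `toHttp2Headers` is a hypothetical helper, not part of header-generator, and it derives the HTTP/2 set from the HTTP/1 one by lowercasing names (HTTP/2 requires lowercase header names) and dropping connection-specific headers that are forbidden in HTTP/2, so flags like DNT carry over automatically:

```javascript
// Hypothetical sketch: derive HTTP/2 headers from a generated HTTP/1 set,
// so browser-specific flags like DNT stay consistent across protocols.

// HTTP/1-only headers that must not appear in an HTTP/2 request.
const H2_FORBIDDEN = new Set([
    'connection', 'upgrade', 'keep-alive', 'proxy-connection', 'transfer-encoding',
]);

function toHttp2Headers(h1Headers) {
    const h2 = {};
    for (const [name, value] of Object.entries(h1Headers)) {
        const lower = name.toLowerCase(); // HTTP/2 header names are lowercase
        if (!H2_FORBIDDEN.has(lower)) h2[lower] = value;
    }
    return h2;
}

const h1 = {
    'User-Agent': 'Mozilla/5.0 (example)',
    DNT: '1',
    Connection: 'keep-alive',
};

const h2 = toHttp2Headers(h1);
console.log(h2.dnt);             // '1' — the DNT flag carried over
console.log('connection' in h2); // false — HTTP/1-only header dropped
```

This only covers one direction (HTTP/1 to HTTP/2); generating both sets from the same browser fingerprint, as header-generator does, would also work.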

@mnmkng
Member

mnmkng commented Aug 20, 2021

I don't think that's needed. I might be wrong, but I assume a website either accepts H2 or doesn't. So once we use H2 we will always use H2 for that website. No need to have matching H1 headers then.

@szmarczak
Contributor

I think that header-generator could be used in the Session constructor. It would generate headers for HTTP/1 and HTTP/2 and pass them in requestOptions.context.headers. Then got-scraping would merge those. Wdyt?

@mnmkng
Member

mnmkng commented Aug 20, 2021

I think it's a bit overcomplicated. We can do it two ways IMO.

  1. Let got-scraping handle it completely. Accept an ID on input and either generate or retrieve generated headers from cache.

  2. Let sessions handle it by saving the headers generated for the got-scraping request object and then always using those, which should override any newly generated headers.

Personally I would go with 1. because we want to replace SessionPool with UserPool in the future so who knows how it will look.

@szmarczak
Contributor

If the ID is an object then we can store the headers in a WeakMap and not care about cleaning up. Something like:

// in Session
this.sessionToken = new Object();

// in `requestAsBrowser` or somewhere else appropriate
gotScraping(..., { context: { sessionToken } })

@mnmkng
Member

mnmkng commented Aug 20, 2021

Technically, we could use the whole Session as a key. But I'm not sure what the suggested usage would be for people who use only got-scraping. I don't think I've ever seen objects being created just to serve as IDs.

@szmarczak
Contributor

Technically, we could use the whole Session as a key.

Yeah, that would work as well! 👍🏼

But not sure what the suggested usage would be for people who use only got-scraping.

The best way would be to use a symbol as the key, but that's not possible; they can either use new SomeClass() or just {}.

const sessionToken = {}; // it doesn't matter what the object is, it just has to be non-primitive

gotScraping(..., { context: { sessionToken } })

I don't think I've ever seen creating objects as IDs.

Because it's rare. But that's the point of a WeakMap - we want to store the value (headers) only as long as the key (token) exists :)
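A minimal, self-contained sketch of the WeakMap idea: `generateHeaders` stands in for got-scraping's header-generator (the real API differs), and the cached headers live exactly as long as the token object does, so no explicit cleanup call is needed:

```javascript
// Sketch: per-session header persistence keyed on a token object.
// `generateHeaders` is a stand-in for the real header-generator.
const headersByToken = new WeakMap();

function generateHeaders() {
    // Fake "generated" headers; a new random set on every call.
    return { 'user-agent': `UA-${Math.random().toString(36).slice(2)}` };
}

function headersFor(sessionToken) {
    let headers = headersByToken.get(sessionToken);
    if (!headers) {
        headers = generateHeaders();       // first request of the session
        headersByToken.set(sessionToken, headers);
    }
    return headers;                        // every later request reuses them
}

const sessionToken = {}; // any non-primitive value works as a key

const first = headersFor(sessionToken);
const second = headersFor(sessionToken);
console.log(first === second); // true — same headers for the whole session
// Once sessionToken becomes unreachable, the WeakMap entry is
// garbage-collected automatically; no removeHeaders() call is needed.
```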

@mnmkng
Member

mnmkng commented Aug 20, 2021

Yeah, I know WeakMaps. I'm only concerned that we're trying a bit too hard to use them. When designing APIs we should put the user first. And I think we can safely say that most users expect string IDs. Anything else would be counter-intuitive and would need to be explained in docs. For us, WeakMaps are the "easy way out". But I'm not convinced (yet) it's the best approach for the users.

@szmarczak
Contributor

szmarczak commented Aug 20, 2021

I don't think strings are an easy way either. We'd need to explain how to generate them so that they're unique. See https://nodejs.org/api/crypto.html#crypto_crypto_randomuuid_options

@szmarczak
Contributor

I prefer WeakMap, since with it newcomers don't have to think about cleanup 🤔
See apify/got-scraping#46 for the proposed solution :)

@Dulanjala007

What if you can't keep the object around to reuse, e.g. inside a stateless server environment?
I'm looking for an alternative precisely because of this problem...
