(Patreon) New "403 Forbidden" Cloudflare CAPTCHA error with 1.15.3 #1117

Closed
biznizz opened this issue Nov 15, 2020 · 58 comments

Comments

@biznizz

biznizz commented Nov 15, 2020

Have all dependencies and gallery-dl up to date, but have been getting constant 403 errors.

Made sure that session_id cookie was up to date in config, no dice.

Exporting all Patreon cookies into a cookies.txt and updating config to point to it leads to an error which reads "No session_id set", but downloads free posts.

Cloudflare has been blocking for about a whole day at this point, and verbose doesn't really give any useful information.

@Butterfly-Dragon

yup, can confirm

@mikf
Owner

mikf commented Nov 15, 2020

What version of requests and urllib3 are both of you using? There was an update for both of them in the last couple of days. Maybe that causes these problems?

pip install -U requests==2.24.0 urllib3==1.25.11 should revert to the older versions.
The 1.15.3 .exe is also using the older versions of those libraries.

> Exporting all Patreon cookies into a cookies.txt and updating config to point to it leads to an error which reads "No session_id set", but downloads free posts.

There is no other warning/error message and the cookies.txt file actually contains a session_id cookie for patreon?

@Butterfly-Dragon

oh, let me see...

@Butterfly-Dragon

urllib3 1.26.2
requests 2.25.0
requests-oauthlib 1.3.0
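
Editor's sketch: the version check discussed above can be scripted. This is a minimal example, assuming Python 3.8+ and assuming (based only on the reports in this thread) that urllib3 1.26.0 and later are the affected releases. The names `KNOWN_GOOD`, `parse_version`, `has_new_tls_defaults`, and `installed_version` are made up for illustration.

```python
from importlib import metadata

# Versions reported as working in this thread (not an official compatibility list).
KNOWN_GOOD = {"requests": "2.24.0", "urllib3": "1.25.11"}


def parse_version(text):
    """Turn '1.26.2' into (1, 26, 2); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def has_new_tls_defaults(urllib3_version):
    """True for urllib3 1.26.0 and later (assumption taken from this thread)."""
    return parse_version(urllib3_version) >= (1, 26, 0)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report():
    """Print installed versions next to the versions reported as working."""
    for name, good in KNOWN_GOOD.items():
        print(f"{name}: installed={installed_version(name)} known-good={good}")
```

Calling `report()` prints one line per library, and `has_new_tls_defaults(installed_version("urllib3"))` tells you whether the installed urllib3 falls into the range reported as problematic here.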

@Butterfly-Dragon

yup. can confirm it now totally works. 😅

@Butterfly-Dragon

so... since i'm antsy about keeping old versions of the libraries, should i remove this from my daily updates and let it update only when you tell me to... or what?

@mikf
Owner

mikf commented Nov 15, 2020

Either that, or you create a virtualenv for gallery-dl to keep its dependencies separate from all the other Python packages.
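
Editor's sketch: one way to follow the virtualenv suggestion is the standard-library `venv` module. The helper names and the directory name in the usage note are hypothetical; the pinned versions are the ones given earlier in this thread. Nothing runs until `create_env` is called, and calling it needs network access for pip.

```python
import os
import subprocess
import venv

# Pins taken from the workaround posted earlier in this thread.
PINNED = ["gallery-dl", "requests==2.24.0", "urllib3==1.25.11"]


def venv_python(env_dir):
    """Path of the interpreter inside the virtual environment."""
    if os.name == "nt":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")


def install_command(env_dir, packages=PINNED):
    """Build the pip command that installs the pinned packages into env_dir."""
    return [venv_python(env_dir), "-m", "pip", "install", *packages]


def create_env(env_dir):
    """Create the environment and install the pinned packages into it."""
    venv.EnvBuilder(with_pip=True).create(env_dir)
    subprocess.run(install_command(env_dir), check=True)
```

For example, `create_env("gallery-dl-env")` would leave the system-wide requests and urllib3 free to update while gallery-dl keeps using the pinned copies inside that directory.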

@Butterfly-Dragon

no, it actually solved several problems i was having the last 2 days. i would just like to skip this version, but right now i just set it to not update those 2 packages, might resume when they get updated.

@biznizz
Author

biznizz commented Nov 15, 2020

> What version of requests and urllib3 are both of you using? There was an update for both of them in the last couple of days. Maybe that causes these problems?

Everything was as up-to-date as possible, so it would have been requests 2.25.0 and urllib3 1.26.2.

> pip install -U requests==2.24.0 urllib3==1.25.11 should revert to the older versions.

Reverted and now the Cloudflare issue is gone. Will the next update of gallery-dl include support for the latest versions of urllib3 and requests? Or should I just keep the dependencies on these older versions for the foreseeable future?

> There is no other warning/error message and the cookies.txt file actually contains a session_id cookie for patreon?

Yes, even when the cookies.txt had the session_id cookie in the file and the config was set to read that file, it gave an error like the cookie wasn't there, but it downloaded unlocked posts.
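
Editor's sketch: to rule out a malformed export, a cookies.txt file can be checked for the session_id cookie with the standard library. This assumes the file is in Netscape/Mozilla format with its usual header line; `find_cookie` is a hypothetical helper and has nothing to do with how gallery-dl itself reads the file.

```python
from http.cookiejar import MozillaCookieJar


def find_cookie(cookie_file, name, domain_suffix):
    """Return the first cookie called `name` for `domain_suffix`, or None."""
    jar = MozillaCookieJar()
    # Keep session cookies and already-expired cookies so they can be inspected.
    jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    for cookie in jar:
        if cookie.name == name and cookie.domain.endswith(domain_suffix):
            return cookie
    return None
```

`find_cookie("cookies.txt", "session_id", "patreon.com")` returns the cookie object if it is present; its `is_expired()` method then shows whether the export is stale.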

@biznizz
Author

biznizz commented Nov 28, 2020

With the latest dev build being 1.16.0.dev0, is it safe to upgrade the dependencies to the latest versions?

Pip has informed me that urllib3, requests, packaging, and cffi all have new versions that can be installed, but I don't want to risk hitting more Cloudflare issues.

@Butterfly-Dragon

The fix was given to us 12 days ago.
Urllib3 most recent is from 15 days ago
https://github.com/urllib3/urllib3/releases
And requests is from 17 days ago
https://github.com/psf/requests/releases
Sooo... i guess not?
Packaging has updated last time 10 hours ago and cffi is not developed on github so that should be fine too...

@mikf
Owner

mikf commented Nov 28, 2020

It should be OK to update everything except urllib3.

The problem lies in the new default behavior when establishing a TLS connection in urllib3 version 1.26.0 (Changelog). It might be possible to monkey-patch said defaults to how they were before 1.26.0, but this is a lot of grunt work and the fact that gallery-dl doesn't even use this library directly, but only through requests, doesn't help.

@Butterfly-Dragon

but... why should using a better encryption give us a 403 error?!? 🤔

@Butterfly-Dragon

hmmm... apparently in TLS 2.0 the server looks for the client's certificate to continue, i guess ""we"" are not providing a certificate that works for patreon?

@mikf
Owner

mikf commented Dec 3, 2020

This is more about whether or not Cloudflare thinks a request comes from a browser controlled by a human being or a bot, and it uses the TLS handshake among other things to determine that. Why Cloudflare believes requests with urllib3 1.26 are from a bot, but not with 1.25 is beyond me, but at least we know what works, just not the why.

For example the latest Firefox versions only accept TLS 1.2 and 1.3, as does urllib3 1.26 (bot according to Cloudflare) in contrast to urllib3 1.25, which allows TLS 1.0, 1.1, 1.2, and 1.3 (not a bot). Maybe changing gallery-dl's user agent to some newer browser version is all that is needed to make it work with urllib3 1.26? (It currently uses Firefox 68 as user agent)
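
Editor's sketch: the TLS parameters mentioned above (accepted protocol versions, offered ciphers) are chosen by the client and sent during the handshake, which is why a server can use them as a fingerprint. The snippet below only shows how such parameters are set with Python's `ssl` module. It is not urllib3's or gallery-dl's actual code, and `make_context` and `summarize` are hypothetical helpers.

```python
import ssl


def make_context(min_version, ciphers=None):
    """Client-side TLS context with an explicit minimum protocol version."""
    context = ssl.create_default_context()
    context.minimum_version = min_version
    if ciphers is not None:
        # OpenSSL cipher string; this affects the cipher list offered for TLS 1.2 and older.
        context.set_ciphers(ciphers)
    return context


def summarize(context):
    """Collect the settings a server would see reflected in the handshake."""
    return {
        "minimum_version": context.minimum_version.name,
        "cipher_count": len(context.get_ciphers()),
    }


# Roughly the stricter behaviour described above: TLS 1.2 and newer only.
strict = make_context(ssl.TLSVersion.TLSv1_2)
```

Passing `ssl.TLSVersion.TLSv1` instead would approximate the older, more permissive range described above, although recent Python and OpenSSL builds may warn about or refuse such old protocol versions.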

@Butterfly-Dragon

oooh that's the problem!

I'm specifying that i'm chrome 88 as a user agent and giving cookies that come from chrome 88.

So if Urllib3 says it's firefox then they see chrome... if you ask me that's sus 🤣👍💖

@mikf
Owner

mikf commented Dec 3, 2020

No, urllib3 doesn't say it's a specific browser, gallery-dl by default is saying it is Firefox 68. When you've already changed gallery-dl's user agent and it still doesn't work with 1.26, we can discard my previous assumption ("Maybe changing gallery-dl's user agent ... to make it work with urllib3 1.26")

@Butterfly-Dragon

hmmm... i have not tried "not changing" it though.

@biznizz
Author

biznizz commented Dec 13, 2020

I'm going to assume that even with the current 1.16.0 version of gallery-dl, that it's still not safe to upgrade to the latest version of urllib3?

@Butterfly-Dragon

urllib3 has been stuck on version 1.26.2 since Nov. 12.

@biznizz
Author

biznizz commented Dec 17, 2020

Doing a quick update report:

I upgraded chardet and requests to their latest versions while keeping urllib3 at 1.25.11 and got errors. After downgrading them back to 3.0.4 and 2.24.0 respectively, the errors disappeared.

Gallery-dl is at the latest release version.

@AlphaSlayer1964

I believe I have tried everything that has been suggested but still getting the error. Can anyone explain all the versions they are using that makes it work?

@biznizz
Author

biznizz commented Dec 26, 2020

> I believe I have tried everything that has been suggested but still getting the error. Can anyone explain all the versions they are using that makes it work?

So I'm able to keep gallery-dl up to date with the current version 1.16.0.

To reduce my errors, I've had to keep the following packages at these versions.

chardet 3.0.4
requests 2.24.0
urllib3 1.25.11

Cloudflare may still occasionally cause errors with individual post URLs, but running a generic "/posts" URL on a user page should get everything without issue.

@biznizz
Author

biznizz commented Dec 31, 2020

Latest report: Every dependency can be upgraded to the latest version except for urllib3 which still causes Cloudflare issues when trying to download either individual or whole user pages from Patreon.

@Butterfly-Dragon

thanks! 👍💖

@Ogwalla

Ogwalla commented Jan 8, 2021

I am also having issues with Cloudflare.

@Sjoerd82

Yes, same here:

gallery-dl.exe -v --cookies "cookie.txt" https://www.patreon.com/posts/12345678
[gallery-dl][debug] Version 1.16.3
[gallery-dl][debug] Python 3.7.9 - Windows-10-10.0.18362
[gallery-dl][debug] requests 2.25.1 - urllib3 1.25.11
[gallery-dl][debug] Starting DownloadJob for 'https://www.patreon.com/posts/12345678'
[patreon][debug] Using PatreonPostExtractor for 'https://www.patreon.com/posts/12345678'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.patreon.com:443
[urllib3.connectionpool][debug] https://www.patreon.com:443 "GET /posts/12345678 HTTP/1.1" 403 None
[patreon][warning] Cloudflare CAPTCHA
[patreon][error] HttpError: '403 Forbidden' for 'https://www.patreon.com/posts/12345678'

I'm not sure if the packaged executable (gallery-dl.exe) uses any of the locally installed Python (perhaps someone can enlighten me on this). In any case I ran this on two machines (same result); one has Python 3.7.0 and the other has 3.9.1. Given that the verbose output (-v) reports Python 3.7.9, I guess not.
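
Editor's sketch: since the verbose log already names the bundled library versions, they can be pulled out of a saved log when comparing reports. The regular expression is written against the log lines quoted in this thread only; `library_versions` and `hit_cloudflare` are hypothetical helpers, and other gallery-dl versions may format these lines differently.

```python
import re

# Matches e.g. "[gallery-dl][debug] requests 2.25.1 - urllib3 1.25.11"
VERSION_LINE = re.compile(
    r"\[gallery-dl\]\[debug\] requests (?P<requests>\S+) - urllib3 (?P<urllib3>\S+)"
)


def library_versions(log_text):
    """Return {'requests': ..., 'urllib3': ...} from a verbose log, or None."""
    match = VERSION_LINE.search(log_text)
    return match.groupdict() if match else None


def hit_cloudflare(log_text):
    """True if the log contains the Cloudflare CAPTCHA warning shown above."""
    return "Cloudflare CAPTCHA" in log_text
```

Run against the log above, this reports requests 2.25.1 and urllib3 1.25.11 together with a CAPTCHA hit, so that run hit the block even with the older urllib3.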

@Hrxn
Contributor

Hrxn commented Jan 12, 2021

The packaged executable is fully self-contained, so no, it does not use any locally installed Python or something.

@biznizz
Author

biznizz commented Feb 17, 2021

Still on urllib3 1.25.11 and up to date on gallery-dl with 1.16.5, and there are still large amounts of time where trying to rip posts or users from Patreon results in Cloudflare CAPTCHA 403 Forbiddens.

Sometimes it'll work, other times errors.

mikf added a commit that referenced this issue Feb 26, 2021
- change default user agent to Firefox ESR 78 on Windows 10
- remove 'ciphers' option
@mikf
Owner

mikf commented Feb 26, 2021

So, there have been some changes in this regard and a new browser option got added (cf5fa75). Setting this option to "firefox" or "chrome" tells gallery-dl to use the same default HTTP headers and TLS ciphers as those two browsers, and this time it takes the local OS into account and it should work regardless of urllib3 version.

Please test this on Patreon user URLs and post URLs with and without this option enabled and let me know if this is more successful than it was before this change.

I do want to make "browser": "firefox" the default for Patreon, but wanted to get some feedback if it actually works first.
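
Editor's sketch: for anyone who prefers to script the change, the option can be merged into an existing config without touching other keys. The nesting under "extractor" and "patreon" follows the config excerpts posted in this thread; `set_browser` is a hypothetical helper, and the file name in the usage note is an assumption.

```python
import json


def set_browser(config, browser="firefox", site="patreon"):
    """Set the 'browser' option for one site, keeping all other settings."""
    extractor = config.setdefault("extractor", {})
    extractor.setdefault(site, {})["browser"] = browser
    return config


def update_config_file(path, browser="firefox", site="patreon"):
    """Load a JSON config, set the option, and write the file back."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    set_browser(config, browser, site)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=4)
```

`update_config_file("config.json")` would add `"browser": "firefox"` to the Patreon section; pass `browser="chrome"` to test the other value.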

@biznizz
Author

biznizz commented Feb 26, 2021

I've updated urllib3 to 1.26.3 and, after updating to the latest dev build of gallery-dl, ran several User URLs (the entirety of a Patreon timeline) and Post URLs (individual posts).

No Cloudflare issues so far, but will keep you updated.

For the record, my extractor settings in the config have this structure for Patreon:

"patreon": {
    "browser": "firefox",
    "cookies": {
        "session_id": "secret code"
    }
},

@flaccidbagel

flaccidbagel commented Feb 28, 2021

At least so far, trying out the browser option doesn't seem to be producing any different results, neither with firefox nor chrome; downloads are still failing with a 403.

Whatever changed was very recent, as I had just utilized the tool a few days ago.

@Butterfly-Dragon

i was able to download from patreon fine with the updated urllib3 and "chrome" setting

@biznizz
Author

biznizz commented Feb 28, 2021

Well, since things seem to mostly be resolved, I'm going to close this particular issue.

If anything pops up with Cloudflare and the new browser option, me or someone else will start a new ticket.

biznizz closed this as completed Feb 28, 2021

@Butterfly-Dragon

iiit's back.

apparently i have to re-login and even then it lasts exactly for 1 session.

@biznizz
Author

biznizz commented Mar 1, 2021

> iiit's back.
>
> apparently i have to re-login and even then it lasts exactly for 1 session.

? Re-log in? I mean, other than replacing the session_id cookie about once a month, what exactly is it saying to do?

I've also not had any errors (so far) when running multiple runs of User or Post urls.

Can you post what your Patreon extractor settings look like in your config?

@Butterfly-Dragon

sure thing. General part of the config:

{
    "extractor":
    {
        "base-directory": "D:/Downloads/Downloader/Patreon/",
        "directory": ["files"],
        "filename": "{filename}.{extension}",
        "archive": "D:/Downloads/Downloader//!Downloader/SQL/gallery-dl-patreon-archive.sqlite3",
        "cache":
        {
            "file": "D:/Downloads/Downloader//!Downloader/SQL/tmp/cache.sqlite3"
        },
        "skip": "abort:10",
        "retries": -1,
        "sleep": 0,
        "timeout": null,
        "postprocessors": null,
        "cookies": "D:/Downloads/Downloader//!Downloader/cookies.txt",
        "parent-directory": true,
        "adjust-extensions": true,
        "refresh-token": "cache",
        "browser": "chrome",
        "oauth":
        {
            "browser": true
        },

patreon specific:

		"patreon":
		{
			"directory": ["{creator[full_name]}"],
			"filename": "{creator[full_name]} {creator[vanity]} - {date} {title} {num:03} - {filename}.{extension}"
		},

ending:

    "downloader":
    {
        "part-directory": null,
        "rate": null,
        "retries": -1,
        "timeout": 30.0,
        "part": true,

        "http":
        {
            "mtime": true,
            "rate": null,
            "retries": -1,
            "timeout": 30.0,
            "verify": true
        },

        "ytdl":
        {
            "forward-cookies": true,
            "mtime": true,
            "rate": null,
            "retries": -1,
            "timeout": 30.0,
            "verify": true,
			"format": "bestvideo+bestaudio/best",
			"outtmpl": null
        }
    },

    "output":
    {
        "mode": "auto",
        "progress": true,
        "shorten": true,
        "log": {
            "level": "info",
            "format": {
                "debug"  : "\u001b [debug   {name}: {message}\u001b ]",
                "info"   : "\u001b [info    {name}: {message}\u001b ]",
                "warning": "\u001b [warning {name}: {message}\u001b ]",
                "error"  : "\u001b [error   {name}: {message}\u001b ]"
            }
        },
        "logfile": {
            "path": "D:/Downloads/Downloader/RoboLogPatreon.txt",
            "mode": "w",
            "level": "debug"
        }
    },

    "cache": {
        "file": "D:/Downloads/Downloader//!Downloader/SQL/cache.sqlite3"
    },

    "netrc": false
}

@biznizz
Author

biznizz commented Mar 1, 2021

@Butterfly-Dragon

How strange, you don't plug your session_id cookie directly into your config file? You have it read directly from your cookies.txt file in your postprocessors setting?

Try the settings I posted for my Patreon extractor, with "browser": "firefox" and the cookie plugged in directly, and see if that improves anything.

@Butterfly-Dragon

That is cookies from chrome.
I don't have firefox.

@biznizz
Author

biznizz commented Mar 1, 2021

I don't use Firefox either (I used Waterfox, a forked browser).

I'm thinking that the use of the browser setting is to emulate how the browser would get past the captcha. Either way, it couldn't hurt to try.

@rautamiekka
Contributor

> I don't use Firefox either (I used Waterfox, a forked browser).

fork = Waterfox == Firefox. It's not another browser simply by being a fork; forks are nearly always the same type as the original, that's why you mostly can use Firefox addons in Waterfox.

@Butterfly-Dragon

yeh, if i had cookies from microsoft edge (for... masochistic reasons, i guess) i would still need to set it as chrome, since it's a chromium browser.

biznizz reopened this Mar 1, 2021

@biznizz
Author

biznizz commented Mar 1, 2021

Well, either way, I'm sure that whatever browser you're using is moot since I'm sure that, as I said earlier, it's more about emulating browser behavior using those settings, with Chrome or Firefox settings.

I'm sure mikf can explain this. I'll reopen the issue since I started this ticket.

@mikf
Owner

mikf commented Mar 4, 2021

Well, I can only make assumptions. For one, the browser "emulation" isn't particularly good. gallery-dl through requests/urllib3 only uses HTTP/1.1, while all modern browsers use HTTP/2. Chrome also sends a lot of HTTP/2 specific headers (:authority, ..., sec-fetch-mode, ...) which don't get sent by gallery-dl.

Then again, both browser=firefox as well as browser=chrome work on my machine, while browser=null causes the usual 403 Forbidden. browser=firefox is actually more or less the same as before, except a slight bit better when it comes to accuracy in terms of HTTP headers and TLS handshake.

Did you at least try browser=firefox and can confirm it also doesn't work, @Butterfly-Dragon ?
Could you try it with the same setup as biznizz, i.e. only the session_id cookie for Patreon?
What about gallery-dl v1.16.5 with urllib3 1.25.11?
Can you connect to Patreon with cloudscraper, or does even that result in a CAPTCHA for you?

Also, as the others already said, cookie origin shouldn't matter here.
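
Editor's sketch: to make the header point concrete, HTTP/2 pseudo-headers (names starting with ":", such as :authority) have no HTTP/1.1 equivalent as header fields, whereas ordinary headers like sec-fetch-mode could in principle be sent over either protocol version. The header values below are illustrative only and are not a capture of real browser traffic; `split_pseudo_headers` is a hypothetical helper.

```python
def split_pseudo_headers(headers):
    """Separate HTTP/2 pseudo-headers (names starting with ':') from regular ones."""
    pseudo = {k: v for k, v in headers.items() if k.startswith(":")}
    regular = {k: v for k, v in headers.items() if not k.startswith(":")}
    return pseudo, regular


# Illustrative values only.
browser_like = {
    ":authority": "www.patreon.com",
    ":method": "GET",
    "user-agent": "example-agent",
    "sec-fetch-mode": "navigate",
}

pseudo, regular = split_pseudo_headers(browser_like)
```

Only the `regular` part could be reproduced by an HTTP/1.1 client; the `pseudo` part is one of the visible differences that remain until HTTP/2 is supported.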

@Butterfly-Dragon

i reverted to urllib3 1.25.11 this afternoon as a test without re-logging into patreon and re-exporting the cookies.

It downloaded fine.

Aside from the fact that any posts that were blocked from my subscribed tiers would reset the count of "skip", which ... i do not know if that was the desired effect.

But yeah, worked absolutely fine with the urllib3 downgrade.

without it i have to re-login and re-export the cookies daily.

@Hrxn
Contributor

Hrxn commented Mar 4, 2021

@mikf You probably already know this, but just in case: https://github.com/encode/httpx

This might be a viable alternative to move on from the requests/urllib3 combo.
I'm aware that this could require some deep rewrites of internals, but on the other hand, you've mentioned planned structural changes for a possible (maybe) future 2.0 version, if I recall that correctly. So, I don't know, but maybe this is something worth trying.

I'm not really in a position to judge here, so all I could do so far, given the presence of all these projects on this "social network" for software, was doing a bit of cross-referencing/researching/stalking the contributor page.
Seems good so far 👍 Looks like they are all involved in developing and maintaining important core projects of the Python ecosystem.

@biznizz
Author

biznizz commented Mar 4, 2021

I can confirm that I can still download User and individual Posts from Patreon just fine, with the settings I've posted earlier, with urllib3 still up-to-date. No Cloudflare issues with them at all since putting browser setting in the Patreon extractor.

I can't imagine why Butterfly would have to export their cookies every day, I've still only had to re-export them once a month due to them having a short lifespan.

@Butterfly-Dragon

using chrome cookies still does not work. :(

@mikf
Owner

mikf commented May 4, 2021

@Butterfly-Dragon
And they probably won't until we have HTTP/2 support.
Or maybe you have too many identifying cookies, if that makes sense.
Try manually specifying the session_id cookie for Patreon instead of using your massive cookies.txt for everything, combined with "browser": "firefox":

"patreon": {
    "cookies": {"session_id": "..."},
    "browser": "firefox"
}

@Butterfly-Dragon

		"patreon":
		{
			"browser": "firefox",
			"cookies":
			{
				"session_id": "[REDACTED]"
			},
			"directory": [".", "{creator[full_name]}"],
			"filename": "{creator[full_name]} {creator[vanity]} - {date} {title} {num:03} - {filename:L100/filename too long/}.{extension}"
		},

yeah, for now this is what those lines look like in my json
