Some hosts return 404/503/non-200 when links are checked #165
Comments
I've found something similar. I believe it's because this service is fronted by Cloudflare, which, not recognising the source of the request, serves a CAPTCHA page with a 403 instead of the resource. I guess the fix would be to make htmltest's requests look more like a real browser's, but that seems non-trivial.
I've done some testing on the URLs here, using htmltest unchanged and also configured with a curl user agent and with the Range header we add removed. Neither change altered the behaviour of the upstream hosts.
I can provide exact examples that work fine with curl but don't succeed with htmltest. This is reliably reproducible. What kind of logs/output would help you verify?
@arranf Just a list of the URLs you've found problematic. I've not pushed the branch yet, but I have been adding these to a unit test to help track them. I'm then planning on tweaking request params (as above, trying to pretend to be curl) to identify what's causing these requests to be blocked. I doubt we'll have this completely fixed for all hosts, but I'm hoping for an improvement.
This is a list copied from my
I found a couple more:
And also https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679: fine for curl, 500 for htmltest.
Describe the bug
Checks of external links to media resources hosted on Twitter, such as
https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg
report 404, although curl has no issues with the same URL.
Here is the error from htmltest:
To Reproduce
Steps to reproduce the behaviour:
https://pbs.twimg.com/media/EuF4GgyXUAEZ3j5?format=jpg
.htmltest.yml
bare config
Expected behaviour
An error is not reported since the resource is available.
Actual behaviour
404 is returned