Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/http (HttpCrawler)
Feature
It would be great if the status codes _parseResponse() throws on were configurable.
Motivation
Some sites we scrape return 500 (instead of 404) for no longer available product pages, giving us no ability to work with the response via the requestHandler.
Ideal solution or implementation, and any additional constraints
Either a way to provide an explicit list of codes that should throw an error; or a way to provide an ignore list of error codes that should not be treated as errors.
Alternative solutions or implementations
AFAIK, the only way at present to have responses with status codes >= 500 treated as normal responses is to tamper with the status code in a postNavigation() hook.
corford added the "feature" label (issues that represent new features or improvements to existing features) on Dec 8, 2022
corford changed the title from "Make error codes that _parseResponse() throws on configurable" to "Make the error codes _parseResponse() throws on configurable" on Dec 8, 2022
corford changed the title from "Make the error codes _parseResponse() throws on configurable" to "Make HttpCrawler error codes configurable" on Mar 22, 2023
This commit introduces two new optional properties to `CheerioCrawler`
and `HttpCrawler`, allowing for finer control over how HTTP error status
codes are handled:
1. `ignoreHttpErrorStatusCodes`: An array of HTTP response status codes
that should not be treated as errors, even if they would be by
default.
2. `additionalHttpErrorStatusCodes`: An array of extra HTTP response
status codes that should be treated as errors, in addition to the
defaults.
By default, only status codes >= 500 are treated as errors.
Together, these options let callers specify which HTTP response codes
are treated as errors and which are passed on to the request handler
during the crawling process.
Closes #1711
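To make the interaction of the two options concrete, here is a minimal sketch of the decision the commit message describes. It is not Crawlee's actual implementation: `isErrorStatus` and `StatusCodeOptions` are hypothetical names introduced for illustration, and the precedence when a code appears in both lists (the additional list wins) is an assumption rather than documented behavior.

```typescript
// Hypothetical option bag mirroring the two properties named in the commit message.
interface StatusCodeOptions {
    ignoreHttpErrorStatusCodes?: number[];
    additionalHttpErrorStatusCodes?: number[];
}

// Hypothetical helper, not part of Crawlee's API: decides whether a response
// status code should be treated as an error.
function isErrorStatus(statusCode: number, options: StatusCodeOptions = {}): boolean {
    const ignored = new Set(options.ignoreHttpErrorStatusCodes ?? []);
    const additional = new Set(options.additionalHttpErrorStatusCodes ?? []);

    // Assumption: an explicitly added code is an error even if it is also ignored.
    if (additional.has(statusCode)) return true;

    // Default rule from the commit message: codes >= 500 are errors unless ignored.
    return statusCode >= 500 && !ignored.has(statusCode);
}

// The use case from the issue: a site answers 500 for removed product pages,
// so 500 is ignored and the response can reach the request handler.
const removedProductPage = isErrorStatus(500, { ignoreHttpErrorStatusCodes: [500] });

// The opposite case: treat 403 as an error although it is below 500.
const blockedPage = isErrorStatus(403, { additionalHttpErrorStatusCodes: [403] });

console.log({ removedProductPage, blockedPage });
```

Keeping both an ignore list and an additional list covers the two alternatives proposed in the issue without requiring callers to restate the full default range.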