This is the first and largest step in a complete overhaul of thresher.
The aim is to support the current and near-future needs of
scraperJSON, based on a revisited design that incorporates a lot of
user feedback.
Major changes:
- all scraping functionality has been moved to the Scraper class
- the Thresher class now only handles selecting a scraper by URL, and running it
- ScraperBox class holds a collection of scrapers and can match them to URLs
- all logging has been removed and the entire module now operates using events
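
The division of labour described above can be sketched roughly as follows. This is a minimal illustration in JavaScript for Node.js, not the module's actual API: the method names (`addScraper`, `getScraper`, `scrape`), the event names (`scraperSelected`, `noScraper`) and the shape of the scraper definition are all assumptions made for the example. The only real library call is Node's built-in `EventEmitter`.

```javascript
// Hypothetical sketch: names and signatures are illustrative assumptions,
// not thresher's actual API.
const { EventEmitter } = require('events');

// Holds a collection of scraper definitions and matches them to URLs.
class ScraperBox {
  constructor() {
    this.scrapers = [];
  }

  // `definition.url` is assumed to be a regular-expression string.
  addScraper(definition) {
    this.scrapers.push({ pattern: new RegExp(definition.url), definition });
  }

  // Return the first definition whose pattern matches the URL, or null.
  getScraper(url) {
    const hit = this.scrapers.find((s) => s.pattern.test(url));
    return hit ? hit.definition : null;
  }
}

// Selects a scraper by URL and reports what happened through events
// instead of writing log output.
class Thresher extends EventEmitter {
  constructor(box) {
    super();
    this.box = box;
  }

  scrape(url) {
    const definition = this.box.getScraper(url);
    if (!definition) {
      this.emit('noScraper', url);
      return null;
    }
    this.emit('scraperSelected', definition.name, url);
    // A real implementation would hand off to a Scraper instance here.
    return definition;
  }
}

const box = new ScraperBox();
box.addScraper({ name: 'example', url: 'example\\.org/articles/' });

const thresher = new Thresher(box);
const events = [];
thresher.on('scraperSelected', (name, url) => events.push(['scraperSelected', name, url]));
thresher.on('noScraper', (url) => events.push(['noScraper', url]));

thresher.scrape('http://example.org/articles/1');
thresher.scrape('http://other.test/');
```

With no logging inside the module, the caller decides what to do with each event (print it, count it, ignore it), which is the main benefit of the event-based design.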
scraperJSON features implemented:
- elements can be nested (fixes #2 and ContentMine/scraperJSON#3)
- elements can depend on 'following' the captured URLs from other elements (fixes #6)
- URLs are resolved (and all redirects followed) before scraping (fixes #10)
- headless pre-rendering is no longer the default (for a massive speed/efficiency increase)
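
To make nesting and 'following' more concrete, here is a small sketch in JavaScript. The definition below is hypothetical: the field names (`selector`, `attribute`, `elements`, `follow`), the element names and the XPath expressions are assumptions chosen for illustration and may not match the scraperJSON schema exactly. The helper `listElements` is likewise invented for this example; it only walks the definition and does no scraping.

```javascript
// Hypothetical scraper definition: field names and values are assumptions
// for illustration, not taken from the scraperJSON specification.
const definition = {
  url: 'example\\.org/articles/',
  elements: {
    // A parent element with nested child elements.
    author: {
      selector: '//div[@class="author"]',
      elements: {
        name: { selector: './/span[@class="name"]' },
        email: { selector: './/a', attribute: 'href' }
      }
    },
    // An element that captures a URL...
    fulltext_url: { selector: '//a[@id="fulltext"]', attribute: 'href' },
    // ...and one that depends on following that captured URL.
    fulltext_title: { selector: '//h1', follow: 'fulltext_url' }
  }
};

// Walk the element tree depth-first, returning one entry per element with
// its dotted path and the name of the element it follows (or null).
function listElements(elements, prefix = '') {
  const out = [];
  for (const [key, element] of Object.entries(elements)) {
    const path = prefix ? `${prefix}.${key}` : key;
    out.push({ path, follow: element.follow || null });
    if (element.elements) {
      out.push(...listElements(element.elements, path));
    }
  }
  return out;
}

const listed = listElements(definition.elements);
const paths = listed.map((e) => e.path);
const followers = listed.filter((e) => e.follow !== null);
```

A scraper built this way would need to capture `fulltext_url` first, because `fulltext_title` can only be extracted from the page that URL points to.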
See ContentMine/thresher#2