feature: ability to nest elements #3

blahah · 2014-07-11T12:30:39Z

This is the first and most significant step in a complete overhaul of thresher. Its purpose is to support the current and near-future needs of scraperJSON, based on revisiting the design and incorporating a lot of user feedback.

Major changes:

- all scraping functionality has been moved to the Scraper class
- the Thresher class now only handles selecting a scraper by URL and running it
- the ScraperBox class holds a collection of scrapers and can match them to URLs
- all logging has been removed and the entire module now operates using events

scraperJSON features implemented:

- elements can be nested (fixes #2 and ContentMine/scraperJSON#3)
- elements can depend on 'following' the captured URLs from other elements (fixes #6)
- URLs are resolved (and all redirects followed) before scraping (fixes #10)
- headless pre-rendering is no longer the default (for a massive speed/efficiency increase)
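To make the division of responsibilities above concrete, here is a minimal sketch of how the three classes and nested elements could fit together. It is not the code in this PR: the method names (`matches`, `getScraper`, `scrapeElements`), the event names (`scraperSelected`, `noScraper`, `elementCaptured`, `scrapeResults`), and the caller-supplied `select` function are all assumptions made for illustration. The page is a plain object and selectors are plain keys, standing in for a real DOM and real selectors.

```javascript
// Hypothetical sketch only: names and signatures are illustrative,
// not thresher's actual API.
const { EventEmitter } = require('events');

// Scraper: owns all scraping logic for one scraper definition.
class Scraper extends EventEmitter {
  constructor(definition) {
    super();
    this.name = definition.name;
    this.pattern = new RegExp(definition.url);
    this.elements = definition.elements;
  }

  matches(url) {
    return this.pattern.test(url);
  }

  // `select(selector, context)` is supplied by the caller and must
  // return an array of matched nodes within `context`.
  scrape(url, root, select) {
    const results = this.scrapeElements(this.elements, root, select);
    this.emit('scrapeResults', { url, results });
    return results;
  }

  // Nested elements: a child element is evaluated relative to each
  // node captured by its parent.
  scrapeElements(elements, context, select) {
    const out = {};
    for (const [key, element] of Object.entries(elements)) {
      out[key] = select(element.selector, context).map((node) => {
        this.emit('elementCaptured', key);
        return element.elements
          ? this.scrapeElements(element.elements, node, select)
          : node;
      });
    }
    return out;
  }
}

// ScraperBox: holds a collection of scrapers and matches them to URLs.
class ScraperBox {
  constructor() {
    this.scrapers = [];
  }

  add(definition) {
    const scraper = new Scraper(definition);
    this.scrapers.push(scraper);
    return scraper;
  }

  getScraper(url) {
    return this.scrapers.find((s) => s.matches(url)) || null;
  }
}

// Thresher: only selects a scraper by URL and runs it. It reports
// progress through events instead of logging.
class Thresher extends EventEmitter {
  constructor(box) {
    super();
    this.box = box;
  }

  scrape(url, root, select) {
    const scraper = this.box.getScraper(url);
    if (!scraper) {
      this.emit('noScraper', url);
      return null;
    }
    this.emit('scraperSelected', scraper.name);
    return scraper.scrape(url, root, select);
  }
}

// Usage with a toy page: `author` nests a `name` element.
const box = new ScraperBox();
box.add({
  name: 'example',
  url: 'example\\.org',
  elements: {
    title: { selector: 'title' },
    author: { selector: 'authors', elements: { name: { selector: 'name' } } },
  },
});

const page = {
  title: ['A paper'],
  authors: [{ name: ['Ada'] }, { name: ['Grace'] }],
};
const select = (selector, context) => context[selector] || [];

const events = [];
const thresher = new Thresher(box);
thresher.on('scraperSelected', (name) => events.push(name));
const results = thresher.scrape('http://example.org/paper/1', page, select);
```

Because the module emits events rather than logging, a caller decides what to print or record by attaching listeners, as the `scraperSelected` listener does here.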

blahah added the enhancement label Jul 11, 2014

blahah changed the title ~~feature: ability to next elements~~ feature: ability to nest elements Jul 11, 2014

blahah mentioned this issue Sep 7, 2014

Overhaul ContentMine/thresher#12

Merged
