Crawling Articles from Newspapers
- get_rss_providers method from DatabaseHandler class
- enter list of RSS sources in DB
- crawl RSS feeds of the given sources
- persist a <uri (PK), title, source> tuple in DB
- [do a prefiltering]
- crawl URIs and fetch articles
- extract articles
- persist article body in DB
do it all in parallel somehow
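The steps above could be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the project's actual implementation: the `articles` table layout and the function names `parse_rss`, `init_db`, `persist_items` and `crawl_bodies` are assumptions made for illustration, SQLite stands in for whatever database the project uses, and the optional prefiltering step is left out. The function that downloads and extracts a single article is passed in as `fetch_body`, so the parallel part stays independent of the extraction library.

```python
import sqlite3
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def parse_rss(xml_text, source):
    """Return (uri, title, source) tuples for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        uri = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if uri:  # skip items without a link, since the URI is the primary key
            rows.append((uri, title, source))
    return rows


def init_db(conn):
    # Hypothetical schema: uri is the primary key, body is filled in later.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "uri TEXT PRIMARY KEY, title TEXT, source TEXT, body TEXT)"
    )


def persist_items(conn, rows):
    # INSERT OR IGNORE keeps the first copy when a URI shows up more than once.
    conn.executemany(
        "INSERT OR IGNORE INTO articles (uri, title, source) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def crawl_bodies(conn, fetch_body, max_workers=8):
    """Fetch all missing article bodies in parallel and store them."""
    uris = [row[0] for row in conn.execute(
        "SELECT uri FROM articles WHERE body IS NULL ORDER BY uri")]
    # Only the network-bound fetches run in worker threads; all database
    # writes stay on the calling thread.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = list(pool.map(fetch_body, uris))
    conn.executemany(
        "UPDATE articles SET body = ? WHERE uri = ?", list(zip(bodies, uris)))
    conn.commit()
    return len(uris)
```

With the newspaper library, `fetch_body` could wrap an `Article(uri)` object, call its `download()` and `parse()` methods and return its `text` attribute. Threads are chosen here because fetching is network-bound; keeping the database writes on one thread avoids sharing a connection between workers.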
- the project interpreter is Python 3
- try to follow the PEP 8 style conventions
- make sure your IDE uses the .editorconfig
install with pip3 install -r requirements.txt
- Git: https://github.com/codelucas/newspaper
- Walkthrough: Newspaper Crawling.ipynb (Jupyter/IPython Notebook)
- Adding a new source: https://github.com/codelucas/newspaper/blob/master/docs/user_guide/advanced.rst