Scrapy spiders that crawl financial text data for training word vectors.
bloomberg
- Bloomberg news articles
investopedia
- Definitions of finance terms from investopedia.com
wikipedia
- Finance pages from Wikipedia: all wiki pages reachable from https://en.wikipedia.org/wiki/Outline_of_finance within at most 2 hops.
qplum
- Investment articles from https://www.qplum.co/investing-library
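To make the "at most 2 hops" rule for the wikipedia source concrete, here is a minimal sketch of a hop-limited breadth-first traversal over a toy link graph. The function name, the page names, and the graph are hypothetical and are not taken from the wikipedia spider itself. In Scrapy, a similar bound can be set with the DEPTH_LIMIT setting, though this README does not say which mechanism the spider uses.

```python
from collections import deque


def pages_within_hops(start, links, max_hops=2):
    """Return {page: hops} for every page reachable from start in at most max_hops hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # pages at the hop limit are kept, but their links are not followed
        for target in links.get(page, []):
            if target not in hops:  # first visit is the shortest path in a BFS
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


# Toy link graph with hypothetical page names (not real crawl data).
toy_links = {
    "Outline_of_finance": ["Bond", "Stock"],
    "Bond": ["Yield"],
    "Yield": ["Yield_curve"],
    "Stock": ["Bond"],
}

reachable = pages_within_hops("Outline_of_finance", toy_links, max_hops=2)
print(sorted(reachable))
```

In this toy graph, Yield_curve is three hops from the start page, so it is left out of the result.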
- Install scrapy.
pip3 install scrapy
- Run the scrapy crawl command for the spider you want, for example:
(py3) hardik@shire:~/scrapy-finance$ scrapy crawl bloomberg
Please look at the specific spider files, such as wikipedia.py. They are relatively easy to follow and modify.
.
├── LICENSE
├── README.md
├── scrapy.cfg
└── text
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── bloomberg.py
        ├── __init__.py
        ├── investopedia.py
        ├── qplum.py
        └── wikipedia.py
- All spiders currently write the text data in lowercase.
- The code has not been tested with Python 2.
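As an illustration of the lowercase note above, the following sketch shows one way lowercasing could be done in a Scrapy-style item pipeline. The class name LowercaseTextPipeline and the "text" field are assumptions made for this example. The README does not say where in the project the lowercasing actually happens, so this is not necessarily how the spiders do it.

```python
class LowercaseTextPipeline:
    """Hypothetical item pipeline that lowercases the 'text' field of each item."""

    def process_item(self, item, spider):
        # 'text' is an assumed field name; items without it pass through unchanged.
        text = item.get("text")
        if isinstance(text, str):
            item["text"] = text.lower()
        return item


# Usage with a plain dict standing in for a scraped item.
pipeline = LowercaseTextPipeline()
item = pipeline.process_item({"text": "Bond Yields ROSE"}, spider=None)
print(item["text"])
```

If case needs to be preserved for training, the equivalent lowercasing step in the spiders would be the place to change.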
Please feel free to submit a pull request to add relevant spiders.
MIT