last updated: October 23, 2020
This repository contains all the files for sending Hong Kong news articles on Wisenews to CSRP colleagues. The class WisenewsScraper also contains a routine to save scraped articles to a local MongoDB collection.
This is a standalone version of the Wisenews scraper. For an integrated Django solution, see the openup-triage-server repository.
- a working Python virtual environment - follow the Python or Conda documentation to set one up if you haven't already done so
- Selenium >= 3.141.0
- openpyxl >= 2.6.3
- pymongo >= 3.8.0
- jupyter-core >= 4.5 and associated packages, if running the Jupyter Notebook file
- Google Chrome - https://www.google.com/chrome/ - for the Selenium controller
- ChromeDriver - https://chromedriver.chromium.org/ - select a version to match Google Chrome
- MongoDB >= 4.0
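To confirm that the Python packages listed above meet their minimum versions, a small check along these lines may help. It is only a sketch: it assumes Python 3.8 or later (for importlib.metadata), the helper names parse_version and check_requirements are made up for illustration, and it does not cover Google Chrome, ChromeDriver, or MongoDB.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements list above.
MINIMUM_VERSIONS = {
    "selenium": "3.141.0",
    "openpyxl": "2.6.3",
    "pymongo": "3.8.0",
    "jupyter-core": "4.5",
}


def parse_version(text):
    """Turn '3.141.0' into (3, 141, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS, get_version=version):
    """Map each package name to 'ok', 'too old (<installed>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if parse_version(installed) >= parse_version(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Run it inside the virtual environment you intend to use for the scraper, so that it reports on the packages installed there.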
- Open wisenews.py with a text editor - recommended ones are vim, sublime, or notepad++
  - modify the global variable WISENEWS_NEWS_SECTIONS to select a subset of news articles to download from Wisenews
  - modify tuples in the enum Keywords to tailor keywords for searching news articles
  - change the chromedriver path accordingly
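As a rough picture of the kind of settings involved, the sketch below shows a section list, an enum whose members hold tuples of search terms, and a driver path. Only the names WISENEWS_NEWS_SECTIONS and Keywords come from this README; every value, the member names, CHROMEDRIVER_PATH, and the helper all_search_terms are placeholders for illustration, and the actual definitions in wisenews.py may look different.

```python
from enum import Enum

# Placeholder section names; replace with the sections defined in wisenews.py.
WISENEWS_NEWS_SECTIONS = ["SECTION_NAME_1", "SECTION_NAME_2"]

# Hypothetical variable name and path; point it at your local ChromeDriver binary.
CHROMEDRIVER_PATH = "/path/to/chromedriver"


class Keywords(Enum):
    """Each member's value is a tuple of search terms (placeholder values)."""
    TOPIC_A = ("keyword one", "keyword two")
    # A single-term tuple needs the trailing comma, otherwise it is just a string.
    TOPIC_B = ("keyword three",)


def all_search_terms(keywords=Keywords):
    """Flatten every tuple in the enum into one ordered list without duplicates."""
    seen = []
    for member in keywords:
        for term in member.value:
            if term not in seen:
                seen.append(term)
    return seen


if __name__ == "__main__":
    print(all_search_terms())
```

When editing the tuples, keep the trailing comma on single-keyword entries so that each value stays a tuple.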
- If this is your first time running this scraper, create a .env file as follows:
cat env_template > .env # this will create a new environment file called .env using env_template as base
Open your .env file and edit the credentials accordingly:
# Wisenews Login
HKU_LOGIN='HKU_PID_HERE' # your HKU login. Must be a real one.
HKU_PASSWORD='HKU_PASSWORD_HERE' # your HKU password. Must be a real one.
SENDER='SENDER_NAME_HERE' # e.g. Byron
FROM_EMAIL='SENDER EMAIL HERE' # e.g. byron@csrp.hku.hk
TO_EMAIL='RECEPIENT EMAIL HERE' # e.g. staff@csrp.hku.hk
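Before running the scraper, it can be useful to check that no template placeholder is left in .env. The following is a minimal sketch that parses the file format shown above with the standard library only. The functions parse_env and missing_or_placeholder are hypothetical helpers, not part of wisenews.py, and the "ends with HERE" test is just a heuristic based on the placeholder values in the template.

```python
REQUIRED_KEYS = ("HKU_LOGIN", "HKU_PASSWORD", "SENDER", "FROM_EMAIL", "TO_EMAIL")


def parse_env(text):
    """Parse lines of the form KEY='value' # comment into a dict."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, full-line comments, and anything without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, rest = line.partition("=")
        rest = rest.strip()
        if rest[:1] in ("'", '"'):
            # Quoted value: take everything up to the matching closing quote.
            quote = rest[0]
            end = rest.find(quote, 1)
            value = rest[1:end] if end != -1 else rest[1:]
        else:
            # Unquoted value: drop any trailing comment.
            value = rest.split("#", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_or_placeholder(values):
    """Return required keys that are absent, empty, or still end with 'HERE'."""
    return [
        key for key in REQUIRED_KEYS
        if not values.get(key) or values[key].endswith("HERE")
    ]


if __name__ == "__main__":
    try:
        with open(".env", encoding="utf-8") as handle:
            problems = missing_or_placeholder(parse_env(handle.read()))
        print("Needs editing:", problems if problems else "nothing")
    except FileNotFoundError:
        print("No .env file found in the current directory.")
```

An empty result means every required key has been given a value of its own; it does not verify that the credentials are valid.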
- Activate the Python virtual environment.
- Either: a) in Jupyter Notebook, run the notebook file Wisenews.ipynb, or b) enter python wisenews.py in bash.
For full details on usage, see the main function of wisenews.py and the Jupyter notebook Wisenews.ipynb.