An efficient command-line Python library that scrapes webpages and downloads the relevant pages as PDFs. It is optimised for speed and is useful for people who lack high-speed internet access and need webpages for offline reading.
$ pip install bigheads
Note that bigheads depends on bs4, httplib2, wkhtmltopdf and pdfkit.
IOError: No wkhtmltopdf executable found:
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
Visit the URL above for the installation steps for your operating system.
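The error above means pdfkit could not locate the wkhtmltopdf binary. A minimal sketch for checking this before a run, assuming the binary is expected on `PATH` (the helper name `find_wkhtmltopdf` is hypothetical and not part of bigheads):

```python
import shutil


def find_wkhtmltopdf(executable="wkhtmltopdf"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def describe_wkhtmltopdf(executable="wkhtmltopdf"):
    """Return a short human-readable status message for the executable."""
    path = find_wkhtmltopdf(executable)
    if path is None:
        return f"{executable} not found on PATH; install it before running bigheads."
    return f"{executable} found at {path}"
```

If the binary is installed outside `PATH`, pdfkit can be pointed at it explicitly through `pdfkit.configuration(wkhtmltopdf=...)`; whether bigheads exposes a way to pass that configuration is not stated here.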
$ bigheads URL [OPTIONS]
-t, --tags [TAG1 TAG2 ...] Space-delimited list of tags for which web articles will be scraped.
-d, --directoryPath DIR_PATH Path of the directory where PDF files are to be saved.
-l, --limit LIMIT Limit on the number of articles to be scraped.
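For readers who want to see how an interface like this can be declared, here is a rough sketch using `argparse`. It only mirrors the options listed above; it is an illustration, not the actual bigheads source, and the defaults shown are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented bigheads options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="bigheads",
        description="Scrape webpages from a seed URL and save them as PDFs.",
    )
    parser.add_argument("url", metavar="URL", help="Seed URL to start scraping from.")
    parser.add_argument(
        "-t", "--tags", nargs="*", default=[], metavar="TAG",
        help="Space-delimited list of tags that articles must match.",
    )
    parser.add_argument(
        "-d", "--directoryPath", metavar="DIR_PATH", default=None,
        help="Directory where PDF files are saved (assumed default: None).",
    )
    parser.add_argument(
        "-l", "--limit", type=int, default=None, metavar="LIMIT",
        help="Maximum number of articles to scrape (assumed default: None).",
    )
    return parser
```

Calling `build_parser().parse_args([...])` with the arguments from a command line such as `URL --tags amazon microsoft -d queue -l 100` yields a namespace with `url`, `tags`, `directoryPath` and `limit` attributes.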
Recursively scrape all URLs from the provided seed URL and save the PDFs to a default folder.
$ bigheads https://www.geeksforgeeks.org/tag/queue
Recursively scrape URLs from the provided seed URL ensuring that the articles match the provided tags.
$ bigheads https://www.geeksforgeeks.org/tag/queue --tags amazon microsoft
Recursively scrape URLs from the provided seed URL, ensuring that the articles match the provided tags, and store the PDFs at the provided path.
$ bigheads https://www.geeksforgeeks.org/tag/queue --tags amazon microsoft -d queue
Recursively scrape URLs from the provided seed URL, ensuring that the articles match the provided tags and that the provided limit is not exceeded. The limit applies to the number of articles whose PDFs are downloaded to the device.
$ bigheads https://www.geeksforgeeks.org/tag/queue --tags amazon microsoft -d queue -l 100
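The combination of tags and a limit in the examples above can be pictured with a small sketch. It assumes that "matching a tag" means a case-insensitive substring check on the article text, which is a guess at the naive approach rather than the confirmed bigheads logic; `matches_all_tags` and `select_articles` are hypothetical names.

```python
def matches_all_tags(text, tags):
    """Return True if every tag occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return all(tag.lower() in lowered for tag in tags)


def select_articles(articles, tags, limit):
    """Pick URLs of articles matching all tags, stopping once `limit` is reached.

    `articles` is an iterable of (url, text) pairs.
    """
    selected = []
    for url, text in articles:
        if len(selected) >= limit:
            break
        if matches_all_tags(text, tags):
            selected.append(url)
    return selected
```

With an empty tag list every article qualifies, so only the limit restricts the result.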
- By default, a new folder called "pdfs_" will be created in the working directory, containing all the downloaded PDFs.
- We have a limit of 300 articles that can be scraped in one run. This limit can be increased by parallelising the scraping tasks.
- If tags are provided, only articles that match all of them are scraped and downloaded as PDFs.
- Many irrelevant (noise) files may be downloaded because relevance is computed by naive tag matching, which is not very consistent.
- For a large number of recursive URLs, the current PDF conversion routine is slow. There is scope to parallelise these tasks in batches.
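One possible shape for the batching idea mentioned above is sketched here with a thread pool. This is not how bigheads currently works; `chunked` and `convert_in_batches` are hypothetical helpers, and the conversion step is passed in as a function so the sketch stays independent of pdfkit.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_in_batches(urls, convert, batch_size=10, workers=4):
    """Apply `convert` to every URL, one batch at a time, using worker threads.

    Results are returned in the same order as `urls`.
    """
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in chunked(urls, batch_size):
            # pool.map preserves input order within each batch.
            results.extend(pool.map(convert, batch))
    return results
```

In practice `convert` could wrap a call such as `pdfkit.from_url(url, output_path)`. Threads are a reasonable fit because the conversion work happens in the external wkhtmltopdf process rather than in Python itself.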
If you want to add features, improve them, or report issues, feel free to send a pull request or leave a comment.
bigheads is a tool that makes it easier for a user to read webpages without internet access. It should not be misused or used for purposes other than education or research. The user agrees to USE IT AT THEIR OWN RISK.
Answer - Well, I wrote this program in 2015 and published the initial version to PyPI in 2016. In 2020, when I was revisiting this code, I had no clue why I had named this package 'bigheads', because the name had no connection to what the code does. It took me about 30 minutes to recall that the name is based on the famous character 'Bighead' from the TV series Silicon Valley.
I never thought I would reference a TV series character in my work. But at that time, the show had a strong influence on me and instilled a love for Computer Science, Entrepreneurship and Comedy.