Scraper is a terminal web-scraper, which does content from url more readable.
- Scraper gets html content from url
- Clears it from tags, unnecessary garbage and so on
- Formats the content, according set format (currently formatting supported only for links)
- Saves the content to file with path, which gotten from url (for instance, https://microsoft.com/ -> ".\microsoft.com\index.txt")
You can also:
- Set content line length
- Get logs from stdout
Scraper uses the follow libraries:
- json - to load config file from json
- argparse - to get arguments from terminal to run
- requests - to get html-content from url
- re - to search tags, links and expressions in html-content
- pathlib - to save file in the path from url
To run scraper using terminal, it is necessary to have Python with 3.7+ version. When it will be installed, you can run scraper.
Sample:
python main.py --url "https://habr.com/ru/post/446816/"
or
python main.py -u "https://habr.com/ru/post/446816/"
- Write Tests
- Add support for relative links
- Extend support for html tags
- Add pagination for scraper.py module
- Refactor/extend the content formatting logic
- Add a more informative logger