GitHub - alecxe/scrapy-beautifulsoup: Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup

scrapy-beautifulsoup

Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup

Installation

The package is on PyPI and can be installed with pip:

pip install scrapy-beautifulsoup

Configuration

Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary setting:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}

By default, BeautifulSoup uses the built-in html.parser parser. To change it, set the BEAUTIFULSOUP_PARSER setting:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is an extremely lenient parser and, if the target HTML is seriously broken, you might consider making it your first choice. Note: html5lib has to be installed in this case:

pip install html5lib
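
To make the configuration above more concrete, here is a minimal sketch of how a downloader middleware of this kind could read BEAUTIFULSOUP_PARSER and rewrite each response. It is not the package's actual source: the class name SketchBeautifulSoupMiddleware and the choice to skip non-text responses are assumptions made for illustration. Only standard Scrapy hooks (from_crawler, process_response, response.replace) and the BeautifulSoup constructor are used.

```python
from bs4 import BeautifulSoup


class SketchBeautifulSoupMiddleware:
    """Illustrative downloader middleware (hypothetical, not the package's code)."""

    def __init__(self, parser="html.parser"):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Read the parser name from the project settings,
        # falling back to the built-in html.parser.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Assumption: responses without a text body (images, archives, ...)
        # are passed through untouched.
        if not hasattr(response, "text"):
            return response
        soup = BeautifulSoup(response.text, self.parser)
        # Re-serialize the repaired tree and swap it in as the new body.
        return response.replace(body=str(soup).encode(response.encoding))
```

Because process_response runs before the response is handed to the spider, selectors in the spider would then operate on the repaired markup rather than on the original bytes.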

Motivation

BeautifulSoup itself, with the help of an underlying parser of choice, does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to "fix" it.
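
As a small illustration of what "fixing" means here, the sketch below parses a fragment with unclosed tags and serializes it back. The helper name repair_html and the sample fragment are hypothetical; the exact output can differ between parsers, so treat it as a demonstration rather than a specification.

```python
from bs4 import BeautifulSoup


def repair_html(html, parser="html.parser"):
    """Parse possibly broken HTML and return it re-serialized (hypothetical helper)."""
    return str(BeautifulSoup(html, parser))


broken = "<p>Hello <b>world"  # both <p> and <b> are left unclosed
fixed = repair_html(broken)
print(fixed)  # the serialized tree now contains closing </b> and </p> tags
```

Passing "lxml" or "html5lib" as the parser argument works the same way, provided the corresponding package is installed.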
