A few utilities for BeautifulSoup.
Required: BeautifulSoup, toolz, utils.
As in the BeautifulSoup documentation, the following extract from Alice in Wonderland is used as the example throughout:
>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>> soup = BeautifulSoup(html_doc, 'lxml')
print_tags: print one or more tags, excluding any nested content (alias pt).
>> pt(soup.find_all("p"))
<p class="title">
<p class="story">
<p class="story">
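One way to render a tag without its nested content is to rebuild the opening tag from its name and attributes. The following is only a minimal sketch of that idea, not this module's implementation: the helper name opening_tag is hypothetical, attribute values are not escaped, and the html.parser backend is used to avoid depending on lxml.

```python
from bs4 import BeautifulSoup


def opening_tag(tag):
    """Return the opening tag of `tag` as a string, without nested content.

    Hypothetical helper for illustration; attribute values are not escaped.
    """
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # bs4 stores multi-valued attributes (e.g. class) as lists.
        if isinstance(value, list):
            value = " ".join(value)
        parts.append(f'{key}="{value}"')
    return "<" + " ".join(parts) + ">"


soup = BeautifulSoup(
    '<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser"
)
print(opening_tag(soup.p))  # <p class="title">
```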
print_path: print the path from the root down to a tag (alias pp).
>> pp(soup.find(id="link1"))
<html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
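The path to a tag can be recovered by walking its .parents generator and reversing the result. The sketch below collects tag names only; path_names is a hypothetical helper, and how print_path itself is implemented is not stated here.

```python
from bs4 import BeautifulSoup


def path_names(tag):
    """Return the tag names from the document root down to `tag`.

    Hypothetical helper for illustration only.
    """
    # .parents walks upward and ends at the BeautifulSoup object itself,
    # whose name is "[document]"; skip it and reverse to get root-first order.
    ancestors = [p.name for p in tag.parents if p.name != "[document]"]
    return ancestors[::-1] + [tag.name]


soup = BeautifulSoup(
    '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>',
    "html.parser",
)
print(" > ".join(path_names(soup.find(id="link1"))))  # html > body > p > a
```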
find_tags: apply a sequence of find methods to a collection of tags, feeding the results of each method into the next. Each method must map a tag to one or more tags. The end result may contain duplicates, which can be removed using remove_duplicate_tags. For convenience, the following partial methods are defined: all_, next_, prev_, parents_, next_siblings_, prev_siblings_, select_, exclude_, restrict_. These take the same parameters as, respectively, find_all, find_all_next, find_all_previous, find_parents, find_next_siblings, find_previous_siblings, select, exclude_tags and restrict_tags.
>> find_tags(soup, "p", all_(href=True), next_siblings_(limit=1))
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
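The chaining idea can be sketched with operator.methodcaller: each step is a callable that maps one tag to a list of tags, and the results are flattened before the next step runs. The names chain_finds, all_sketch and next_siblings_sketch are hypothetical stand-ins, and this is an assumption about how such a pipeline could work rather than the module's actual code.

```python
from operator import methodcaller

from bs4 import BeautifulSoup


def chain_finds(tags, *steps):
    """Apply each step to every tag in turn, flattening the results."""
    for step in steps:
        tags = [found for tag in tags for found in step(tag)]
    return tags


def all_sketch(*args, **kwargs):
    # Hypothetical partial: defers a find_all call until a tag is supplied.
    return methodcaller("find_all", *args, **kwargs)


def next_siblings_sketch(*args, **kwargs):
    # Hypothetical partial: defers a find_next_siblings call.
    return methodcaller("find_next_siblings", *args, **kwargs)


html = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
result = chain_finds(
    [soup],
    all_sketch("p"),
    all_sketch(id=True),
    next_siblings_sketch("a", limit=1),
)
print([tag["id"] for tag in result])  # ['link2', 'link3']
```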
find_tag: same as find_tags but returns just the first result (or None).
>> find_tag(soup, "a", parents_(limit=1))
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
re_exclude: a negated regex filter.
>> find_tags(soup, all_("a", string=re_exclude("E.*e")))
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
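BeautifulSoup accepts a function as a string filter, so a negated regex can be sketched as a closure that returns True only when the pattern does not match. The name negated_regex is hypothetical, and the use of search rather than match is an assumption; re_exclude may differ.

```python
import re

from bs4 import BeautifulSoup


def negated_regex(pattern):
    """Return a filter that accepts strings NOT matching `pattern`."""
    compiled = re.compile(pattern)

    def keep(string):
        # A tag without a single string may be tested with None; reject it.
        return string is not None and compiled.search(string) is None

    return keep


soup = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>',
    "html.parser",
)
kept = soup.find_all("a", string=negated_regex("E.*e"))
print([a["id"] for a in kept])  # ['link2', 'link3']
```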
is_parent/is_child/is_ancestor/is_descendent/is_before/is_after: comparisons between two tags. Useful as the relation in the exclusions and restrictions below.
>> is_parent(soup.find(class_="story"), soup.find(id="link1"))
True
>> is_before(soup.find(class_="title"), soup.find(id="link1"))
True
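Comparisons of this kind can be sketched with the navigation attributes that every tag already carries: .parent, .parents and .previous_elements. The function names below are hypothetical, and identity (is) is used instead of == because bs4 treats two tags with the same markup as equal. Note that under this sketch an ancestor also counts as coming before its descendants; whether is_before behaves the same way is not stated here.

```python
from bs4 import BeautifulSoup


def parent_of(a, b):
    """True if `a` is the direct parent of `b`."""
    return b.parent is a


def ancestor_of(a, b):
    """True if `a` appears anywhere on the path from `b` up to the root."""
    return any(p is a for p in b.parents)


def comes_before(a, b):
    """True if `a` is encountered before `b` in document order."""
    return any(e is a for e in b.previous_elements)


soup = BeautifulSoup(
    '<p class="title"><b>T</b></p><p class="story"><a id="link1">Elsie</a></p>',
    "html.parser",
)
title = soup.find(class_="title")
story = soup.find(class_="story")
link = soup.find(id="link1")
print(parent_of(story, link), comes_before(title, link))  # True True
```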
exclude_tags: filter out tags that are related to at least one tag in an excluded set. An optional third argument, such as is_parent, specifies the relation.
>> exclude_tags(soup.find_all("a"), soup.find(id="link2"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>> exclude_tags(soup.find_all("p"), soup.find_all("a"), is_parent)
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">...</p>]
restrict_tags: restrict to tags that are related to at least one tag in an included set. An optional third argument, such as is_parent, specifies the relation.
>> restrict_tags(soup.find_all("a"), soup.find(id="link2"))
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>> restrict_tags(soup.find_all("p"), soup.find_all("a"), is_parent)
[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>]
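Exclusion and restriction are two sides of the same test: whether a tag is related to any member of a second set. The sketch below makes that explicit. The names keep_related, drop_related and parent_of are hypothetical, and the default relation of identity is an assumption chosen to reproduce the two-argument examples above.

```python
import operator

from bs4 import BeautifulSoup, Tag


def _as_list(tags):
    # A lone Tag is itself iterable (over its children), so wrap it explicitly.
    return [tags] if isinstance(tags, Tag) else list(tags)


def keep_related(tags, others, relation=operator.is_):
    """Keep tags related to at least one of `others`."""
    others = _as_list(others)
    return [t for t in _as_list(tags) if any(relation(t, o) for o in others)]


def drop_related(tags, others, relation=operator.is_):
    """Drop tags related to at least one of `others`."""
    others = _as_list(others)
    return [t for t in _as_list(tags) if not any(relation(t, o) for o in others)]


def parent_of(a, b):
    return b.parent is a


soup = BeautifulSoup(
    '<p class="title"><b>T</b></p>'
    '<p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a></p>',
    "html.parser",
)
paragraphs, links = soup.find_all("p"), soup.find_all("a")
print([p["class"] for p in drop_related(paragraphs, links, parent_of)])  # [['title']]
print([a["id"] for a in drop_related(links, soup.find(id="link2"))])  # ['link1']
```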
remove_duplicate_tags: remove duplicate tags from a sequence, preserving order (useful with find_tags).
>> find_tags(soup, "a", next_siblings_())
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>> remove_duplicate_tags(_)
(<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)
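Order-preserving deduplication can be sketched by tracking which objects have already been seen. This version compares by identity, using id(), so that two distinct tags with identical markup are both kept; whether remove_duplicate_tags compares by identity or by equality is an assumption here, and unique_by_identity is a hypothetical name.

```python
from bs4 import BeautifulSoup


def unique_by_identity(tags):
    """Return a tuple of tags with repeated objects removed, order preserved."""
    seen = set()
    unique = []
    for tag in tags:
        # id() is safe here because `tags` keeps every object alive.
        if id(tag) not in seen:
            seen.add(id(tag))
            unique.append(tag)
    return tuple(unique)


soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>',
    "html.parser",
)
found = [sib for a in soup.find_all("a") for sib in a.find_next_siblings("a")]
print([t["id"] for t in found])  # ['link2', 'link3', 'link3']
print([t["id"] for t in unique_by_identity(found)])  # ['link2', 'link3']
```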