(#23, #25, #34) Document chroot tricks
This commit adds the resolutions to #23 and #25 to README.md under a
new tips & tricks section. Full source code for each is given in a new
example as well.

Additionally a backtracking bug was found in `chroot` while
documenting these tricks and is fixed in this commit as well.
fimad committed May 29, 2016
1 parent 9eb4bba commit 9be1886
Showing 7 changed files with 215 additions and 3 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,8 @@

## HEAD

- Added resolutions to #23 and #25 to README.md and added examples.

## 0.3.1

- Added the `innerHTML` and `innerHTMLs` scraper.
97 changes: 97 additions & 0 deletions README.md
@@ -109,3 +109,100 @@ allComments = scrapeURL "http://example.com/article.html" comments
imageURL <- attr "src" $ "img" @: [hasClass "image"]
return $ ImageComment author imageURL
```

Tips & Tricks
-------------

The primitives provided by scalpel are intentionally minimal, with the
expectation that users will build up complex functionality by combining them
with functions that work on the standard type classes (Monad, Applicative,
Alternative, etc.).
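
For example, the `Alternative` instance of `Scraper` already provides fallback
behavior for free. The following is a minimal sketch (the `author` scraper and
its selectors are illustrative and not part of the examples in this README):
prefer the span tagged with the "author" class, and settle for the first span
in the document if none is present.

```haskell
-- Try the styled author span first; if no such span exists, fall back to any
-- span in the current context.
author :: Scraper String String
author = text ("span" @: [hasClass "author"]) <|> text "span"
```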

This section gives examples of common tricks for building up more complex
behavior from the simple primitives provided by this library.

### Complex Predicates

In some scenarios the name and attributes of a tag are not sufficient to
isolate the interesting tags, and properties of child tags must also be
considered.

In these cases the `guard` function of the `Alternative` type class can be
combined with `chroot` and `Any` to implement predicates of arbitrary
complexity.

Building off the above example, consider a use case where we would like to find
the HTML contents of a comment that mentions the word "cat".

The strategy will be the following:

1. Isolate the comment div using `chroot`.

2. Within the context of that div, retrieve its textual contents with
   `text Any`. This works because the `Any` selector matches the first tag in
   the current context, which in this case is the div selected by `chroot`.

3. Use `guard` to enforce the predicate that `"cat"` appears in the text of
   the comment. If the predicate fails, scalpel will backtrack and continue
   searching through the divs until one matches.

4. Return the desired HTML content of the comment div.

```haskell
catComment :: Scraper String String
catComment =
-- 1. First narrow the current context to the div containing the comment's
-- textual content.
chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
-- 2. Any can be used to access the root tag of the current context.
contents <- text Any
-- 3. Skip comment divs that do not contain "cat".
guard ("cat" `isInfixOf` contents)
-- 4. Generate the desired value.
html Any
```

For the full source of this example, see
[complex-predicates](https://github.com/fimad/scalpel/tree/master/examples/complex-predicates/)
in the examples directory.

### Generalized Repetition

The pluralized versions of the primitive scrapers (`texts`, `attrs`, `htmls`)
allow the user to extract content from all of the tags matching a given
selector. For more complex scraping tasks it is sometimes necessary to extract
multiple values from the same tag.
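
To see why the pluralized scrapers alone are not enough, consider this minimal
sketch (the `imageURLs` scraper is illustrative and not part of the shipped
examples): `attrs` can collect every src attribute by itself, but it has no way
to pair each URL with the alt text of the same tag.

```haskell
-- Collects every src attribute, but the alt text of each individual img tag
-- is lost along the way.
imageURLs :: Scraper String [URL]
imageURLs = attrs "src" "img"
```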

As in the previous example, the trick is to combine the `chroots` function
with the `Any` selector.

Consider an extension of the original example where image comments may contain
alt text, and the goal is to return, for each image, a tuple of its alt text
and its URL.

The strategy will be the following:

1. Isolate each img tag using `chroots`.

2. Within the context of each img tag, use the `Any` selector to extract the
   alt and src attributes of the current tag.

3. Create and return a tuple of the extracted attributes.

```haskell
altTextAndImages :: Scraper String [(String, URL)]
altTextAndImages =
-- 1. First narrow the current context to each img tag.
chroots "img" $ do
-- 2. Use Any to access all the relevant content from the currently
-- selected img tag.
altText <- attr "alt" Any
srcUrl <- attr "src" Any
-- 3. Combine the retrieved content into the desired final result.
return (altText, srcUrl)
```

For the full source of this example, see
[generalized-repetition](https://github.com/fimad/scalpel/tree/master/examples/generalized-repetition/)
in the examples directory.
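
The two tricks also compose. The following is a hedged sketch that is not part
of the shipped examples (the `billsImages` name and the author value "Bill" are
assumptions based on the sample HTML used above): collect image URLs only from
comments left by a particular author by combining `chroots` with `guard`.

```haskell
-- Narrow to each comment container, keep only the containers whose author
-- span reads "Bill", and extract the src attribute of the img inside.
-- Containers that fail the guard or contain no img are simply skipped.
billsImages :: Scraper String [URL]
billsImages =
    chroots ("div" @: [hasClass "comment", hasClass "container"]) $ do
        author <- text $ "span" @: [hasClass "author"]
        guard (author == "Bill")
        attr "src" "img"
```
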
44 changes: 44 additions & 0 deletions examples/complex-predicates/Main.hs
@@ -0,0 +1,44 @@
import Text.HTML.Scalpel
import Control.Applicative
import Control.Monad
import Data.List (isInfixOf)


exampleHtml :: String
exampleHtml = "<html>\
\ <body>\
\ <div class='comments'>\
\ <div class='comment container'>\
\ <span class='comment author'>Sally</span>\
\ <div class='comment text'>Woo hoo!</div>\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Bill</span>\
\ <img class='comment image' src='http://example.com/cat.gif' />\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Bertrand</span>\
\ <div class='comment text'>That sure is some cat!</div>\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Susan</span>\
\ <div class='comment text'>WTF!?!</div>\
\ </div>\
\ </div>\
\ </body>\
\</html>"

main :: IO ()
main = print $ scrapeStringLike exampleHtml catComment

catComment :: Scraper String String
catComment =
-- 1. First narrow the current context to the div containing the comment's
-- textual content.
chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
-- 2. Any can be used to access the root tag of the current context.
contents <- text Any
-- 3. Skip comment divs that do not contain "cat".
guard ("cat" `isInfixOf` contents)
-- 4. Generate the desired value.
html Any
42 changes: 42 additions & 0 deletions examples/generalized-repetition/Main.hs
@@ -0,0 +1,42 @@
import Text.HTML.Scalpel


exampleHtml :: String
exampleHtml = "<html>\
\ <body>\
\ <div class='comments'>\
\ <div class='comment container'>\
\ <span class='comment author'>Sally</span>\
\ <div class='comment text'>Woo hoo!</div>\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Bill</span>\
\ <img alt='A cat picture.' \
\ class='comment image' src='http://example.com/cat.gif' />\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Susan</span>\
\ <div class='comment text'>WTF!?!</div>\
\ </div>\
\ <div class='comment container'>\
\ <span class='comment author'>Bill</span>\
\ <img alt='A dog picture.' \
\ class='comment image' src='http://example.com/dog.gif' />\
\ </div>\
\ </div>\
\ </body>\
\</html>"

main :: IO ()
main = print $ scrapeStringLike exampleHtml altTextAndImages

altTextAndImages :: Scraper String [(String, URL)]
altTextAndImages =
-- 1. First narrow the current context to each img tag.
chroots "img" $ do
-- 2. Use Any to access all the relevant content from the currently
-- selected img tag.
altText <- attr "alt" Any
srcUrl <- attr "src" Any
-- 3. Combine the retrieved content into the desired final result.
return (altText, srcUrl)
16 changes: 16 additions & 0 deletions examples/scalpel-examples.cabal
@@ -11,6 +11,14 @@ category: Web
build-type: Simple
cabal-version: >=1.10

executable complex-predicates
default-language: Haskell2010
main-is: complex-predicates/Main.hs
build-depends:
base >= 4.6 && < 5
, scalpel >= 0.2.0
ghc-options: -W

executable example-from-documentation
default-language: Haskell2010
main-is: example-from-documentation/Main.hs
@@ -19,6 +27,14 @@ executable example-from-documentation
, scalpel >= 0.2.0
ghc-options: -W

executable generalized-repetition
default-language: Haskell2010
main-is: generalized-repetition/Main.hs
build-depends:
base >= 4.6 && < 5
, scalpel >= 0.2.0
ghc-options: -W

executable list-all-images
default-language: Haskell2010
main-is: list-all-images/Main.hs
7 changes: 4 additions & 3 deletions src/Text/HTML/Scalpel/Internal/Scrape.hs
@@ -71,9 +71,10 @@ scrape s = scrapeOffsets s . tagWithOffset . TagSoup.canonicalizeTags
-- match every set of tags, use 'chroots'.
chroot :: (Ord str, TagSoup.StringLike str, Selectable s)
=> s -> Scraper str a -> Scraper str a
-chroot selector (MkScraper inner) = MkScraper
-    $ join . (inner <$>)
-    . listToMaybe . select selector
+chroot selector inner = do
+    maybeResult <- listToMaybe <$> chroots selector inner
+    guard (isJust maybeResult)
+    return $ fromJust maybeResult

-- | The 'chroots' function takes a selector and an inner scraper and executes
-- the inner scraper as if it were scraping a document that consists solely of
10 changes: 10 additions & 0 deletions tests/TestMain.hs
@@ -4,6 +4,8 @@ module Main (main) where
import Text.HTML.Scalpel

import Control.Applicative
import Control.Monad (guard)
import Data.List (isInfixOf)
import System.Exit
import Test.HUnit

@@ -224,6 +226,14 @@ scrapeTests = "scrapeTests" ~: TestList [
"<a>foo</a><a>bar</a>"
(Just ["foo","bar"])
(innerHTMLs "a")

, scrapeTest
"<a>foo</a><a>bar</a><a>baz</a>"
(Just "<a>bar</a>")
(chroot "a" $ do
t <- text Any
guard ("b" `isInfixOf` t)
html Any)
]

scrapeTest :: (Eq a, Show a) => String -> Maybe a -> Scraper String a -> Test
