UnderpantsGnome edited this page Sep 12, 2010 · 3 revisions

HpricotScrub

Hpricot Scrub is a wrapper around Hpricot that adds methods to scrub HTML tags from a document.

To Install


<pre>
gem install hpricot_scrub
</pre>

Now you can use the following to remove all tags from an HTML doc


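<pre>
require 'rubygems'
require 'open-uri'  # lets open() fetch URLs (on Ruby 3.0+, call URI.open instead)
require 'hpricot_scrub'

# Parse the page, then remove every tag and keep only the text
doc = Hpricot(open('http://slashdot.org/').read)
text = doc.scrub
</pre>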

Scrub the doc based on a config hash (sample config: /examples/config.yml)


<pre>
doc.scrub(hash)
</pre>

Strip all links (a tags), leaving the text inside intact


<pre>
(doc/:a).strip
</pre>
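To make the effect of stripping links concrete, here is a minimal plain-Ruby sketch of the same idea: the tags go away and the text they wrapped stays. It is only a stand-in for illustration. `unwrap_links` is a hypothetical helper built on a naive regex, not part of hpricot_scrub and not how the gem does it.

```ruby
# Stand-in for illustration only: NOT how hpricot_scrub is implemented.
# unwrap_links is a hypothetical helper that removes <a ...> and </a>
# but keeps whatever text sits between them.
def unwrap_links(html)
  # \b stops the pattern from also matching tags such as <abbr>
  html.gsub(%r{</?a\b[^>]*>}i, '')
end

html = '<p>Read the <a href="http://example.com/">docs</a> first.</p>'
puts unwrap_links(html)
# prints: <p>Read the docs first.</p>
```

A regex like this breaks on malformed or nested markup, which is why a real parser such as Hpricot is the better tool for actual documents.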

The gem version also has a couple of new convenience methods on String


<pre>
String#scrub(config={})
String#scrub!(config={})
</pre>

<pre>
>> str = '<a href="http://example.com/">example.com</a>'
=> "<a href=\"http://example.com/\">example.com</a>"
>> str.scrub
=> "example.com"
>> str
=> "<a href=\"http://example.com/\">example.com</a>"
>> str.scrub!
=> "example.com"
>> str
=> "example.com"
</pre>
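The session above follows the usual Ruby convention: the plain method returns a cleaned copy, and the bang method changes the receiver in place. The sketch below reproduces that pattern without the gem. `TagStripping`, `strip_tags` and `strip_tags!` are hypothetical names chosen so they do not collide with `String#scrub`, and the regex is a naive stand-in for real HTML parsing.

```ruby
# Hypothetical stand-ins, NOT the gem's implementation.
module TagStripping
  TAG = /<[^>]*>/

  # Non-destructive: returns a new string, the receiver is left untouched.
  def strip_tags
    gsub(TAG, '')
  end

  # Destructive: overwrites the receiver in place and returns it.
  def strip_tags!
    replace(strip_tags)
  end
end

str = '<a href="http://example.com/">example.com</a>'
str.extend(TagStripping)   # add the helpers to this one string only

copy = str.strip_tags
puts copy                  # example.com
puts str                   # still the original markup

str.strip_tags!
puts str                   # example.com
```

Extending a single string keeps the sketch from monkey-patching `String` globally. The gem itself adds its methods to `String` directly, as described above.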