Home
fizx edited this page Sep 13, 2010 · 11 revisions
Parsley is a simple language for extracting structured data from web pages. It consists of a powerful Selector Language wrapped in a JSON Structure that can represent page-wide formatting.
Check out A Simple Tutorial to extract a CSV file of beers by brewery and rating in only a few lines of code.
Parsley has a Command-line Interface, Ruby Bindings, Python Bindings, and a C Interface, and can output to JSON, CSV, and XML.
The following parselet parses a Yelp business listing (no endorsement implied).
{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.nonfavoriteReview)": [
    {
      "date": ".ieSucks .smaller",
      "user_name": ".reviewer_info a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}
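To make the shape of the parselet above easier to see, here is a minimal Python sketch that loads it with the standard `json` module and prints an outline of the output fields. It is not part of Parsley and does not run any selectors. The helpers `split_key` and `describe` are hypothetical names introduced for illustration, and the sketch assumes that a key written as `name(selector)` scopes its nested fields to each match of that selector, and that a one-element list marks a repeated group.

```python
import json
import re

# The parselet from the listing above, unchanged.
PARSELET = """
{
  "name": "h1",
  "phone": "#bizPhone",
  "address": "address",
  "reviews(.nonfavoriteReview)": [
    {
      "date": ".ieSucks .smaller",
      "user_name": ".reviewer_info a",
      "comment": "with-newlines(.review_comment)"
    }
  ]
}
"""

# Hypothetical pattern: an output name, optionally followed by a
# parenthesized scope selector, e.g. "reviews(.nonfavoriteReview)".
KEY_PATTERN = re.compile(r"^(?P<name>[^()]+?)(?:\((?P<scope>.*)\))?$")


def split_key(key):
    """Return (name, scope); scope is None for a plain key."""
    match = KEY_PATTERN.match(key)
    return match.group("name"), match.group("scope")


def describe(parselet, depth=0):
    """Return outline lines: one per field, nested groups indented."""
    lines = []
    for key, value in parselet.items():
        name, scope = split_key(key)
        repeated = isinstance(value, list)
        if repeated:
            value = value[0]  # assumed: the single element is the item template
        label = "  " * depth + name
        if repeated:
            label += "[]"
        if scope:
            label += f" (scoped to {scope})"
        if isinstance(value, dict):
            lines.append(label)
            lines.extend(describe(value, depth + 1))
        else:
            lines.append(f"{label}: {value}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(json.loads(PARSELET))))
```

The outline only mirrors the JSON structure; the selector strings themselves are passed through untouched.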
You can get JSON out by typing:
sh$: parsley businesses.let http://www.yelp.com/biz/amnesia-san-francisco
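If you want to drive the same command from a script, the following Python sketch wraps it with `subprocess`. It assumes the `parsley` executable is on your PATH and writes its JSON result to standard output; both are assumptions, not documented guarantees. `build_command` and `extract` are hypothetical helper names, and the `run` parameter exists only so the call can be stubbed out.

```python
import json
import subprocess


def build_command(parselet_path, url):
    """Argument list mirroring: parsley <parselet> <url>."""
    return ["parsley", parselet_path, url]


def extract(parselet_path, url, run=subprocess.run):
    """Run the CLI and decode its output (assumed to be JSON on stdout)."""
    result = run(
        build_command(parselet_path, url),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    data = extract(
        "businesses.let",
        "http://www.yelp.com/biz/amnesia-san-francisco",
    )
    print(json.dumps(data, indent=2))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.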
To run a site-wide crawl that writes a businesses.csv and a reviews.csv (with a foreign key to businesses), run:
sh$: skivvies businesses.let http://www.yelp.com/biz/amnesia-san-francisco
It’s that easy.
The sites shown are for example purposes only. Please obey robots.txt.