Hi! I cleaned up your code for you! #4

10 changes: 5 additions & 5 deletions INSTALL
@@ -1,6 +1,6 @@
Welcome to Parsley!

Parsley depends on
- argp (standard with Linux, other platforms use argp-standalone package)
- the JSON C library from http://oss.metaparadigm.com/json-c/ (I used 0.8)
- pcre (with dev headers)
@@ -12,17 +12,17 @@ Here's how to install it:

1. Get the release
------------------------------------------------------------------------
Parsley is currently still being tracked in git, and isn't ready to make a
formal release. So you need to either clone or download the latest tarball:

git clone git://github.com/fizx/parsley.git
or
wget http://github.com/fizx/parsley/tarball/master


2. Build for your platform
------------------------------------------------------------------------
Enter your parsley working directory (from the clone or download you
just made) and, based on your platform, do the following:


@@ -56,7 +56,7 @@ make
sudo make install

If you have a few extra minutes, consider replacing the last make with a
'make check' and let us know if it reports any failures from the test
suite - thanks!

3. Ruby Binding (via Gems)
18 changes: 9 additions & 9 deletions INTRO
@@ -17,17 +17,17 @@ In order to make this easy to learn, let's keep the best of what's working today

Now for some examples:

- 3rd paragraph:
p:nth-child(3)
- First sentence in that paragraph (period-delimited):
substring-before(p:nth-child(3), '.')
- Any simple phone number in an unordered list with the id "numbers":
re:match(ul#numbers>li, '\d{3}-\d{4}', 'g')
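
For readers less familiar with the XPath and EXSLT functions used above, here is a rough Python analogue of the last two expressions. It is only a sketch of the semantics: the helper names `substring_before` and `match_all` and the sample strings are made up for illustration, and Parsley itself evaluates these expressions through XSLT rather than anything like this.

```python
import re


def substring_before(text, delimiter):
    """Mimic XPath 1.0 substring-before(): the text preceding the first
    occurrence of delimiter, or "" when the delimiter does not occur."""
    if not delimiter:
        return ""
    head, sep, _ = text.partition(delimiter)
    return head if sep else ""


def match_all(text, pattern):
    """Rough analogue of re:match(text, pattern, 'g'): every
    non-overlapping match, in order of appearance."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Hypothetical sample data standing in for p:nth-child(3) and ul#numbers>li.
paragraph = "Call us today. We are open late."
numbers = "Front desk: 555-1234, kitchen: 555-9876"

print(substring_before(paragraph, "."))
print(match_all(numbers, r"\d{3}-\d{4}"))
```

Note that `substring_before` returns an empty string when the delimiter is missing, which matches XPath behaviour and differs from what a bare `str.partition` would give.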

We support all of CSS3, XPath1, as well as all functions in XSLT 1.0 and EXSLT (required+regexp).

I think this is a pretty good way to grab a single piece of data from a page. It's simple and gives you all of the tools (CSS for simplicity, XPath for power, regex for detailed text handling) you are used to, in one expression.

We'd like to make our scraper script both portable and fast. For both these reasons, we need to be able to express the structure of the scraped data independently of the general-purpose programming language you happen to be working in. Jumping from XPath to Python and back means multiple passes over the document, and Python idioms prevent easy use of your scraper by Rubyists. If we can represent the entire scrape in a language-independent way, we can compile it into something that libxml2 can handle in one pass, giving screaming-fast (milliseconds per parse) performance.

To describe the output structure, let's use JSON. It's compact, and the Ruby/Python/etc. bindings can use hashes/lists/dictionaries to represent the same structure. We can also have the scraper output JSON or native data structures. Here's an example script that grabs the title and all hyperlinks on a page:
@@ -36,21 +36,21 @@ To describe the output structure, lets use json. It's compact, and the Ruby/Pyt
"title": "h1",
"links": ["a"]
}

Applying this to http://www.yelp.com/biz/amnesia-san-francisco yields:

{
"title": "Amnesia",
"links": ["Yelp", "Welcome", "About Me", ... ]
}

You'll note that the output structure mirrors the input structure. In the Ruby binding, you can get both input and output natively:

> require "open-uri"
> require "parsley"
> Parsley.new({"title" => "h1", "links" => ["a"]}).parse(:url => "http://www.yelp.com/biz/amnesia-san-francisco")
#=> {"title"=>"Amnesia", "links"=>["Yelp", "Welcome", "About Me"]}
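
To make the "output mirrors input" idea concrete, here is a toy Python sketch that applies a parselet-shaped dictionary to an HTML string. It is an illustration only, under loud assumptions: it supports nothing but bare tag names as selectors, it uses the standard-library `html.parser` instead of libxml2, and `TagTextCollector`, `apply_parselet`, and the sample page are invented for this example. Parsley's real implementation compiles the parselet to XSLT.

```python
from html.parser import HTMLParser

# Elements that never get an end tag; they carry no text of their own.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTextCollector(HTMLParser):
    """Collect the text content of every element, keyed by tag name."""

    def __init__(self):
        super().__init__()
        self.texts = {}   # tag name -> list of text-fragment lists, in page order
        self._open = []   # stack of (tag, fragments) for currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parts = []
        self.texts.setdefault(tag, []).append(parts)
        self._open.append((tag, parts))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break


def apply_parselet(parselet, html):
    """A string value yields the first match; a one-item list yields all matches."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    result = {}
    for key, selector in parselet.items():
        if isinstance(selector, list):
            found = collector.texts.get(selector[0], [])
            result[key] = ["".join(parts).strip() for parts in found]
        else:
            found = collector.texts.get(selector, [])
            result[key] = "".join(found[0]).strip() if found else None
    return result


# Hypothetical page; not the real Yelp markup.
page = ("<html><body><h1>Amnesia</h1>"
        "<a href='/'>Yelp</a> <a href='/about'>About Me</a></body></html>")
print(apply_parselet({"title": "h1", "links": ["a"]}, page))
```

The point of the sketch is the shape of the result: a scalar selector produces a scalar, a bracketed selector produces a list, and the keys are carried over unchanged.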

We'll also add both explicit and implicit grouping. Here's an extension of the previous example with explicit grouping:

{
@@ -60,7 +60,7 @@ We'll also add both explicit and implicit grouping Here's an extension of the p
"link": "@href"
}]
}

The json structure in the output still mirrors the input, but now you can get both the link text and the href.

Pages like craigslist are slightly trickier to group. Elements on this page go h4, p, p, p, h4, p, p, p. To group this, you could do:
@@ -72,13 +72,13 @@ Pages like craigslist are slightly trickier to group. Elements on this page go
}]
}

If you instead wanted to group by date, you could use implicit grouping. It's implicit because the parenthesized filter is omitted. Grouping happens by page order. We treat the first single (i.e., non-square-bracketed) value (the h4 in the example below) as the beginning of a new group, and add the following values to that group (e.g., [h4, p, p, p], [h4, p, p], [h4, p]).

{
"entry":[{
"date": "h4",
"title": ["p"]
}]
}
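
The implicit grouping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Parsley's implementation: the function name `group_implicitly`, the `(tag, text)` input format, and the sample entries are assumptions made for the example, and dropping elements that appear before the first h4 is a choice of this sketch that the text does not specify.

```python
def group_implicitly(elements, leader="h4"):
    """Group (tag, text) pairs in page order.

    Each element whose tag equals `leader` starts a new group; every other
    element is appended to the most recent group.
    """
    groups = []
    for tag, text in elements:
        if tag == leader:
            groups.append({"date": text, "title": []})
        elif groups:
            groups[-1]["title"].append(text)
        # Elements seen before the first leader are skipped (sketch assumption).
    return groups


# Hypothetical page contents in document order: h4, p, p, p, h4, p, p, h4, p.
elements = [
    ("h4", "Mon"), ("p", "bike"), ("p", "sofa"), ("p", "lamp"),
    ("h4", "Tue"), ("p", "desk"), ("p", "chair"),
    ("h4", "Wed"), ("p", "table"),
]

for entry in group_implicitly(elements):
    print(entry)
```

With this input the groups have three, two, and one titles respectively, matching the [h4, p, p, p], [h4, p, p], [h4, p] pattern in the paragraph above.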

</textarea></html>
3 changes: 1 addition & 2 deletions Makefile.am
@@ -21,7 +21,7 @@ profile:

install-all:
./bootstrap.sh && ./configure && make && make install && cd ruby && rake install && cd ../python && python setup.py install

bench:
@echo "yelp..."; ./parsley test/yelp.let test/yelp.html > /dev/null
@echo "craigs-simple..."; ./parsley test/craigs-simple.let test/craigs-simple.html > /dev/null
@@ -73,4 +73,3 @@ check-am:
@echo "default-namespace..."; ./parsley -x test/default-namespace.let test/default-namespace.xml 2>&1 | diff test/default-namespace.json - && echo " success."
@echo "sg-wrap..."; ./parsley -s test/sg-wrap.let test/sg-wrap.html 2>&1 | diff test/sg-wrap.json - && echo " success."
@echo "collate_regression..."; ./parsley test/collate_regression.let test/collate_regression.html 2>&1 | diff test/collate_regression.json - && echo " success."

6 changes: 3 additions & 3 deletions Makefile.in
@@ -327,7 +327,7 @@ parser.h: parser.c
rm -f parser.c; \
$(MAKE) $(AM_MAKEFLAGS) parser.c; \
else :; fi
libparsley.la: $(libparsley_la_OBJECTS) $(libparsley_la_DEPENDENCIES)
$(LINK) -rpath $(libdir) $(libparsley_la_OBJECTS) $(libparsley_la_LIBADD) $(LIBS)
install-binPROGRAMS: $(bin_PROGRAMS)
@$(NORMAL_INSTALL)
@@ -372,10 +372,10 @@ clean-binPROGRAMS:
list=`for p in $$list; do echo "$$p"; done | sed 's/$(EXEEXT)$$//'`; \
echo " rm -f" $$list; \
rm -f $$list
parsley$(EXEEXT): $(parsley_OBJECTS) $(parsley_DEPENDENCIES)
@rm -f parsley$(EXEEXT)
$(LINK) $(parsley_OBJECTS) $(parsley_LDADD) $(LIBS)
parsleyc$(EXEEXT): $(parsleyc_OBJECTS) $(parsleyc_DEPENDENCIES)
@rm -f parsleyc$(EXEEXT)
$(LINK) $(parsleyc_OBJECTS) $(parsleyc_LDADD) $(LIBS)

2 changes: 1 addition & 1 deletion PAPER
@@ -27,7 +27,7 @@ Features
Examples
- Ruby/python/json
- structural parse
-

Benchmarks
- size comparison with XSLT
2 changes: 1 addition & 1 deletion Portfile
@@ -14,5 +14,5 @@ depends_lib port:argp-standalone \
port:json-c \
port:libxslt \
port:pcre

checksums md5 5e4d9080aa4ed2dfa7996c89a8e7f719 sha1 9508eea67212d9a9620eac3fe3719c91e00e11d9 rmd160 dfa9cee2fdb41ac750d47288d5128f1963a84334
2 changes: 1 addition & 1 deletion Portfile.in
@@ -14,4 +14,4 @@ depends_lib port:argp-standalone \
port:json-c \
port:libxslt \
port:pcre

46 changes: 23 additions & 23 deletions README.C-LANG
@@ -1,6 +1,6 @@
To use parsley from C, the following functions are available from parsley.h. In
addition, there is a function to convert xml documents of the type returned by
parsley into json.

You will also need passing familiarity with libxml2 and json-c to print, manipulate, and free some of the generated objects.

@@ -19,23 +19,23 @@ parsleyPtr parsley_compile(char* parsley, char* incl)

Arguments:
- char* parsley -- a string of parsley to compile.
- char* incl -- arbitrary XSLT to inject directly into the stylesheet,
outside any templates.

Returns: A structure that you can pass to parsley_parse_* to do the actual
parsing. This structure contains the compiled XSLT.

Notes: This is *NOT* thread-safe. (Usage of the parselet via parsley_parse_* *IS*
thread-safe, however.)

void parsley_set_user_agent(char *);

Sets the user-agent used by parsley's internal http library.

void parsley_free(parsleyPtr);

Frees the parsleyPtr's memory.

void parsed_parsley_free(parsedParsleyPtr);

Frees the parsedParsleyPtr's memory.
@@ -54,39 +54,39 @@ parsedParsleyPtr parsley_parse_file(parsleyPtr parsley, char* file_name, int fla
PARSLEY_OPTIONS_COLLATE = 16,
PARSLEY_OPTIONS_SGWRAP = 32
};

Returns: A libxml2 document of the extracted data. You need to free this
with xmlFree(). To output, look at the libxml2 documentation for functions
like xmlSaveFormatFile(). If you want json output, look below for xml2json
docs.

parsedParsleyPtr parsley_parse_string(parsleyPtr parsley, char* string, size_t len, char * base_uri, int flags);

Parses the in-memory string/length combination given. See parsley_parse_file
docs.

parsedParsleyPtr parsley_parse_doc(parsleyPtr parsley, xmlDocPtr doc, bool prune);

Uses the parsley parser to parse a libxml2 document.

From xml2json.h
===============

struct json_object * xml2json(xmlNodePtr);

Converts an xml subtree to json. The xml should be in the format returned
by parsley. Basically, xml attributes get ignored, and if you want an array
like [a,b], use:

<parsley:groups>
<parsley:group>a</parsley:group>
<parsley:group>b</parsley:group>
</parsley:groups>

To get a null-terminated string out, use:

json_object_to_json_string(struct json_object *)

To free (actually, to decrement the reference count), call:

json_object_put(struct json_object *)
2 changes: 1 addition & 1 deletion TODO
@@ -34,6 +34,6 @@
- saxon compatibility?!
- XML input converter?!
- check windows build
- flags?!
^ - force group-before
$ - force group-after
2 changes: 1 addition & 1 deletion aclocal.m4
@@ -7899,7 +7899,7 @@ _LT_DECL(, macro_revision, 0)
# included after everything else. This provides aclocal with the
# AC_DEFUNs it wants, but when m4 processes it, it doesn't do anything
# because those macros already exist, or will be overwritten later.
# We use AC_DEFUN over AU_DEFUN for compatibility with aclocal-1.6.
#
# Anytime we withdraw an AC_DEFUN or AU_DEFUN, remember to add it here.
# Yes, that means every name once taken will need to remain here until
28 changes: 14 additions & 14 deletions functions.c
@@ -27,7 +27,7 @@ void parsley_register_all(){
xsltInnerXmlFunction);
}

static void
xsltStarXMLFunction(xmlXPathParserContextPtr ctxt, int nargs, bool is_inner) {
if (nargs != 1) {
xsltTransformError(xsltXPathGetTransformContext(ctxt), NULL, NULL,
@@ -208,16 +208,16 @@ xsltHtmlDocumentFunctionLoadDocument(xmlXPathParserContextPtr ctxt, xmlChar* URI
"document() : internal error tctxt == NULL\n");
valuePush(ctxt, xmlXPathNewNodeSet(NULL));
return;
}

uri = xmlParseURI((const char *) URI);
if (uri == NULL) {
xsltTransformError(tctxt, NULL, NULL,
"document() : failed to parse URI\n");
valuePush(ctxt, xmlXPathNewNodeSet(NULL));
return;
}

/*
* check for and remove fragment identifier
*/
@@ -231,12 +231,12 @@ xsltHtmlDocumentFunctionLoadDocument(xmlXPathParserContextPtr ctxt, xmlChar* URI
} else
idoc = xsltLoadHtmlDocument(tctxt, URI);
xmlFreeURI(uri);

if (idoc == NULL) {
if ((URI == NULL) ||
(URI[0] == '#') ||
((tctxt->style->doc != NULL) &&
(xmlStrEqual(tctxt->style->doc->URL, URI))))
{
/*
* This selects the stylesheet's doc itself.
@@ -257,7 +257,7 @@ xsltHtmlDocumentFunctionLoadDocument(xmlXPathParserContextPtr ctxt, xmlChar* URI
valuePush(ctxt, xmlXPathNewNodeSet((xmlNodePtr) doc));
return;
}

/* use XPointer of HTML location for fragment ID */
#ifdef LIBXML_XPTR_ENABLED
xptrctxt = xmlXPtrNewContext(doc, NULL, NULL);
@@ -270,11 +270,11 @@ xsltHtmlDocumentFunctionLoadDocument(xmlXPathParserContextPtr ctxt, xmlChar* URI
resObj = xmlXPtrEval(fragment, xptrctxt);
xmlXPathFreeContext(xptrctxt);
#endif
xmlFree(fragment);

if (resObj == NULL)
goto out_fragment;

switch (resObj->type) {
case XPATH_NODESET:
break;
@@ -288,11 +288,11 @@ xsltHtmlDocumentFunctionLoadDocument(xmlXPathParserContextPtr ctxt, xmlChar* URI
case XPATH_RANGE:
case XPATH_LOCATIONSET:
xsltTransformError(tctxt, NULL, NULL,
"document() : XPointer does not select a node set: #%s\n",
fragment);
goto out_object;
}

valuePush(ctxt, resObj);
return;

@@ -303,7 +303,7 @@ xsltHtmlDocumentFunctionLoadDocument(xmlXPathParserContextPtr ctxt, xmlChar* URI
valuePush(ctxt, xmlXPathNewNodeSet(NULL));
}

xsltDocumentPtr
xsltLoadHtmlDocument(xsltTransformContextPtr ctxt, const xmlChar *URI) {
xsltDocumentPtr ret;
xmlDocPtr doc;
@@ -316,7 +316,7 @@ xsltLoadHtmlDocument(xsltTransformContextPtr ctxt, const xmlChar *URI) {
*/
if (ctxt->sec != NULL) {
int res;

res = xsltCheckRead(ctxt->sec, ctxt, URI);
if (res == 0) {
xsltTransformError(ctxt, NULL, NULL,