#WebScraper
##Config file specification
Main body consists of one json object with list of pages as properties:
- page_name - name of your choice for this type of page
{
%page_name%: {
...
},
...
}
Each page property can have 5 items:
-
test - string of conditions which identify this page type; make sure that these conditions satisfy only one page type
-
pageLinks - list of children pages with a selector to element from which url can be extracted
-
properties - list of properties to be extracted from this page type with a selector to wanted element
-
languages - list of links to this page in other languages
-
pagination - selector to element containing next page link if page has pagination
"test": [%condition%,...],
"pageLinks":{
%name%: %selector%,
...
},
"properties":{
%name%: %extractor%:%selector%
},
"languages":{
%lang_identifier%: %selector%
},
"pagination":%selector%
Conditions:
Access page object:
doc
Get element using Css selector or XPath:
.Css(...)
.XPath(...)
Access element inner text and work with it:
.InnerText = "..."
InnerText
is of type string, so you can access/call string properties/methods:
.InnerText.StartsWith(...)
Check url if it contains value:
doc.UrlContains(...)
Check page language:
doc.Language = ...
Extractors:
-
innertext extractor - extracts inner text only
-
innerhtml extractor - extracts inner html of an element; all content of and element (e.g. images) is downloaded and path to local temporary folder is put in href place
-
outerhtml extractor - extracts outer html of an element; all content of and element (e.g. images) is downloaded and path to local temporary folder is put in href place
-
image extractor - downloads image in img tag to temporary folder and prints path to output file
-
download extractor - downloads item in anchor tag to temporary folder and prints path to output file
Selectors:
Both CSS and XPath are valid selectors. However XPath has to be written in special format (except when using .XPath(...) condition):
*[xpath>'%path%']
where %path% is valid XPath. If XPath contains quotes write them in json as escaped double quotes (\").