HtmlScraper_how_to
HTML scraping in Squirrel is provided by the HtmlScraperAnalyzer.
As an analyzer, the scraper uses one YAML file per website to be scraped, containing all the selector syntax needed to find elements on the page. The environment variable HTML_SCRAPER_YAML_PATH must be declared, pointing to the folder where the YAML files are stored. If the variable is not present, the HtmlScraperAnalyzer will not do anything.
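That startup behaviour can be sketched as follows (a minimal, illustrative sketch; the class and method names here are hypothetical, not Squirrel's actual API):

```java
public class ScraperConfig {
    /** True when a YAML folder has been configured for the scraper. */
    static boolean isEnabled(String htmlScraperYamlPath) {
        return htmlScraperYamlPath != null && !htmlScraperYamlPath.isEmpty();
    }

    public static void main(String[] args) {
        // The analyzer reads the folder path from the environment at startup.
        String yamlPath = System.getenv("HTML_SCRAPER_YAML_PATH");
        if (!isEnabled(yamlPath)) {
            // Without the variable, the analyzer is effectively disabled.
            System.out.println("HTML_SCRAPER_YAML_PATH not set; the scraper will do nothing");
            return;
        }
        System.out.println("Loading scraper definitions from " + yamlPath);
    }
}
```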
An example of a YAML scraping file is described as follows:
file_descriptor:
  check:
    domain: europeandataportal.eu
    ignore-request: false
  search-result-page:
    regex: dataset?q=
    resources:
      "$uri":
        "http://sindice.com/vocab/search#link": .dataset-list h3 a
        "http://sindice.com/vocab/search#pagination": .pagination a
  download_page:
    regex: dataset/
    resources:
      "http://dice-research.squirrel.de/dataset_$label":
        "http://purl.org/dc/terms/title": .secondary section .heading:eq(0)
        "http://purl.org/dc/terms/description": .notes p
        "http://purl.org/dc/terms/distribution": l(http://dice-research.squirrel.de/distribution_$label)
        "http://purl.org/dc/terms/issued": table tr:contains(Created) td
        "http://purl.org/dc/terms/theme": l(http://mcloud.projekt-opal.de/theme_$label)
        "http://purl.org/dc/terms/publisher": l(http://mcloud.projekt-opal.de/agent_$label)
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": l(http://www.w3.org/ns/dcat#Dataset)
      "http://dice-research.squirrel.de/distribution_$label":
        "http://www.w3.org/ns/dcat#downloadURL*": .resource-item:not(:contains(wfs)):not(:contains(wms)):not(:contains(FTP)) [title*=go to resource]
        "http://www.w3.org/ns/dcat#accessURL*": .resource-item:contains(wms),a:contains(wfs),a:contains(FTP) [title*=go to resource]
        "http://purl.org/dc/terms/license": .license p a:not([rel])
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": l(http://www.w3.org/ns/dcat#Distribution)
For each page that will be scraped, two values should be included: regex and resources. If the file does not follow this hierarchy, an exception will be thrown. These values are described as follows:
regex: a pattern matched against the page's URI to decide whether this page definition applies
resources: a list of resources, predicates, and their respective objects. For querying elements, the scraper uses the CSS selectors from Jsoup (https://jsoup.org). Jsoup uses a jQuery-like selector syntax to select HTML components instead of XPath, and it is very simple to use. For a syntax reference, please see https://jsoup.org/apidocs/org/jsoup/select/Selector.html. You can use variables to define the names of your resources: $uri expands to the current page's URL, and $label expands to the last segment of the URL.
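As an illustration of how these variables can be resolved, the sketch below derives $label as the last path segment of a page URL (a simplified, hypothetical helper, not Squirrel's actual code):

```java
import java.net.URI;

public class ResourceVariables {
    /** Resolves $uri: simply the current page's URL. */
    static String resolveUri(String pageUrl) {
        return pageUrl;
    }

    /** Resolves $label: the last segment of the URL's path. */
    static String resolveLabel(String pageUrl) {
        String path = URI.create(pageUrl).getPath();
        int lastSlash = path.lastIndexOf('/');
        return path.substring(lastSlash + 1);
    }

    public static void main(String[] args) {
        String page = "https://www.europeandataportal.eu/data/datasets/my-dataset";
        // Prints "my-dataset"; a resource name such as
        // http://dice-research.squirrel.de/dataset_$label would therefore
        // become .../dataset_my-dataset for this page.
        System.out.println(resolveLabel(page));
    }
}
```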
If one of the triples may reference many other resources, append * to the end of the object, as in "http://dice-research.squirrel.de/distribution_$label*". Among the predicates of that resource, append an asterisk to the predicate that drives the iteration. In the example above, "http://www.w3.org/ns/dcat#downloadURL" and "http://www.w3.org/ns/dcat#accessURL" will never appear together (their selectors are mutually exclusive). So, if multiple downloadURL values are found, "http://dice-research.squirrel.de/distribution_$label" will be iterated over the number of elements found; the same holds for accessURL. The predicates without an asterisk are static, repeated for every element found. Also, n iterations will be created for the referencing triple as well.
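The iteration described above can be sketched as follows: each element matched by the starred predicate yields a new, numbered copy of the resource, so n matched download URLs produce n distinct distribution resources (a simplified sketch of the behaviour under these assumptions, not Squirrel's implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class StarIteration {
    /**
     * Emits one triple per matched object. Each copy of the subject
     * resource gets an index suffix so that n matched values yield
     * n distinct resources instead of one.
     */
    static List<String> expand(String subjectPattern, String predicate, List<String> matchedObjects) {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < matchedObjects.size(); i++) {
            // e.g. distribution_my-dataset_0, distribution_my-dataset_1, ...
            String subject = subjectPattern + "_" + i;
            triples.add("<" + subject + "> <" + predicate + "> <" + matchedObjects.get(i) + "> .");
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.org/a.csv", "http://example.org/b.csv");
        expand("http://dice-research.squirrel.de/distribution_my-dataset",
               "http://www.w3.org/ns/dcat#downloadURL", urls)
            .forEach(System.out::println);
    }
}
```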