HtmlScraper_how_to
HTML scraping in Squirrel is provided by the HtmlScraperAnalyzer.
As an analyzer, the scraper uses one YAML file per website to be scraped, containing all the selector syntax needed to find elements on the page. The environment variable HTML_SCRAPER_YAML_PATH must be declared, pointing to the folder where the YAML files are stored. If the variable is not present, the HtmlScraperAnalyzer will not do anything.
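That startup behaviour can be sketched as follows (a minimal, illustrative sketch; the class and method names here are hypothetical, not Squirrel's actual API):

```java
public class ScraperConfig {
    /** True when a YAML folder has been configured for the scraper. */
    static boolean isEnabled(String htmlScraperYamlPath) {
        return htmlScraperYamlPath != null && !htmlScraperYamlPath.isEmpty();
    }

    public static void main(String[] args) {
        // The analyzer reads the folder path from the environment at startup.
        String yamlPath = System.getenv("HTML_SCRAPER_YAML_PATH");
        if (!isEnabled(yamlPath)) {
            // Without the variable, the analyzer is effectively disabled.
            System.out.println("HTML_SCRAPER_YAML_PATH not set; the scraper will do nothing");
            return;
        }
        System.out.println("Loading scraper definitions from " + yamlPath);
    }
}
```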
An example of a YAML scraping file is described as follows:
file_descriptor:
  check:
    domain: europeandataportal.eu
    ignore-request: false
  search-result-page:
    regex: dataset?q=
    resources:
      "$uri":
        "http://sindice.com/vocab/search#link": .dataset-list h3 a
        "http://sindice.com/vocab/search#pagination": .pagination a
  download_page:
    regex: dataset/
    resources:
      "http://dice-research.squirrel.de/dataset_$label":
        "http://purl.org/dc/terms/title": .secondary section .heading:eq(0)
        "http://purl.org/dc/terms/description": .notes p
        "http://purl.org/dc/terms/distribution": l(http://dice-research.squirrel.de/distribution_$label)
        "http://purl.org/dc/terms/issued": table tr:contains(Created) td
        "http://purl.org/dc/terms/theme": l(http://mcloud.projekt-opal.de/theme_$label)
        "http://purl.org/dc/terms/publisher": l(http://mcloud.projekt-opal.de/agent_$label)
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": l(http://www.w3.org/ns/dcat#Dataset)
      "http://dice-research.squirrel.de/distribution_$label":
        "http://www.w3.org/ns/dcat#downloadURL*": .resource-item:not(:contains(wfs)):not(:contains(wms)):not(:contains(FTP)) [title*=go to resource]
        "http://www.w3.org/ns/dcat#accessURL*": .resource-item:contains(wms),a:contains(wfs),a:contains(FTP) [title*=go to resource]
        "http://purl.org/dc/terms/license": .license p a:not([rel])
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": l(http://www.w3.org/ns/dcat#Distribution)
For each page that will be scraped, two values should be included: regex and resources. If the file does not follow this hierarchy, an exception will be thrown. These values are described as follows:
regex: a pattern matched against the page's URI to decide whether this page definition applies
resources: a list of resources, predicates, and their respective objects. For querying elements, the scraper uses the CSS selectors from Jsoup (https://jsoup.org). Jsoup uses a jQuery-like selector syntax to select HTML components instead of XPath, and it is very simple to use. For a syntax reference, please see https://jsoup.org/apidocs/org/jsoup/select/Selector.html. You can use variables to define the names of your resources: $uri expands to the current page's URL, and $label expands to the last segment of the URL.
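As an illustration of how these variables can be resolved, the sketch below derives $label as the last path segment of a page URL (a simplified, hypothetical helper, not Squirrel's actual code):

```java
import java.net.URI;

public class ResourceVariables {
    /** Resolves $uri: simply the current page's URL. */
    static String resolveUri(String pageUrl) {
        return pageUrl;
    }

    /** Resolves $label: the last segment of the URL's path. */
    static String resolveLabel(String pageUrl) {
        String path = URI.create(pageUrl).getPath();
        int lastSlash = path.lastIndexOf('/');
        return path.substring(lastSlash + 1);
    }

    public static void main(String[] args) {
        String page = "https://www.europeandataportal.eu/data/datasets/my-dataset";
        // Prints "my-dataset"; a resource name such as
        // http://dice-research.squirrel.de/dataset_$label would therefore
        // become .../dataset_my-dataset for this page.
        System.out.println(resolveLabel(page));
    }
}
```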
If one of the triples may reference many other resources, append * to the end of the object, as in "http://dice-research.squirrel.de/distribution_$label*". Among the predicates of that resource, append an asterisk to the predicate that drives the iteration. In the example above, "http://www.w3.org/ns/dcat#downloadURL" and "http://www.w3.org/ns/dcat#accessURL" will never appear together (their selectors are mutually exclusive). So, if multiple downloadURL values are found, "http://dice-research.squirrel.de/distribution_$label" will be iterated over the number of elements found; the same holds for accessURL. The predicates without an asterisk are static, repeated for every element found. Also, n iterations will be created for the referencing triple as well.
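The iteration described above can be sketched as follows: each element matched by the starred predicate yields a new, numbered copy of the resource, so n matched download URLs produce n distinct distribution resources (a simplified sketch of the behaviour under these assumptions, not Squirrel's implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class StarIteration {
    /**
     * Emits one triple per matched object. Each copy of the subject
     * resource gets an index suffix so that n matched values yield
     * n distinct resources instead of one.
     */
    static List<String> expand(String subjectPattern, String predicate, List<String> matchedObjects) {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < matchedObjects.size(); i++) {
            // e.g. distribution_my-dataset_0, distribution_my-dataset_1, ...
            String subject = subjectPattern + "_" + i;
            triples.add("<" + subject + "> <" + predicate + "> <" + matchedObjects.get(i) + "> .");
        }
        return triples;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.org/a.csv", "http://example.org/b.csv");
        expand("http://dice-research.squirrel.de/distribution_my-dataset",
               "http://www.w3.org/ns/dcat#downloadURL", urls)
            .forEach(System.out::println);
    }
}
```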