OpenWines Open-Data scraper

OpenWines scraper is a simple command-line interface that scrapes 100% genuine open data in order to generate fixture datasets for the OpenWines Products Information Manager.

What it does

It scrapes open-data sources such as Wikipedia and returns complete lists of appellations and wine varietals (cépages), with one column per infobox attribute.

By running these commands:

bin/scraper appellation > output/appellations.csv  
bin/scraper cepage > output/cepages.csv  

you turn these infoboxes:

infobox

into this kind of CSV file (26 columns, one per attribute):

csv
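To check that a generated file has the expected number of columns, the header row can be counted with standard tools. This is only a minimal sketch: it assumes the output is comma-separated and that header names contain no embedded commas or quotes. `csv_column_count` is a hypothetical helper, not part of the scraper.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenWines scraper): count header columns.
# Assumes a comma delimiter and header names without embedded commas.
csv_column_count() {
    # Read only the first line and print its number of comma-separated fields.
    head -n 1 "$1" | awk -F',' '{ print NF }'
}
```

For example, `csv_column_count output/appellations.csv` prints the number of header fields, which can then be compared with the number of attributes in the infobox model.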

What it is

It's a simple command line application, heavily inspired by Cilex.

It scrapes Wikipedia URLs, parses structured content, and writes CSV files. It uses 2 sources:

How to install it

  1. git clone this repository.
  2. Download composer: curl -s https://getcomposer.org/installer | php
  3. Install dependencies: php composer.phar install

How to use it

To scrape a whole list of Wikipedia URLs:

bin/scraper appellation > output/appellations.csv  
bin/scraper cepage > output/cepages.csv  

For a single entity:

bin/scraper appellation muscadet > output/muscadet.csv
bin/scraper cepage cabernet-sauvignon > output/cabernet-sauvignon.csv
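
When several single entities are needed, the single-entity command can be wrapped in a loop. The following is a sketch, not a feature of the scraper: `scrape_entities`, `SCRAPER` and `OUTPUT_DIR` are hypothetical names introduced for illustration, and the defaults simply mirror the paths used in this README.

```shell
#!/bin/sh
# Sketch: scrape several single entities of one kind into separate CSV files.
# SCRAPER and OUTPUT_DIR are hypothetical overrides, not scraper options;
# they default to the paths used in this README.
scrape_entities() {
    kind="$1"   # "appellation" or "cepage"
    shift
    mkdir -p "${OUTPUT_DIR:-output}"
    for slug in "$@"; do
        # Same invocation as the single-entity commands above, one file per entity.
        "${SCRAPER:-bin/scraper}" "$kind" "$slug" > "${OUTPUT_DIR:-output}/$slug.csv" || return 1
    done
}
```

For instance, `scrape_entities appellation muscadet` writes `output/muscadet.csv`, and further entity names can be appended as extra arguments. The function stops at the first failing call so that a broken run is noticed early.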

Output examples:

Other available commands:

bin/scraper 
bin/scraper info
bin/scraper help appellation
bin/scraper appellation muscadet

How to hack it

  • Create your new commands in src/OpenWines/Command/
  • Add your new commands to bin/
  • Add new Wikipedia infobox models here
  • Add more URLs to scrape here (this should become a parameter; not done yet)

How to package it (in a PHAR)

  • Download and install box:
curl -LSs https://box-project.github.io/box2/installer.php | php
chmod +x box.phar
mv box.phar /usr/local/bin/box
  • Update the project phar config in box.json
  • Create the package:
box build
  • Run the commands:
./scraper.phar info
  • Enjoy!

How to reuse it

OpenWines Scraper is licensed under the Open Software License (OSL 3.0).

FAQ

Q: How do I scrape other kinds of infoboxes from other Wikipedia URLs?

A: You need to add 4 files in order to scrape your data:

  • +1 new InfoBoxType here
  • +1 new Console Command here
  • +1 new Wikipedia infobox model here (click on the (?) link in Wikipedia infoboxes to retrieve it)
  • +1 new Wikipedia URLs CSV file here

Q: How do I contribute to Wikidata?

A: I don't know how yet, but more and more people are asking for it, so feel free to contribute!