OpenWines Scraper is a simple command-line interface that scrapes 100% genuine open data in order to generate fixture datasets for the OpenWines Products Information Manager.
It scrapes open-data sources such as Wikipedia and returns complete lists of appellations and wine varietals (cépages), with one column per infobox attribute.
Running these commands:
bin/scraper appellation > output/appellations.csv
bin/scraper cepage > output/cepages.csv
turns Wikipedia infoboxes into CSV files with one column per infobox attribute (26 columns in the original example).
It's a simple command line application, heavily inspired by Cilex.
It scrapes Wikipedia URLs, parses structured content, and writes CSV files. It uses two sources:
- infobox definitions from Wikipedia like this one for appellations
- and lists of URLs from Wikipedia like this CSV for appellations.
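The parse-then-write step described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code (the project is PHP; Python is used here purely to show the idea). The sample wikitext, the field names `nom`, `pays`, `type` and `superficie`, and the helpers `parse_infobox` and `to_csv` are all hypothetical. Real infoboxes also contain nested templates and multi-line values that this sketch ignores.

```python
import csv
import io
import re

# Hypothetical infobox wikitext; the template and field names are
# illustrative and are not taken from the real Wikipedia infobox model.
SAMPLE_WIKITEXT = """{{Infobox Example
| nom = Muscadet
| pays = France
| type = AOC
}}"""

# Matches one "| key = value" line of a template.
FIELD_RE = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_infobox(wikitext):
    """Return {attribute: value} for every "| key = value" line."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def to_csv(rows, columns):
    """Render rows as CSV text with one column per known attribute.

    Attributes missing from a row become empty cells; attributes that are
    not listed in `columns` are dropped.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = parse_infobox(SAMPLE_WIKITEXT)
print(to_csv([row], ["nom", "pays", "type", "superficie"]))
```

Driving the column list from the infobox definition rather than from each page is what keeps every row the same width, even when a page leaves some attributes out.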
- git clone this repository.
- Download composer:
curl -s https://getcomposer.org/installer | php
- Install dependencies:
php composer.phar install
To scrape a whole list of Wikipedia URLs:
bin/scraper appellation > output/appellations.csv
bin/scraper cepage > output/cepages.csv
For a single entity:
bin/scraper appellation muscadet > output/muscadet.csv
bin/scraper cepage cabernet-sauvignon > output/cabernet-sauvignon.csv
Output examples:
- list of all appellations.csv
- list of all cepages.csv
- 1 appellation: muscadet.csv
- 1 cepage: cabernet-sauvignon.csv
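A quick way to sanity-check a generated file is to count its columns and rows. The sketch below assumes the output is comma-separated and starts with a header row, which the text above does not confirm; `summarize_csv` and the sample string are hypothetical stand-ins, not real scraper output.

```python
import csv
import io


def summarize_csv(text, delimiter=","):
    """Return (column_names, row_count) for CSV text with a header row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])  # empty input yields an empty header
    row_count = sum(1 for _ in reader)
    return header, row_count


# Hypothetical two-column sample, only to show the call.
sample = "name,region\nmuscadet,\n"
header, rows = summarize_csv(sample)
print(len(header), "columns,", rows, "rows")
```

To check a real file, pass its contents instead of the sample, for example the text read from `output/muscadet.csv`.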
Other available commands:
bin/scraper
bin/scraper info
bin/scraper help appellation
bin/scraper appellation muscadet
- Create your new commands in src/OpenWines/Command/
- Add your new commands to bin/
- Add new Wikipedia infobox models here
- Add more URLs to scrape here (this still needs to be made a parameter; not done yet)
- Download and install box:
curl -LSs https://box-project.github.io/box2/installer.php | php
chmod +x box.phar
mv box.phar /usr/local/bin/box
- Update the project phar config in box.json
- Create the package:
box build
- Run the commands:
./scraper.phar info
- Enjoy.
OpenWines Scraper is licensed under the Open Software License (OSL 3.0)
Q: How do I scrape other kinds of infoboxes from other Wikipedia URLs?
A: You need to add 4 files in order to scrape your data:
- +1 new InfoBoxType here
- +1 new Console Command here
- +1 new Wikipedia infobox model here (click on the (?) link in Wikipedia infoboxes to retrieve it)
- +1 new Wikipedia URLs CSV file here
Q: How do I contribute to Wikidata?
A: I don't know how yet, but more and more people are asking for it, so feel free to contribute!