List-Extractor is a tool that extracts information from Wikipedia lists and forms appropriate RDF triples from the list data.
This project contains 2 different tools: List-Extractor and Rules-Generator.
Use rulesGenerator.py first to generate the desired rules, and then use listExtractor.py to extract triples for wiki resources.
Alternatively, you can use only listExtractor.py and extract with the existing default settings.
For more details, refer to the documentation in the docs folder. The sample generated datasets can be found here. Some example triples for different domains are in the extracted folder.
python listExtractor.py [collect_mode] [source] [language] [-c class_name]
- collect_mode: s or a
  - use s to specify a single resource, or a for a class of resources, in the next parameter.
- source: a string representing a class of resources from the DBpedia ontology (find supported domains below), or a single Wikipedia page of an actor/writer.
- language: en, it, de etc. (for now, available only for some languages, for selected domains)
  - a two-letter prefix corresponding to the desired language of the Wikipedia pages and the SPARQL endpoint to be queried.
- -c --classname: a string representing the class names you want to associate your resource with. Applicable only for collect_mode="s".
NOTE: While extracting triples from multiple resources in a domain (collect_mode = a), using Ctrl + C will skip the current resource and move on to the next resource. To quit the extractor, use Ctrl + \.
- python listExtractor.py a Writer it
- python listExtractor.py s William_Gibson en : Uses the default built-in mapper functions.
- python listExtractor.py s William_Gibson en -c CUSTOM_WRITER : Uses only the CUSTOM_WRITER mapping to extract list elements.
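When several single resources need to be extracted, the invocations above can be scripted. The following is a minimal sketch and not part of the project: build_command and run_batch are hypothetical helper names, and the sketch assumes that listExtractor.py sits in the current directory and that python on the PATH is the Python 2.7 interpreter the project requires. It uses only the arguments documented above.

```python
import subprocess


def build_command(collect_mode, source, language, class_name=None):
    """Assemble the argument list for one listExtractor.py run (hypothetical helper)."""
    if collect_mode not in ("s", "a"):
        raise ValueError("collect_mode must be 's' or 'a'")
    if class_name is not None and collect_mode != "s":
        # The -c option is documented as applicable only to single resources.
        raise ValueError("-c only applies to collect_mode 's'")
    cmd = ["python", "listExtractor.py", collect_mode, source, language]
    if class_name is not None:
        cmd += ["-c", class_name]
    return cmd


def run_batch(resources, language="en"):
    """Run the extractor once per single resource (hypothetical helper)."""
    for resource in resources:
        # subprocess.call returns the exit status of the child process.
        status = subprocess.call(build_command("s", resource, language))
        if status != 0:
            print("extraction failed for %s" % resource)
```

For instance, run_batch(["William_Gibson"]) would launch the same command as the second example above.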
If successful, a .ttl file containing RDF statements about the specified source is created inside a subdirectory called extracted.
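To inspect the output, the generated files can be loaded with RDFlib. This is a minimal sketch, assuming the files sit directly inside the extracted subdirectory and carry the .ttl extension; find_ttl_files and count_triples are hypothetical helpers, not part of the project.

```python
import glob
import os


def find_ttl_files(directory="extracted"):
    """Return the sorted paths of all .ttl files directly inside `directory`."""
    return sorted(glob.glob(os.path.join(directory, "*.ttl")))


def count_triples(path):
    """Parse one Turtle file with RDFlib and return the number of triples in it."""
    # Imported here so that listing files works even if RDFlib is not installed.
    from rdflib import Graph

    graph = Graph()
    graph.parse(path, format="turtle")
    return len(graph)
```

Calling count_triples on each path returned by find_ttl_files() gives a quick check that an extraction produced parseable Turtle.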
python rulesGenerator.py
- This is an interactive tool; select the options given in the menu to use the rules generator.
- While creating new mapping rules or mapper functions, make sure to follow the required format as suggested by the tool.
- Upon successful addition/modification, it updates settings.json and custom_mapper.json so that the new user-defined rules/functions can run with the extractor.
- English (en):
  - Person: Writer, Actor, MusicalArtist, Athlete, Politician, Manager, Coach, Celebrity etc.
  - EducationalInstitution: University, School, College, Library
  - PeriodicalLiterature: Magazines, Newspapers, AcademicJournals
  - Group: Band
- Other (it, de, es): Writer, Actor, MusicalArtist
- More domains can be added using the rulesGenerator.py tool.
This project uses 2 other existing open source projects.
- JSONpedia, a framework designed to simplify access to MediaWiki content by transforming everything into JSON. The framework provides a library, a REST service and CLI tools to parse, convert, enrich and store WikiText documents.
The software is copyright of Michele Mostarda (me@michelemostarda.it) and released under the Apache 2 License. Link : JSONpedia
- JCommander, a very small Java framework that makes it trivial to parse command line parameters.
Contact Cédric Beust (cedric@beust.com) for more information. Released under Apache 2 License. Link : JCommander
- Python 2.7
- RDFlib library
- Stable internet connection