List-extractor - Extract Data from Wikipedia Lists

List-Extractor is a tool that extracts information from Wikipedia lists and forms appropriate RDF triples from the list data.

GSoC'16 Detailed Progress available here

Final commit of GSoC'16 can be found here

GSoC'17 Work's detailed progress available here

List-Extractor wiki available here

GSoC'17 Final results and challenges available here

How to run the tools

This project contains two different tools: List-Extractor and Rules-Generator. Use rulesGenerator.py first to generate the desired rules, and then use listExtractor.py to extract triples for wiki resources. Alternatively, you can use listExtractor.py on its own and extract with the existing default settings.

For more details, refer to the documentation in the docs folder. The sample generated datasets can be found here. Some example triples for different domains are in the extracted folder.

List-Extractor:

python listExtractor.py [collect_mode] [source] [language] [-c class_name]

collect_mode: s or a
- use s to specify a single resource, or a to specify a class of resources, in the next parameter.
source: a string representing a class of resources from the DBpedia ontology (see the supported domains below), or a single Wikipedia page, for example that of an actor or writer.
language: en, it, de, etc. (currently available only for some languages and selected domains)
- a two-letter prefix corresponding to the desired language of the Wikipedia pages and the SPARQL endpoint to be queried.
-c --classname: a string representing the class name you want to associate your resource with. Applicable only when collect_mode is s.
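
As a rough illustration of the argument layout described above, the following sketch parses the same four parameters with Python's standard argparse module. It is not taken from listExtractor.py; the function names build_parser and parse_cli are hypothetical, and the real script may validate its arguments differently.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented usage line:
    # python listExtractor.py [collect_mode] [source] [language] [-c class_name]
    parser = argparse.ArgumentParser(prog="listExtractor.py")
    parser.add_argument("collect_mode", choices=["s", "a"],
                        help="s for a single resource, a for a class of resources")
    parser.add_argument("source",
                        help="DBpedia ontology class or a single Wikipedia page title")
    parser.add_argument("language",
                        help="two-letter language prefix, e.g. en, it, de")
    parser.add_argument("-c", "--classname", default=None,
                        help="class name to associate with a single resource")
    return parser


def parse_cli(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # The README states that -c applies only to single-resource mode.
    if args.classname is not None and args.collect_mode != "s":
        parser.error("-c/--classname applies only when collect_mode is s")
    return args
```

Calling parse_cli(["s", "William_Gibson", "en", "-c", "CUSTOM_WRITER"]) yields an object whose classname attribute is "CUSTOM_WRITER", while combining -c with collect_mode a exits with a usage error.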

NOTE: While extracting triples from multiple resources in a domain (collect_mode = a), using Ctrl + C will skip the current resource and move on to the next resource. To quit the extractor, use Ctrl + \.
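
The skip-on-interrupt behaviour in the note can be pictured with the sketch below. It assumes a loop that catches KeyboardInterrupt (which Python raises on Ctrl + C) around each resource; the names extract_all and extract_one are hypothetical and this is not the extractor's actual code. Ctrl + \ sends SIGQUIT on Unix-like terminals, which Python does not trap by default, so the process ends instead of continuing the loop.

```python
def extract_all(resources, extract_one):
    """Run extract_one on every resource, skipping any that are interrupted.

    Both this function and extract_one are hypothetical stand-ins used
    only to illustrate the behaviour described in the note.
    """
    extracted, skipped = [], []
    for resource in resources:
        try:
            extract_one(resource)
        except KeyboardInterrupt:
            # Ctrl + C lands here: record the resource and move on.
            skipped.append(resource)
            continue
        extracted.append(resource)
    return extracted, skipped
```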

Examples:

python listExtractor.py a Writer it
python listExtractor.py s William_Gibson en : Uses the default inbuilt mapper-functions
python listExtractor.py s William_Gibson en -c CUSTOM_WRITER : Uses the CUSTOM_WRITER mapping only to extract list elements.

If successful, a .ttl file containing RDF statements about the specified source is created inside a subdirectory called extracted.
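
To give a sense of what such RDF statements look like, the sketch below turns a list of work titles into one statement per title, written in N-Triples syntax (which is also valid Turtle). The helper names and the choice of the DBpedia property author are assumptions made for illustration; the properties the extractor actually emits depend on its mapping rules. Titles containing characters other than spaces would need proper URI encoding, which is omitted here.

```python
DBR = "http://dbpedia.org/resource/"   # DBpedia resource namespace
DBO = "http://dbpedia.org/ontology/"   # DBpedia ontology namespace


def to_ntriple(subject, predicate, obj):
    # One RDF statement with three URI terms, terminated by " ."
    return "<{}> <{}> <{}> .".format(subject, predicate, obj)


def triples_for_works(author, works):
    # Hypothetical mapping: each list item (a work title) is linked
    # to the page's subject through the ontology property "author".
    author_uri = DBR + author
    return [
        to_ntriple(DBR + title.replace(" ", "_"), DBO + "author", author_uri)
        for title in works
    ]


statements = triples_for_works("William_Gibson", ["Neuromancer", "Count Zero"])
```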

Rules-Generator:

python rulesGenerator.py

This is an interactive tool; select options from the menu to use the rules generator.
When creating new mapping rules or mapper functions, make sure to follow the required format suggested by the tool.
Upon successful addition or modification, it updates settings.json and custom_mappers.json so that the new user-defined rules and functions can be used by the extractor.
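
The general idea of persisting a new rule to a JSON file can be sketched as follows. The layout used here (a domain name mapped to a list of mapper function names) is invented for illustration and is not the actual schema of settings.json or custom_mappers.json; the function name add_mapping and the mapper names are likewise hypothetical.

```python
import json
import os


def add_mapping(settings_path, domain, mapper_name):
    # Load the existing file if there is one, otherwise start empty.
    if os.path.exists(settings_path):
        with open(settings_path) as handle:
            settings = json.load(handle)
    else:
        settings = {}
    # Hypothetical layout: {"<domain>": ["<mapper function name>", ...]}
    mappers = settings.setdefault(domain, [])
    if mapper_name not in mappers:
        mappers.append(mapper_name)
    # Write the whole structure back so later runs see the new rule.
    with open(settings_path, "w") as handle:
        json.dump(settings, handle, indent=4, sort_keys=True)
    return settings
```

Adding the same mapper twice leaves the file unchanged, so the operation is safe to repeat.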

Default Mapped Domains:

English (en):
- Person: Writer, Actor, MusicalArtist, Athlete, Politician, Manager, Coach, Celebrity etc.
- EducationalInstitution: University, School, College, Library
- PeriodicalLiterature: Magazines, Newspapers, AcademicJournals
- Group: Band
Other (it, de, es):
- Writer, Actor, MusicalArtist
More domains can be added using the rulesGenerator.py tool.

Attributions for 3rd party tools:

This project uses two existing open-source projects.

JSONpedia, a framework designed to simplify access to MediaWiki content by transforming everything into JSON. The framework provides a library, a REST service and CLI tools to parse, convert, enrich and store WikiText documents.

The software is copyright of Michele Mostarda (me@michelemostarda.it) and released under Apache 2 License. Link : JSONpedia

JCommander, a very small Java framework that makes it trivial to parse command line parameters.

Contact Cédric Beust (cedric@beust.com) for more information. Released under Apache 2 License. Link : JCommander

Requirements

Python 2.7
RDFlib library
Stable internet connection

