mssammon/illinois-cross-lingual-wikifier

Illinois Cross-Lingual Wikifier

Given a piece of text in any language, a cross-lingual wikifier identifies mentions of named entities and grounds them to the corresponding entries in the English Wikipedia. This project implements the approaches proposed in the following two papers:

This demo will give you some intuition about this project.

Setup

Download this file, which contains MapDB indices of the Freebase dump and of the English, Spanish, and Chinese Wikipedias. Follow the README inside to extract the files and set the corresponding paths in the config file.

Note that we currently only release the resources for these three languages.

CogComp members who need to know where the resources for other languages are located should contact me.

Run Benchmark

mvn dependency:copy-dependencies
mvn compile
./scripts/run-benchmark.sh es config/xlwikifier-tac.config

This script evaluates Spanish and Chinese performance on the TAC-KBP 2016 EDL shared task. You need to specify the paths to the test documents and the gold annotations in the config file; see config/xlwikifier-tac.config for an example. These documents are in the original format provided by LDC. You should get the following performance on named entities:

Spanish 
strong mention match:       Precision:0.880 Recall:0.801 F1:0.838
strong typed mention match: Precision:0.854 Recall:0.778 F1:0.814
mention ceaf:               Precision:0.814 Recall:0.740 F1:0.775

Chinese
strong mention match:       Precision:0.868 Recall:0.724 F1:0.789
strong typed mention match: Precision:0.835 Recall:0.696 F1:0.759
mention ceaf:               Precision:0.814 Recall:0.678 F1:0.740
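
As a quick sanity check when comparing your own run against the scores above, the F1 column can be recomputed from the precision and recall columns. The sketch below is not part of this repository; the function name `f1_score` and the 0.001 tolerance are assumptions made for illustration. The tolerance is needed because the printed precision and recall are already rounded to three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the scores listed above.
reported = {
    ("Spanish", "strong mention match"): (0.880, 0.801, 0.838),
    ("Spanish", "strong typed mention match"): (0.854, 0.778, 0.814),
    ("Spanish", "mention ceaf"): (0.814, 0.740, 0.775),
    ("Chinese", "strong mention match"): (0.868, 0.724, 0.789),
    ("Chinese", "strong typed mention match"): (0.835, 0.696, 0.759),
    ("Chinese", "mention ceaf"): (0.814, 0.678, 0.740),
}

for key, (p, r, f1) in reported.items():
    # Hypothetical tolerance: inputs are rounded, so allow 0.001 of slack.
    assert abs(f1_score(p, r) - f1) <= 0.001, key
```

If your own numbers differ from the reported ones by more than rounding error, the paths to the test documents or gold annotations in the config file are a reasonable first thing to check.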

Train NER Model

Use ./scripts/train-ner.sh to train Illinois NER models with wikifier features. Note that the training and test files should be in the column format.

Train Wikifier Ranking Model

Requirements:

  • Download and build the ranking version of liblinear. In the cross-lingual wikifier config file, set "liblinear_path" to the liblinear folder that contains the binary file "train".
  • The path to the processed Wikipedia dumps needs to be set in the config. These dumps are currently available only on CogComp machines.

./scripts/train-ranker.sh es config/xlwikifier-tac.config

This script trains ranking models using Wikipedia articles. The resulting model is saved at the location specified in the config file.
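
Before starting a training run, it may help to confirm that the folder you plan to use for "liblinear_path" really contains an executable named "train", as the requirements state. The following is a minimal sketch, not part of this repository; the helper name `has_train_binary` is hypothetical, and it assumes a Unix-like system where the binary carries the execute permission.

```python
import os


def has_train_binary(liblinear_dir: str) -> bool:
    """Return True if liblinear_dir contains an executable file named 'train'."""
    candidate = os.path.join(liblinear_dir, "train")
    # Hypothetical check: a regular file that the current user may execute.
    return os.path.isfile(candidate) and os.access(candidate, os.X_OK)
```

Call it with the same folder you put in the config file; a False result usually means liblinear has not been built yet or the path points one level too high or too low.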

Contact

Chen-Tse Tsai (ctsai12@illinois.edu)
