Given a piece of text in any language, a cross-lingual wikifier identifies mentions of named entities and grounds them to the corresponding entries in the English Wikipedia. This project implements the approaches proposed in the following two papers:
- Cross-Lingual Wikification Using Multilingual Embeddings (Tsai and Roth, NAACL 2016)
- Cross-Lingual Named Entity Recognition via Wikification (Tsai et al., CoNLL 2016)
This demo will give you some intuition about this project.
Download this file, which contains MapDB indices of the Freebase dump and of the English, Spanish, and Chinese Wikipedias. Follow the README inside to extract the files, then set the corresponding paths in the config file.
Note that we currently only release the resources for these three languages.
For CogComp members: if you want to know where the resources for other languages are, please contact me.
mvn dependency:copy-dependencies
mvn compile
./scripts/run-benchmark.sh es config/xlwikifier-tac.config
This script evaluates Spanish and Chinese performance on the TAC-KBP 2016 EDL shared task. You need to specify the paths to the test documents and the gold annotations in the config file; see config/xlwikifier-tac.config for an example. The documents must be in the original format provided by LDC. You will get the following performance on named entities:
Spanish
strong mention match: Precision:0.880 Recall:0.801 F1:0.838
strong typed mention match: Precision:0.854 Recall:0.778 F1:0.814
mention ceaf: Precision:0.814 Recall:0.740 F1:0.775
Chinese
strong mention match: Precision:0.868 Recall:0.724 F1:0.789
strong typed mention match: Precision:0.835 Recall:0.696 F1:0.759
mention ceaf: Precision:0.814 Recall:0.678 F1:0.740
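If you want to compare your own run against the numbers above, the score lines are regular enough to parse. The following is a minimal Python sketch, not part of this project: the helper names parse_scores and f1 are hypothetical, and it assumes your output uses the same "metric: Precision:… Recall:… F1:…" layout as the lines shown here. It recomputes F1 as the harmonic mean of precision and recall and checks it against the reported value, allowing for rounding to three decimals.

```python
import re

# Matches lines such as:
#   strong mention match: Precision:0.880 Recall:0.801 F1:0.838
LINE_RE = re.compile(
    r"^(?P<metric>.+?):\s*Precision:(?P<p>[\d.]+)\s+"
    r"Recall:(?P<r>[\d.]+)\s+F1:(?P<f1>[\d.]+)\s*$"
)


def parse_scores(text):
    """Return {metric: (precision, recall, f1)} for every score line in text."""
    scores = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            scores[match.group("metric")] = (
                float(match.group("p")),
                float(match.group("r")),
                float(match.group("f1")),
            )
    return scores


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# The Spanish results listed above, copied verbatim.
spanish = """\
strong mention match: Precision:0.880 Recall:0.801 F1:0.838
strong typed mention match: Precision:0.854 Recall:0.778 F1:0.814
mention ceaf: Precision:0.814 Recall:0.740 F1:0.775
"""

for metric, (p, r, reported) in parse_scores(spanish).items():
    # Precision and recall are already rounded, so allow a small tolerance.
    assert abs(f1(p, r) - reported) < 0.002, metric
```

Lines that do not match the pattern, such as the language names, are skipped, so the whole benchmark output can be passed in one language at a time.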
Use ./scripts/train-ner.sh to train Illinois NER models with wikifier features. Note that the training and test files should be in the column format.
Requirements:
- Download and build the ranking version of liblinear. In the config file of the cross-lingual wikifier, set "liblinear_path" to the liblinear folder that contains the binary file "train".
- The path to the processed Wikipedia dumps needs to be set in the config. These dumps are currently available only on CogComp machines.
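Before starting a long training run, it can help to confirm that "liblinear_path" really points at a folder containing the "train" binary. The sketch below is a rough Python check, not part of this project. The helper names read_config and check_liblinear are hypothetical, and the parser assumes the config file uses simple "key = value" lines with "#" comments, which may not match the actual config syntax; adjust it to your file.

```python
import os


def read_config(path):
    """Parse 'key = value' lines (assumed format); skip blanks and '#' comments."""
    settings = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def check_liblinear(settings):
    """Return a list of problems with the liblinear_path setting (empty if fine)."""
    folder = settings.get("liblinear_path")
    if not folder:
        return ["liblinear_path is not set"]
    if not os.path.isfile(os.path.join(folder, "train")):
        return ["no 'train' binary found in " + folder]
    return []
```

For example, check_liblinear(read_config("config/xlwikifier-tac.config")) returns an empty list when the setting looks usable, and a list of human-readable problems otherwise.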
./scripts/train-ranker.sh es config/xlwikifier-tac.config
This script trains ranking models using Wikipedia articles. The resulting model is saved at the location specified in the config file.
Chen-Tse Tsai (ctsai12@illinois.edu)