- You can create datasets from Wikia/Wikipedia that can be used for both entity recognition and Entity Linking.
- A sample dataset is available here. See also the preprocessed data examples.
- Ongoing under branch `feature/FixEnParseBug`.
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
$ # Install wikiextractor==3.0.5 from source (https://github.com/attardi/wikiextractor) to enable the --json option.
- Download `[worldname]_pages_current.xml` from the wikia statistics page to `./dataset/`.
  - For example, if you are interested in Virtual YouTuber, download the `virtualyoutuber_pages_current.xml` dump from here.
$ sh ./scripts/vtuber.sh
- `-augmentation_with_title_set_string_match` (Default: `True`)
  - When this parameter is `True`, we first construct a title set from all pages in one wikia .xml dump. Then, whenever a string in the text matches a title in this set, we treat that mention as an annotated one.
- `-in_document_augmentation_with_its_title` (Default: `True`)
  - When this parameter is `True`, we add further annotations to the dataset with distant supervision from the title of the page where the mention appears.
  - For example, the page of Anakin Skywalker mentions him without an anchor link, as Anakin or Skywalker.
  - With this parameter on, we treat these mentions as annotated ones.
- `-spacy_model` (Default: `en_core_web_md`)
  - Specifies the spaCy model used for sentence boundary detection (SBD).
  - Note: SBD with spaCy is conducted only when `-multiprocessing` is `False`.
- `-language` (Default: `en`)
- `-multiprocessing` (Default: `False`)
  - If `True`, documents are processed with multiprocessing after preprocessing with wikiextractor.
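
The two augmentation options above both come down to string matching against known titles. The following is only a rough sketch of that idea, not the code used in this repository: the helper name `find_title_mentions`, the longest-match-first rule, and the word-boundary check are all assumptions made for illustration, and the example sentence is made up.

```python
import re
from typing import Dict, List, Set


def find_title_mentions(sentence: str, titles: Set[str]) -> List[Dict]:
    """Return non-overlapping spans of `sentence` that exactly match a title.

    Hypothetical helper: longer titles are tried first, so that
    "Anakin Skywalker" wins over "Anakin" when both are in the set.
    """
    found: List[Dict] = []
    taken = [False] * len(sentence)  # characters already claimed by a match
    for title in sorted(titles, key=len, reverse=True):
        if not title:
            continue
        # \b keeps "Anakin" from matching inside a longer word.
        pattern = r"\b" + re.escape(title) + r"\b"
        for m in re.finditer(pattern, sentence):
            if any(taken[m.start():m.end()]):
                continue  # overlaps a span claimed by a longer title
            for i in range(m.start(), m.end()):
                taken[i] = True
            found.append({"mention": m.group(0), "start": m.start(), "end": m.end()})
    return sorted(found, key=lambda span: span["start"])


# Title set built from the whole wikia, plus parts of the current page title
# (the latter mimics in-document augmentation; both sets are assumed here).
title_set = {"Anakin Skywalker", "Tatooine"}
page_title_parts = {"Anakin", "Skywalker"}
print(find_title_mentions("Anakin left Tatooine as a boy.", title_set | page_title_parts))
```

Matched spans found this way would then be treated as if they had been anchor-linked in the original page.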
- The dataset was constructed using Wikias (from FANDOM) and Wikipedia. It is licensed under the Creative Commons Attribution-ShareAlike License (CC-BY-SA).
Preprocessed data example from Wikia.
key | its_content |
---|---|
document_title | Page title where the annotation exists. |
anchor_sent | Anchored sentence, with the mention wrapped in `<a>` and `</a>`. This anchor can be used for Entity Linking. |
annotation_doc_entity_title | Title of the entity that the mention links to once disambiguated. Redirects are also resolved. |
mention | Surface form exactly as it appears in the sentence. |
original_sentence | Sentence without anchors. |
original_sentence_mention_start | Start position of the mention span in the original sentence (character offset). |
original_sentence_mention_end | End position of the mention span in the original sentence (character offset, exclusive). |
- For instance, a real-world example of `annotations.json` from the virtualyoutuber wikia is shown below.
[
{
"document_title": "Melissa Kinrenka",
"anchor_sent": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of <a> Nijisanji </a>.",
"annotation_doc_entity_title": "Nijisanji",
"mention": "Nijisanji",
"original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
"original_sentence_mention_start": 75,
"original_sentence_mention_end": 84
},
{
"document_title": "Melissa Kinrenka",
"anchor_sent": "<a> Melissa Kinrenka </a> (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
"annotation_doc_entity_title": "Melissa Kinrenka",
"mention": "Melissa Kinrenka",
"original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
"original_sentence_mention_start": 0,
"original_sentence_mention_end": 16
},
...
]
...
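
Because the start and end fields are character offsets into `original_sentence`, a quick sanity check is to slice the sentence and compare the result with `mention`. A minimal sketch, assuming the end offset is exclusive and the anchor tags are padded with single spaces (as the two records above suggest), and that the file is a JSON list of such records. The helper names `check_annotation`, `strip_anchor`, and `load_annotations` are hypothetical and not part of this repository.

```python
import json
from typing import Dict, List


def check_annotation(record: Dict) -> bool:
    """True if slicing original_sentence by the stored offsets yields the mention."""
    start = record["original_sentence_mention_start"]
    end = record["original_sentence_mention_end"]
    return record["original_sentence"][start:end] == record["mention"]


def strip_anchor(anchor_sent: str) -> str:
    """Remove the '<a> ' / ' </a>' markers (spacing assumed from the examples)."""
    return anchor_sent.replace("<a> ", "").replace(" </a>", "")


def load_annotations(path: str) -> List[Dict]:
    """Load annotations.json, assumed to be a JSON list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# First record from the example above.
example = {
    "document_title": "Melissa Kinrenka",
    "anchor_sent": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of <a> Nijisanji </a>.",
    "annotation_doc_entity_title": "Nijisanji",
    "mention": "Nijisanji",
    "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
    "original_sentence_mention_start": 75,
    "original_sentence_mention_end": 84,
}
print(check_annotation(example))  # True
print(strip_anchor(example["anchor_sent"]) == example["original_sentence"])  # True
```

Running `check_annotation` over every record returned by `load_annotations` is a cheap way to catch offset drift after custom preprocessing.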
- Redirect-resolved titles and their descriptions, split into sentences, are also available.
{
"Furen E Lustario": [
"Furen E Lustario (フレン・E・ルスタリオ) is a female Japanese Virtual YouTuber and member of Nijisanji.",
"A female knight of the Corvus Empire.",
"Introduction Video.",
"Furen's introduction.",
"Personality.",
"Furen lacks a surprising amount of common sense.",
"It has been displayed in at least two streams that she cannot tell from left to right.",
...
],
"Ibrahim": [
"Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
"A former oil tycoon from the Corvus Empire.",
"Since the value of oil has fallen, he now makes a living from a hot spring that he accidentally dug up.",
"History.",
"Background.",
"Ibrahim made his YouTube debut on 1 February 2020.",
...
],
...
}
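
One possible use of this title-to-sentences mapping is as a source of short entity descriptions for Entity Linking. A minimal sketch, assuming the mapping has the shape shown above (presumably the `doc_title2sents.json` file mentioned in this README); `entity_description` and its `max_sents` parameter are hypothetical names chosen for illustration.

```python
from typing import Dict, List


def entity_description(title2sents: Dict[str, List[str]], title: str, max_sents: int = 2) -> str:
    """Join the first `max_sents` sentences of an entity page into one description.

    Returns an empty string for titles that are not in the mapping.
    """
    sentences = title2sents.get(title, [])
    return " ".join(sentences[:max_sents])


# Excerpt of the mapping shown above.
title2sents = {
    "Ibrahim": [
        "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
        "A former oil tycoon from the Corvus Empire.",
        "History.",
    ]
}
print(entity_description(title2sents, "Ibrahim"))
```

Note that section headings such as "History." appear as sentences of their own, so a real pipeline may want to filter very short entries before building descriptions.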
- Add Entity Type to doc_title2sents.json for each entity.
izuna385(_atmark)gmail.com
- PRs and issues are welcome!