Wikia/Wikipedia-NER-and-EL-Dataset-Creator

You can create datasets from Wikia/Wikipedia that can be used for both named entity recognition (NER) and entity linking (EL).
A sample dataset is available here. See also the preprocessed data examples.

Sample ja-wiki dataset

Here

Create en-wiki dataset.

Work in progress on the feature/FixEnParseBug branch.

Environment Setup for Preprocessing.

$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
# Install wikiextractor==3.0.5 from source (https://github.com/attardi/wikiextractor) to enable the --json option.

Dataset Preparation

For Wikia

Download [worldname]_pages_current.xml from the Wikia statistics page to ./dataset/.
- For example, if you are interested in Virtual YouTubers, download the virtualyoutuber_pages_current.xml dump from here.

For Wikipedia

Download a Wikipedia dump from here (en) or here (ja) and decompress the bzip2 file.
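If you prefer to decompress the dump from Python rather than with a command-line tool, the following is a minimal sketch using only the standard library. The function name `decompress_bz2` and the file names in the usage note are illustrative assumptions, not part of this repository.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Stream-decompress a .bz2 file to dst.

    The data is copied in chunks, so a large dump is never held in memory
    all at once. (Hypothetical helper, not part of this repository.)
    """
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

For example, `decompress_bz2(Path("./dataset/dump.xml.bz2"), Path("./dataset/dump.xml")`, where both paths are placeholders for wherever you saved the download.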

Sample Script for Creating EL Dataset.

$ sh ./scripts/vtuber.sh

Parameters for Creating Dataset

-augmentation_with_title_set_string_match (Default: True)
- When True, a title set is first built from all pages in the Wikia .xml dump. Any string in the text that matches a title in this set is then treated as an annotated mention.
-in_document_augmentation_with_its_title (Default: True)
- When True, additional annotations are added by distant supervision from the title of the page in which the mention appears.
- For example, the Anakin Skywalker page mentions him as Anakin or Skywalker without an anchor link.
- With this parameter on, these mentions are treated as annotated ones.
-spacy_model (Default: en_core_web_md)
- Specifies the spaCy model used for sentence boundary detection (SBD).
- Note: SBD with spaCy is performed only when -multiprocessing is False.
-language (Default: en)
- Specifies the language of the documents.
- When en is selected and -multiprocessing is False, spaCy is used for SBD.
- When en is selected and -multiprocessing is True, pysbd is used for SBD.
- When ja is selected, konoha is used for SBD.
-multiprocessing (Default: False)
- If True, the documents produced by wikiextractor are processed with multiple processes.
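The interaction between -language and -multiprocessing described above can be summarized as a small lookup. This is only a sketch of the documented behavior: the function name `select_sbd_backend` is hypothetical, and the repository's actual implementation may be organized differently.

```python
def select_sbd_backend(language: str, multiprocessing: bool) -> str:
    """Return the name of the sentence splitter described for these settings.

    Hypothetical helper that mirrors the parameter descriptions above;
    it does not import or call any of the named libraries.
    """
    if language == "ja":
        # Japanese documents are split with konoha in either mode.
        return "konoha"
    if language == "en":
        # English uses pysbd when multiprocessing is on, spaCy otherwise.
        return "pysbd" if multiprocessing else "spacy"
    raise ValueError(f"unsupported language: {language!r}")
```

Note that, by this description, -spacy_model only has an effect in the `("en", False)` case.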

License

The dataset was constructed from Wikias (hosted by FANDOM) and Wikipedia. It is licensed under the Creative Commons Attribution-ShareAlike License (CC BY-SA).

Preprocessed data example from Wikia.

data

`annotation.json`

| Key | Content |
| --- | --- |
| `document_title` | Title of the page in which the annotation appears. |
| `anchor_sent` | Sentence with the mention marked by `<a>` and `</a>`. This anchor can be used for entity linking. |
| `annotation_doc_entity_title` | Title of the entity the mention should be linked to once it is disambiguated. Redirects are taken into account. |
| `mention` | Surface form exactly as it appears in the sentence. |
| `original_sentence` | Sentence without anchors. |
| `original_sentence_mention_start` | Start position of the mention span in the original sentence. |
| `original_sentence_mention_end` | End position of the mention span in the original sentence. |

For instance, here is a real-world example of `annotation.json` from the virtualyoutuber Wikia.

[
    {
        "document_title": "Melissa Kinrenka",
        "anchor_sent": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of <a> Nijisanji </a>.",
        "annotation_doc_entity_title": "Nijisanji",
        "mention": "Nijisanji",
        "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "original_sentence_mention_start": 75,
        "original_sentence_mention_end": 84
    },
    {
        "document_title": "Melissa Kinrenka",
        "anchor_sent": "<a> Melissa Kinrenka </a> (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "annotation_doc_entity_title": "Melissa Kinrenka",
        "mention": "Melissa Kinrenka",
        "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "original_sentence_mention_start": 0,
        "original_sentence_mention_end": 16
    },
    ...
]
...
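The fields in these records are related in a simple way: in the sample, the start and end values behave as character offsets with an exclusive end, and `anchor_sent` is the original sentence with `<a> ` and ` </a>` wrapped around that span. The sketch below reproduces a record from a sentence and a span, and shows how an exact string match (in the spirit of the title-based augmentation parameters) could locate such a span. The helper names `find_mentions` and `build_annotation` are hypothetical and are not taken from this repository.

```python
from typing import Dict, Iterator, Tuple, Union

Record = Dict[str, Union[str, int]]


def find_mentions(sentence: str, surface: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) character spans of each exact occurrence of surface."""
    if not surface:
        return  # an empty surface form would match everywhere
    start = sentence.find(surface)
    while start != -1:
        end = start + len(surface)
        yield start, end
        start = sentence.find(surface, end)


def build_annotation(document_title: str, sentence: str,
                     start: int, end: int, entity_title: str) -> Record:
    """Build one record in the annotation.json layout shown above."""
    mention = sentence[start:end]
    # The sample data pads the mention with one space inside the anchor tags.
    anchor_sent = f"{sentence[:start]}<a> {mention} </a>{sentence[end:]}"
    return {
        "document_title": document_title,
        "anchor_sent": anchor_sent,
        "annotation_doc_entity_title": entity_title,
        "mention": mention,
        "original_sentence": sentence,
        "original_sentence_mention_start": start,
        "original_sentence_mention_end": end,
    }
```

Because the offsets are plain Python string indices, `original_sentence[start:end]` always recovers `mention`, which is a convenient sanity check when loading the data.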

`doc_title2sents.json`

Each redirect-resolved title is mapped to its description, split into sentences.

{
    "Furen E Lustario": [
        "Furen E Lustario (フレン・E・ルスタリオ) is a female Japanese Virtual YouTuber and member of Nijisanji.",
        "A female knight of the Corvus Empire.",
        "Introduction Video.",
        "Furen's introduction.",
        "Personality.",
        "Furen lacks a surprising amount of common sense.",
        "It has been displayed in at least two streams that she cannot tell from left to right.",
        ...
    ],
    "Ibrahim": [
        "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
        "A former oil tycoon from the Corvus Empire.",
        "Since the value of oil has fallen, he now makes a living from a hot spring that he accidentally dug up.",
        "History.",
        "Background.",
        "Ibrahim made his YouTube debut on 1 February 2020.",
        ...
    ],
    ...
}
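One possible use of this file in an entity linking pipeline is to take the first few sentences of each page as a short entity description. The sketch below assumes the file has exactly the title-to-sentence-list layout shown above; the function names and the `max_sents` parameter are illustrative assumptions, not part of this repository.

```python
import json
from typing import Dict, List


def load_title2sents(path: str) -> Dict[str, List[str]]:
    """Load doc_title2sents.json as a mapping from title to sentence list."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def entity_description(title2sents: Dict[str, List[str]],
                       title: str, max_sents: int = 2) -> str:
    """Join the first max_sents sentences of a page into one description.

    Returns an empty string when the title is not in the mapping.
    """
    return " ".join(title2sents.get(title, [])[:max_sents])
```

Keeping the description short is a design choice for this sketch: in the sample above, later sentences include section headings such as "History." that carry little descriptive content.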

WIP

Add an entity type for each entity to `doc_title2sents.json`.

Contact

izuna385(_atmark)gmail.com
PRs and issues are welcome!
