pioNER - named entity annotated datasets and GloVe models for the Armenian language

pioNER corpus provides gold-standard and automatically generated named-entity datasets for the Armenian language.

Alongside the datasets, we release 50-, 100-, 200-, and 300-dimensional GloVe word embeddings trained on a collection of Armenian texts from Wikipedia, news, blogs, and an encyclopedia.

Silver-standard dataset

The generated corpus is automatically extracted and annotated from Armenian Wikipedia. To create it, we used a modification of the approaches of Nothman et al. and of Sysoev and Andrianov. The method uses links between Wikipedia articles to extract fragments of named-entity annotated text.
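The link-based idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the `annotate` function, the `title_to_type` mapping (assumed here to already assign an entity class to each linked article), the whitespace tokenization, and the sample sentence are all hypothetical, and the sketch writes a `B-` prefix on the first token of every entity for simplicity.

```python
import re
from typing import Dict, List, Tuple

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext: str, title_to_type: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag link anchors with the entity type of the linked article.

    title_to_type is a hypothetical lookup from article title to PER/ORG/LOC;
    links to articles missing from it are tagged as outside any entity.
    """
    tagged: List[Tuple[str, str]] = []
    pos = 0
    for match in WIKILINK.finditer(wikitext):
        # Plain text before the link is outside any entity.
        tagged += [(tok, "O") for tok in wikitext[pos:match.start()].split()]
        target = match.group(1).strip()
        anchor = match.group(2) or match.group(1)
        ent_type = title_to_type.get(target)
        for i, tok in enumerate(anchor.split()):
            if ent_type is None:
                tagged.append((tok, "O"))
            else:
                tagged.append((tok, ("B-" if i == 0 else "I-") + ent_type))
        pos = match.end()
    tagged += [(tok, "O") for tok in wikitext[pos:].split()]
    return tagged


# Made-up example sentence and type mapping, for illustration only.
text = "[[Հովհաննես Թումանյան|Թումանյանը]] ծնվել է [[Դսեղ]] գյուղում"
types = {"Հովհաննես Թումանյան": "PER", "Դսեղ": "LOC"}
result = annotate(text, types)
for token, tag in result:
    print(token, tag)
```

How the entity class of each article is decided is the hard part of such approaches and is left out of this sketch entirely.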

The corpus is split into train and development sets.

Table 1. Statistics for pioNER train, development and test sets

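| dataset | #tokens | #sents | annotation | texts' source |
|---------|---------|--------|------------|---------------|
| train   | 130719  | 5964   | automatic  | Wikipedia     |
| dev     | 32528   | 1491   | automatic  | Wikipedia     |
| test    | 53606   | 2529   | manual     | iLur.am       |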

Gold-standard dataset

This dataset is a collection of over 250 news articles from iLur.am with manual named-entity annotation. It includes sentences from political, sports, local, and world news, and is comparable in size to the test sets of other languages (Table 2). We intend it to serve as a benchmark for future named entity recognition systems designed for the Armenian language.

The dataset contains annotations for 3 popular named entity classes: people (PER), organizations (ORG), and locations (LOC), and is released in CoNLL03 format with the IOB tagging scheme. During annotation, we generally relied on the categories and guidelines assembled by BBN Technologies for the TREC 2002 question answering track.

Tokens and sentences were segmented according to the UD standards for the Armenian language from the ArmTreebank project.
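As a rough illustration of working with CoNLL-style files, the sketch below parses lines into sentences of (token, tag) pairs. It assumes one token per line with whitespace-separated columns, the token first and the NER tag last, and a blank line between sentences; the exact column layout of the released files should be checked before relying on it. The `read_conll` helper and the sample lines are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences.

    Assumes the token is the first column and the NER tag is the last one;
    blank lines (and -DOCSTART- markers, if present) end a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:  # flush the last sentence if the input lacks a trailing blank line
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout, for illustration only.
sample = """Երևան I-LOC
է O

Հովհաննես I-PER
Թումանյան I-PER
""".splitlines()

parsed = read_conll(sample)
print(len(parsed), "sentences")
```

To read a real file, pass an open file object instead of `sample`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to wherever the dataset was saved.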

Table 2. Comparison of pioNER gold-standard test set with test sets for English, Russian, Spanish and German

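| test dataset            | #tokens | #LOC | #ORG | #PER |
|-------------------------|---------|------|------|------|
| Armenian pioNER         | 53606   | 1312 | 1338 | 1274 |
| Russian factRuEval-2016 | 59382   | 1239 | 1595 | 1353 |
| German CoNLL03          | 51943   | 1035 | 773  | 1195 |
| Spanish CoNLL02         | 51533   | 1084 | 1400 | 735  |
| English CoNLL03         | 46453   | 1668 | 1661 | 1671 |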
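Per-class entity counts like those in Table 2 can be derived from IOB tags. The sketch below is one way to do this, assuming tags of the form `B-X`, `I-X`, and `O`; the `count_entities` helper is hypothetical. It starts a new entity on any `B-` tag and also on an `I-` tag whose type differs from the previous token's, so it tolerates both the variant where `B-` only separates adjacent entities and the variant where every entity begins with `B-`. We have not checked that it reproduces the exact numbers in the table.

```python
from collections import Counter
from typing import Iterable, List


def count_entities(tag_sequences: Iterable[List[str]]) -> Counter:
    """Count entities per class over sentences given as lists of IOB tags."""
    counts: Counter = Counter()
    for tags in tag_sequences:
        prev_type = None  # entity type of the previous token, None if outside
        for tag in tags:
            if tag == "O":
                prev_type = None
                continue
            prefix, _, ent_type = tag.partition("-")
            # A new entity begins on B-, or when the type changes after O/another type.
            if prefix == "B" or ent_type != prev_type:
                counts[ent_type] += 1
            prev_type = ent_type
    return counts


# Made-up tag sequences, for illustration only.
example = [
    ["I-PER", "I-PER", "O", "I-LOC"],
    ["B-ORG", "I-ORG", "B-ORG", "O"],
]
entity_counts = count_entities(example)
print(dict(entity_counts))
```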

GloVe embeddings

We also publish GloVe word vector models trained on Armenian texts containing 79 million tokens. The training set included articles from Armenian Wikipedia, the Armenian Soviet Encyclopedia, a subcorpus of the Eastern Armenian National Corpus, and news articles from over a dozen Armenian news websites and blogs. The texts cover topics such as economics, politics, weather forecasts, IT, law, and society, and come from both fiction and non-fiction genres.

Similar to the original embeddings published for the English language, we release 50-, 100-, 200- and 300-dimensional word vectors for Armenian with a vocabulary size of 400000.

You can download GloVe models from here.
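Once downloaded, the vectors can be loaded with a few lines of Python. The sketch below assumes the models use the plain-text layout of the original GloVe releases (one word per line followed by its space-separated vector components); this is an assumption about the released files, and the `load_glove` and `cosine` helpers and the toy 3-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def load_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Read GloVe-style text lines into a word -> vector dictionary."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vectors in the assumed format, for illustration only.
toy_lines = [
    "տուն 0.1 0.2 0.3",
    "շուն -0.3 0.0 0.1",
]
vectors = load_glove(toy_lines)
print(cosine(vectors["տուն"], vectors["շուն"]))
```

For a real model, pass an open file instead of `toy_lines`, for example `load_glove(open(path, encoding="utf-8"))`, where `path` is the location of the downloaded vector file.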

For more details, refer to the paper.