Italian word embeddings

Data source

The source for the data is the Italian Wikipedia, downloaded from Wikipedia Dumps.

Preprocessing

The goal is to produce a single text file containing the content of the Wikipedia pages, tokenized so that tokens are separated by whitespace. Tokenization for word embeddings usually removes punctuation, but I also want embeddings for punctuation marks, so that no information carried by an input sentence is discarded. To produce this kind of input, and to keep the tokenization used to train the word embeddings aligned with the tokenization I use at runtime, I chose SpaCy for its power and speed. SpaCy already ships word embeddings of this kind for English.

Two types of preprocessing have been tried:

using spacy-dev-resources
using wikiextractor + SpaCy for tokenization
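As a rough illustration of the second approach, the sketch below turns one line of extracted text into a whitespace-tokenized line that keeps punctuation as separate tokens. The helper name to_whitespace_tokenized is hypothetical and is not taken from tokenization.py; the sketch only assumes an nlp object that exposes make_doc, such as a blank Italian SpaCy pipeline.

```python
def to_whitespace_tokenized(nlp, text: str) -> str:
    """Join the tokens of `text` with single spaces, keeping punctuation tokens.

    `nlp` is assumed to be a SpaCy Language object, for example spacy.blank("it").
    Only the tokenizer is run (make_doc), not the full pipeline.
    """
    doc = nlp.make_doc(text)
    # Whitespace-only tokens (newlines, repeated spaces) are dropped so that
    # the output contains exactly one space between consecutive tokens.
    return " ".join(token.text for token in doc if not token.is_space)


# Usage (requires SpaCy to be installed):
#   import spacy
#   nlp = spacy.blank("it")
#   print(to_whitespace_tokenized(nlp, "Roma è la capitale d'Italia."))
```

Using the same nlp object at training time and at runtime is what keeps the two tokenizations aligned.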

Training word embeddings

GloVe is used to produce a text file that contains:

number_of_vectors vector_length
WORD1 values_of_word_1
WORD2 values_of_word_2
...
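The layout above can be read with a few lines of plain Python. This is a minimal sketch, assuming fields are separated by single spaces and the first line holds the two header integers; the function name read_text_vectors and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def read_text_vectors(lines: Iterable[str]) -> Tuple[int, Dict[str, List[float]]]:
    """Parse the header line and the per-word rows of the text format."""
    it = iter(lines)
    # Header: number_of_vectors vector_length
    count, dim = (int(field) for field in next(it).split())
    vectors: Dict[str, List[float]] = {}
    for line in it:
        line = line.rstrip("\n")
        if not line:
            continue
        # The last `dim` fields are the values; whatever precedes them is the
        # token, which may itself be a punctuation mark such as ",".
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values in row: {line!r}")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header declares {count} vectors, found {len(vectors)}")
    return dim, vectors


# Made-up sample in the same layout (not real embedding values).
sample = ["3 2", "roma 0.1 0.2", ", 0.3 0.4", "italia 0.5 0.6"]
dim, vectors = read_text_vectors(sample)
print(dim, sorted(vectors))
```

For a real file, pass an open file handle instead of the sample list.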

Preparing SpaCy vectors

From the text-file representation of the word embeddings, a binary representation is built that is ready to be loaded into SpaCy.

The whole SpaCy model (a blank Italian nlp pipeline plus the word vectors) is saved and packaged using script number 3 (3_export_spacy.sh).

Using the model

Option 1: follow the preceding steps to train the vectors, then load them with nlp.vocab.vectors.from_disk('path').

Option 2: install the complete model from the latest release with pip, using the following command:

pip install -U https://github.com/MartinoMensio/it_vectors_wiki_spacy/releases/download/v1.0.1/it_vectors_wiki_lg-1.0.1.tar.gz

Then load the model in SpaCy with nlp = spacy.load('it_vectors_wiki_lg').

If you want to use the vectors in another environment (outside SpaCy) you can find the raw embeddings in the vectors-1.0 release which contains

Evaluation

The file questions-words-ITA.txt comes from http://hlt.isti.cnr.it/wordembeddings/ and was released as part of the paper:

@inproceedings{berardi2015word,
  title={Word Embeddings Go to Italy: A Comparison of Models and Training Datasets.},
  author={Berardi, Giacomo and Esuli, Andrea and Marcheggiani, Diego},
  booktitle={IIR},
  year={2015}
}

With this preprocessing and the newer Wikipedia dump, the script accuracy.py reports an accuracy of 58.14%, which appears to be an improvement over the scores reported in the paper.
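For readers unfamiliar with this kind of benchmark, the sketch below shows one common way to score analogy questions: vector offset plus cosine similarity. It is not the code of accuracy.py, whose exact method is not described here. It assumes each question line holds four words "a b c d", that lines starting with ":" are section headers, and that questions containing out-of-vocabulary words are skipped (a choice that affects the final score). The toy vectors are invented for the example.

```python
import math
from typing import Dict, Iterable, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def analogy_accuracy(question_lines: Iterable[str], vectors: Dict[str, Vector]) -> float:
    """Fraction of questions 'a b c d' where the nearest word to b - a + c is d."""
    correct = total = 0
    for line in question_lines:
        fields = line.split()
        # Skip blank lines and section headers (assumed to start with ':').
        if not fields or line.startswith(":"):
            continue
        # Assumption: questions with out-of-vocabulary words are skipped.
        if len(fields) != 4 or any(word not in vectors for word in fields):
            continue
        a, b, c, expected = fields
        target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
        # The three question words are excluded from the candidate answers.
        candidates = (word for word in vectors if word not in (a, b, c))
        predicted = max(candidates, key=lambda word: cosine(vectors[word], target))
        total += 1
        correct += predicted == expected
    return correct / total if total else 0.0


# Invented two-dimensional vectors, only to make the sketch executable.
toy_vectors = {
    "italia": [1.0, 0.0],
    "roma": [1.0, 1.0],
    "francia": [2.0, 0.0],
    "parigi": [2.0, 1.0],
    "pizza": [0.0, 3.0],
}
toy_questions = [
    ": capitals",
    "italia roma francia parigi",
    "francia parigi italia roma",
    "italia roma francia pizza",
    "italia roma spagna madrid",
]
accuracy = analogy_accuracy(toy_questions, toy_vectors)
print(f"{accuracy:.2%}")
```

Scores obtained this way are only comparable across models when the same vocabulary handling and the same question set are used.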

