This is a Java implementation of the paper Dependency-Based Word Embeddings (Levy and Goldberg, ACL 2014), together with some extensions.
The algorithm uses the skip-gram method and trains a shallow neural network; the input corpus is pre-processed with the Stanford Dependency Parser. For more background on word embedding techniques, please refer to the related literature online. Usage is shown in the examples.
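As a rough illustration of the training step (not this project's actual API), a single skip-gram update with negative sampling on one (word, context) pair can be sketched as follows; the dimensionality, learning rate, and class/method names are assumptions made for the example:

```java
// Minimal sketch of one skip-gram update with negative sampling on a (word, context) pair.
// Dimensionality, learning rate and names are assumptions for illustration only,
// not this project's actual API.
public final class SkipGramSketch {

    static final int DIM = 100;        // embedding dimensionality (assumed)
    static final double ALPHA = 0.025; // learning rate (assumed)

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /**
     * One gradient step on a word vector and a context vector.
     * label is 1 for the observed pair and 0 for a negatively sampled context.
     */
    static void update(double[] word, double[] context, int label) {
        double dot = 0.0;
        for (int i = 0; i < DIM; i++) dot += word[i] * context[i];
        double g = ALPHA * (label - sigmoid(dot)); // gradient scale
        for (int i = 0; i < DIM; i++) {
            double w = word[i];
            word[i]    += g * context[i];
            context[i] += g * w;
        }
    }
}
```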
- DL4J: GitHub page: [link]; Maven source: [link].
- ND4J: GitHub page: [link]; Maven source: [link].
- Stanford NLP: GitHub page: [link]; Maven sources: [link] (for Maven, import both the corenlp artifact and the corenlp artifact with the models classifier).
- Guava: Maven sources: [link].
The Word2Vecf project is a modification of the original Word2Vec proposed by Mikolov, allowing:
- performing multiple iterations over the data.
- the use of arbitrary context features.
- dumping the context vectors at the end of the process.
Unlike the original Word2Vec project, which can be run on raw text directly, Word2Vecf requires some pre-computation, since Word2Vecf DOES NOT handle vocabulary construction and DOES NOT read sentences or paragraphs as input directly.
The expected files are:
- word_vocabulary: file mapping words (strings) to their counts.
- context_vocabulary: file mapping contexts (strings) to their counts, used for constructing the sampling table for the negative training.
- training_data: textual file of word-context pairs. Each pair takes a separate line. The format of a pair is "word context", i.e. space-delimited, where word and context are strings. If some contexts should be preferred over others, the training data should be constructed so that it contains that bias. A sketch of producing these files follows below.
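For concreteness, here is a minimal sketch of how these three files could be produced from a list of (word, context) pairs; the file names match the descriptions above, while the whitespace-delimited output and the helper class itself are assumptions for illustration:

```java
// Sketch: build word_vocabulary, context_vocabulary and training_data
// from (word, context) pairs. The count format and this helper class are
// assumptions based on the file descriptions above.
import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PrecomputeFiles {

    public static void write(List<String[]> pairs) throws IOException {
        Map<String, Integer> wordCounts = new LinkedHashMap<>();
        Map<String, Integer> contextCounts = new LinkedHashMap<>();

        try (PrintWriter train = new PrintWriter("training_data")) {
            for (String[] pair : pairs) {
                String word = pair[0], context = pair[1];
                wordCounts.merge(word, 1, Integer::sum);
                contextCounts.merge(context, 1, Integer::sum);
                train.println(word + " " + context); // one space-delimited pair per line
            }
        }
        writeVocab("word_vocabulary", wordCounts);
        writeVocab("context_vocabulary", contextCounts);
    }

    private static void writeVocab(String file, Map<String, Integer> counts) throws IOException {
        try (PrintWriter out = new PrintWriter(file)) {
            counts.forEach((token, count) -> out.println(token + " " + count));
        }
    }
}
```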
To make the project easier to use, these pre-computations are implemented inside the project as well. Since this is a dependency-based word embedding model, the Stanford Dependency Parser is used; more usage information can be found on its website.
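As an illustration of this pre-processing step (not necessarily the exact scheme used inside the project), dependency-based word-context pairs can be extracted with Stanford CoreNLP along these lines; the annotator configuration and the relation-labelled context format are assumptions:

```java
// Sketch: extract dependency-based (word, context) pairs with Stanford CoreNLP.
// The annotator configuration and the context labelling scheme are assumptions,
// not necessarily what this project uses internally.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class DependencyPairs {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Australian scientist discovers star with telescope.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            SemanticGraph graph =
                sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            for (SemanticGraphEdge edge : graph.edgeIterable()) {
                String head = edge.getGovernor().word();
                String dep = edge.getDependent().word();
                String rel = edge.getRelation().toString();
                // the head takes the dependent as a context, and vice versa (inverse relation)
                System.out.println(head + " " + dep + "/" + rel);
                System.out.println(dep + " " + head + "/" + rel + "I");
            }
        }
    }
}
```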
- WordSim353: The WordSim353 set contains 353 word pairs. It was constructed by asking human subjects to rate the degree of semantic similarity or relatedness between two words on a numerical scale. Performance is measured by the Pearson correlation between the cosine similarity of the two word embeddings and the average score given by the participants. [pdf]
- TOEFL: The TOEFL set contains 80 multiple-choice synonym questions, each with 4 candidates. For example, the question word levied has the choices imposed (correct), believed, requested and correlated. The nearest neighbor of the question word among the candidates is chosen based on cosine similarity, and accuracy is used to measure performance. [pdf]
- Analogy: The analogy task has approximately 9K semantic and 10.5K syntactic analogy questions. The questions are similar to "man is to (woman) as king is to queen" or "predict is to (predicting) as dance is to dancing". Following previous work, the nearest neighbor of "queen − king + man" in the vocabulary is taken as the answer (a sketch of this computation follows this list), and accuracy is used to measure performance. This dataset is relatively large compared to the previous two sets; therefore, the results on this dataset are more stable than those on the previous two. [pdf]
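The evaluation machinery shared by these benchmarks boils down to cosine similarity and a nearest-neighbor lookup; a minimal sketch follows, where the Map<String, double[]> representation of the embeddings is an assumption made for the example:

```java
// Sketch of the shared evaluation machinery: cosine similarity and the
// nearest-neighbor lookup used for "queen - king + man"-style analogy questions.
// The Map<String, double[]> representation of the embeddings is an assumption.
import java.util.Map;

public class EmbeddingEval {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /**
     * Answer an analogy question a:b :: c:? by the nearest neighbor of (b - a + c),
     * excluding the three question words themselves.
     * Example: analogy(vectors, "king", "queen", "man") should return "woman".
     */
    static String analogy(Map<String, double[]> vectors, String a, String b, String c) {
        double[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
        double[] target = new double[va.length];
        for (int i = 0; i < va.length; i++) target[i] = vb[i] - va[i] + vc[i];

        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue;
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = w;
            }
        }
        return best;
    }
}
```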
- eikdk/Word2VecJava
- word2vec -- google sources, download
- Yoav Goldberg/word2vecf
- orenmel/lexsub
- GoogleNews-vectors-negative300.bin: pre-trained Google News corpus (3 billion running words) word vector model (3 million 300-dimensional English word vectors)