PhoW2V provides pre-trained Word2Vec syllable-level and word-level embeddings for Vietnamese. The embeddings were pre-trained on a 20GB corpus of Vietnamese texts and used in our EMNLP-2020 Findings paper "A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese":
@inproceedings{phow2v_vitext2sql,
title = {{A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese}},
author = {Anh Tuan Nguyen and Mai Hoang Dao and Dat Quoc Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year = {2020},
pages = {4079--4085}
}
| Pre-trained embeddings | Level | Embedding size | Download mirror |
|---|---|---|---|
| PhoW2V_syllables_100dims | Syllable-level | 100 | Mirror |
| PhoW2V_syllables_300dims | Syllable-level | 300 | Mirror |
| PhoW2V_words_100dims | Word-level | 100 | Mirror |
| PhoW2V_words_300dims | Word-level | 300 | Mirror |
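
A minimal sketch of reading one of the embedding files listed above, assuming the download is in the plain-text word2vec format (a header line with the vocabulary size and dimension, then one token per line followed by its vector). That file format is an assumption and is not confirmed in this README. The helper names `load_word2vec_text` and `cosine`, and the toy vectors, are hypothetical and only illustrate the idea.

```python
import io
import math


def load_word2vec_text(stream):
    """Parse word2vec text-format vectors from an open text stream.

    Assumed layout: a header line "<vocab_size> <dim>", then one line per
    token with the token followed by <dim> space-separated floats.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy stand-in for a downloaded file; the numbers are made up, not real
# PhoW2V values. Word-level tokens are assumed to join syllables with "_".
toy_file = io.StringIO(
    "3 4\n"
    "sinh_viên 0.1 0.2 0.3 0.4\n"
    "học_sinh 0.1 0.2 0.3 0.5\n"
    "xe 0.9 -0.1 0.0 0.2\n"
)
dim, vectors = load_word2vec_text(toy_file)
print(dim, len(vectors))
print(cosine(vectors["sinh_viên"], vectors["học_sinh"]))
```

For a real file, replace the `io.StringIO` object with a file opened using `encoding="utf-8"`. If the files are indeed in word2vec format, gensim's `KeyedVectors.load_word2vec_format` is likely a more convenient and memory-efficient choice than this pure-Python reader.
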
By downloading the PhoW2V embeddings, the user agrees:
- To use PhoW2V for research or educational purposes only.
- Not to distribute PhoW2V or part of PhoW2V in any original or modified form.
- To cite our EMNLP-2020 Findings paper above when PhoW2V is employed to help produce published results.

Note: Users should apply Vietnamese tone normalization to the data of downstream tasks, because this pre-processing step was also applied to the 20GB pre-training corpus of Vietnamese texts. A Python script for Vietnamese tone normalization is available HERE.
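
To illustrate what tone normalization involves, here is a rough sketch, not the script referenced above. It assumes normalization means (1) composing characters with Unicode NFC and (2) rewriting the tone-mark placement in the vowel pairs `oa`, `oe`, and `uy` to a single convention. The mapping `TONE_MAP`, the function `normalize_tone`, and the chosen direction (mark moved to the second vowel) are all assumptions for illustration. Use the authors' script for real data so that it matches the pre-training corpus.

```python
import unicodedata

# Illustrative subset only: the mapping and its direction are assumptions,
# not the authors' actual normalization table.
_BASE_MAP = {
    "òa": "oà", "óa": "oá", "ỏa": "oả", "õa": "oã", "ọa": "oạ",
    "òe": "oè", "óe": "oé", "ỏe": "oẻ", "õe": "oẽ", "ọe": "oẹ",
    "ùy": "uỳ", "úy": "uý", "ủy": "uỷ", "ũy": "uỹ", "ụy": "uỵ",
}

# Build lowercase, Capitalized, and UPPERCASE variants, all in NFC form.
TONE_MAP = {}
for old, new in _BASE_MAP.items():
    old = unicodedata.normalize("NFC", old)
    new = unicodedata.normalize("NFC", new)
    TONE_MAP[old] = new
    TONE_MAP[old.capitalize()] = new.capitalize()
    TONE_MAP[old.upper()] = new.upper()


def normalize_tone(text):
    """Compose characters (NFC), then unify tone-mark placement."""
    text = unicodedata.normalize("NFC", text)
    for old, new in TONE_MAP.items():
        text = text.replace(old, new)
    return text


print(normalize_tone("hòa bình"))
print(normalize_tone("Thủy"))
```

Whatever convention is chosen, the same function should be applied to every piece of text before embedding lookup, since a token written with a different tone placement would otherwise miss its vector.
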