Code for the paper titled "Revisiting the Role of Feature Engineering for Compound Type Identification in Sanskrit"
This code is adapted from this tutorial by Ruder.
The following software must be installed on your machine.
- Python 3.5
- TensorFlow 1.13.1
- numpy
- gensim
- pandas
- scikit-learn
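
As a quick sanity check before training, the following sketch reports which of the packages listed above cannot be found by the current interpreter. It is not part of the repository; the helper name `find_missing` is hypothetical, and the import names (for example `sklearn` for scikit-learn) are the usual ones for these packages. It does not verify version numbers.

```python
import importlib.util
import sys

# Import names of the packages listed above.
# Note: scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["tensorflow", "numpy", "gensim", "pandas", "sklearn"]


def find_missing(modules):
    """Return the module names that cannot be located by this interpreter."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # str.format is used instead of f-strings so the script runs on Python 3.5.
    print("Python version: {}".format(sys.version.split()[0]))
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages: {}".format(", ".join(missing)))
    else:
        print("All required packages were found.")
```

Because `find_spec` only locates a module without importing it, the check is fast, but it cannot detect an installation that is present and broken.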
- code : contains train.py; run this Python file to reproduce the results reported in the paper.
- data : contains the data required to run the code.
- model : the trained model is saved to this folder.
We provide only the implementation of our best-performing word embedding model, FastText. To train it, go to the code folder and run train.py:
python train.py
Description of the data files. We use the same transliteration scheme as Hellwig's.
File name | Description |
---|---|
train/test.csv | Dataset for the compound type classification task. |
compound_dic.pickle | Dictionary mapping for the compound classification dataset, used to look up word embedding vectors. |
Fast_text_features | Folder containing the FastText embeddings of the classification dataset. |
These features can be downloaded from here
Make sure these features are placed in the path data/fast_text_features
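There are four classes, each represented by an integer label: Avyayibhava (0), Bahuvrihi (1), Dvandva (2), Tatpurusha (3). Each row of the dataset gives the two components of a compound and its class label, for example: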
Index | Word1 | Word2 | Class |
---|---|---|---|
1 | xqDa | vikramaH | 1 |
2 | prawi | icCakaH | 0 |
3 | saMmAna | SuSrURA | 2 |
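
As a minimal sketch of how rows in this format could be read and decoded, the example below parses the three sample rows above with Python's standard `csv` module and maps each integer label back to its class name. The column names come from the table header. The comma delimiter, the presence of a header row, and the names `CLASS_NAMES` and `read_compounds` are assumptions made for illustration and are not taken from the repository.

```python
import csv
import io

# Integer-to-name mapping for the four compound classes described above.
CLASS_NAMES = {0: "Avyayibhava", 1: "Bahuvrihi", 2: "Dvandva", 3: "Tatpurusha"}

# The three sample rows from the table above, written as comma-separated
# text (assumed layout; the real files may differ).
SAMPLE_CSV = """Index,Word1,Word2,Class
1,xqDa,vikramaH,1
2,prawi,icCakaH,0
3,saMmAna,SuSrURA,2
"""


def read_compounds(handle):
    """Yield (word1, word2, class_name) tuples from a CSV file object."""
    for row in csv.DictReader(handle):
        yield row["Word1"], row["Word2"], CLASS_NAMES[int(row["Class"])]


rows = list(read_compounds(io.StringIO(SAMPLE_CSV)))
for word1, word2, label in rows:
    print("{} + {} -> {}".format(word1, word2, label))
```

To decode a real dataset file, the `io.StringIO` object would be replaced by an open file handle, provided the file follows the assumed layout.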
Corpus | No. of verses | No. of words |
---|---|---|
Vedabase | 13013 | 190343 |
DCS | 127376 | 3797593 |
wiki | 78K lines | 663521 |
Total | | 4651457 |