There are three steps to extract collocations:

- Preprocessing
  - Removing stopwords
  - Stemming and lemmatization
- Extracting the bigrams and trigrams
- Generating collocations with the following association measures (see the sketch after this list):
  - Raw frequency
  - PMI (pointwise mutual information)
  - T-test
  - Chi-square
  - Likelihood ratio
  - Poisson-Stirling
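
These measures map directly onto NLTK's association measures, so a minimal sketch of this step could look like the following, assuming NLTK is the underlying library; the token list is a placeholder for the preprocessed corpus, and `TrigramCollocationFinder` works the same way for trigrams:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Placeholder tokens; in the real pipeline these come from the preprocessing step.
tokens = "the supreme court ruled that the supreme court had jurisdiction".split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore bigrams that occur fewer than 2 times

# Each association measure listed above is available on the measures object.
for name, score_fn in [
    ("raw frequency", measures.raw_freq),
    ("PMI", measures.pmi),
    ("t-test", measures.student_t),
    ("chi-square", measures.chi_sq),
    ("likelihood ratio", measures.likelihood_ratio),
    ("Poisson-Stirling", measures.poisson_stirling),
]:
    print(name, finder.nbest(score_fn, 3))  # top 3 bigrams per measure
```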
There are three steps to build the classifier:

- Preprocessing
  - Removing stopwords
  - Tokenization
  - Stemming
- Classifier
- Evaluation
In more detail, the preprocessing step works as follows (a minimal sketch is shown after the list):

- Dataset words are converted to lowercase
- Punctuation marks are removed from the dataset words
- The dataset words are tokenized, with filtering options such as stopword removal and regex patterns
- Stemming is applied to the dataset words
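
A minimal sketch of these steps, assuming NLTK for the Turkish stopword list and the `snowballstemmer` package for Turkish stemming; both library choices are assumptions, not necessarily what the project uses:

```python
import re
import string

import nltk
import snowballstemmer
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)        # fetch NLTK's stopword lists once
stop_words = set(stopwords.words("turkish"))  # NLTK ships a Turkish stopword list
stemmer = snowballstemmer.stemmer("turkish")  # Snowball provides a Turkish stemmer

def preprocess(text):
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = re.findall(r"\w+", text)                                  # regex tokenization
    tokens = [t for t in tokens if t not in stop_words]                # remove stopwords
    return stemmer.stemWords(tokens)                                   # stemming

print(preprocess("Mahkeme, sanığın cezalandırılmasına karar verdi."))
```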
The dataset is then prepared and written to disk (a sketch of this round-trip follows the list):

- Files from the dataset are read
- Labels are created according to the dataset
- A data.json file is created that holds the chosen labels
- Write data into a CSV file
  - If no previously created training set is available, the CSV file is created
- Read the train set from CSV
  - If a previously created CSV file is available, it is read, and the "Suç" (crime) and "İçtihat" (case law) fields are prepared as lists
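
A sketch of the JSON-to-CSV round-trip, assuming pandas; data.json and the "Suç"/"İçtihat" fields come from the description above, while the CSV file name and the exact schema (a list of records) are assumptions:

```python
import json
import os

import pandas as pd

CSV_PATH = "train_set.csv"  # hypothetical file name

if not os.path.exists(CSV_PATH):
    # No training set has been created before: build the CSV from data.json,
    # assumed here to be a list of {"Suç": ..., "İçtihat": ...} records.
    with open("data.json", encoding="utf-8") as f:
        records = json.load(f)
    pd.DataFrame(records).to_csv(CSV_PATH, index=False)

# A previously created CSV is available: read it and prepare the columns as lists.
df = pd.read_csv(CSV_PATH)
crimes = df["Suç"].tolist()
precedents = df["İçtihat"].tolist()
```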
Next, the data is split and vectorized (sketch below):

- Split the dataset
  - The common 80/20 approach is used: 80% of the data for the training set and 20% for the test set
- Vectorize
  - A TF-IDF matrix is built from the text; the vectorizer is fit on the training set and then applied to the test set
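
A sketch of the split and vectorization with scikit-learn (an assumed dependency); the texts and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Placeholder data; in the pipeline these are the "İçtihat" texts and "Suç" labels.
texts = ["case text one", "case text two", "case text three",
         "case text four", "case text five"]
labels = ["theft", "fraud", "theft", "fraud", "theft"]

# 80% of the data for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Fit TF-IDF on the training split only, then apply it to the test split.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
```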
The classifier step trains and compares the following models (a sketch of the scikit-learn models follows the list):

- Support Vector Machines (specifically, a linear SVM)
- Multinomial Naive Bayes
- Logistic Regression
- FastText
- LSTM
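
Continuing the sketch above, the three scikit-learn models can be trained and evaluated on the TF-IDF features like this; FastText has its own training API (shown after the label-preparation list below), and the LSTM is omitted here:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

models = {
    "linear SVM": LinearSVC(),
    "multinomial naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)  # TF-IDF features from the sketch above
    predictions = model.predict(X_test_tfidf)
    print(name, accuracy_score(y_test, predictions))
```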
For FastText, the labels are prepared separately (sketch below):

- The dataset is taken from the previous iteration
- Labels are created according to the dataset
- Label names are concatenated with underscores to prevent ambiguity, so multi-word names become a single token
- The `__label__` tag is added to the labels for model creation
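
A sketch of this label format with the official `fasttext` bindings (an assumed dependency); the label names, texts, and file name are placeholders:

```python
import fasttext  # official fasttext Python bindings

def to_fasttext_line(label, text):
    # Multi-word label names are joined with underscores to prevent ambiguity,
    # then prefixed with the __label__ tag that FastText's supervised mode expects.
    return "__label__" + label.replace(" ", "_") + " " + text

# Hypothetical file name and placeholder rows.
with open("fasttext_train.txt", "w", encoding="utf-8") as f:
    f.write(to_fasttext_line("some crime", "sample case text") + "\n")
    f.write(to_fasttext_line("another crime", "another case text") + "\n")

model = fasttext.train_supervised(input="fasttext_train.txt")
print(model.predict("a new case text"))
```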
Thanks goes to these wonderful people (emoji key):

- Anıl Şenay
- Bilgehan Geçici
- Kürşat Açıkgöz
- Beyza
- Ahmet Önkol
- Ahmet Elburuz Gürbüz
This project follows the all-contributors specification. Contributions of any kind welcome!