Aiexercise2

This is a 3rd-year project for the Artificial Intelligence course at Athens University of Economics and Business (AUEB). The task is to implement the ID3 and Naive Bayes algorithms and to train and test them on the following dataset: https://ai.stanford.edu/~amaas/data/sentiment/.

You can find info about the dataset here: https://keras.io/api/datasets/imdb/

The first step is to find the most commonly used words in the whole dataset of reviews and then filter them using entropy. After filtering, we create a dictionary that contains the most negatively or positively charged words and use it to determine whether a review is negative or positive (we do not consider neutral reviews).
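The filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each review is already a set of lowercase words, that labels are 1 (positive) and 0 (negative), and that "filtering with entropy" means ranking words by information gain. The names `entropy`, `information_gain` and `select_keywords` are hypothetical.

```python
import math
from collections import Counter


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` positive and `neg` negative reviews."""
    total = pos + neg
    if total == 0:
        return 0.0
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(reviews, labels, word):
    """Entropy reduction obtained by splitting the reviews on the presence of `word`."""
    n = len(labels)
    pos_total = sum(labels)
    neg_total = n - pos_total
    pos_with = sum(1 for r, y in zip(reviews, labels) if word in r and y == 1)
    neg_with = sum(1 for r, y in zip(reviews, labels) if word in r and y == 0)
    n_with = pos_with + neg_with
    n_without = n - n_with
    remainder = (n_with / n) * entropy(pos_with, neg_with) \
        + (n_without / n) * entropy(pos_total - pos_with, neg_total - neg_with)
    return entropy(pos_total, neg_total) - remainder


def select_keywords(reviews, labels, n_common, n_keywords):
    """Keep the n_common most frequent words, then the n_keywords with the highest gain."""
    doc_freq = Counter(w for r in reviews for w in r)  # reviews are sets of words
    common = [w for w, _ in doc_freq.most_common(n_common)]
    ranked = sorted(common, key=lambda w: information_gain(reviews, labels, w), reverse=True)
    return ranked[:n_keywords]
```

Words that appear equally often in positive and negative reviews get a gain near 0 and drop out, while words that appear almost exclusively in one class rank highest.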

The algorithms read a 0-1 vector for each review: 0 means that the specific dictionary word was not present in the review and 1 means that it was. In the Naive Bayes algorithm we always use Laplace estimators.
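To make the vector representation and the Laplace estimator concrete, here is a small sketch of a Bernoulli-style Naive Bayes classifier. It is an assumption about how such a classifier could look, not the repository's implementation: the function names are hypothetical, and applying add-one smoothing to the class priors as well is a choice made only for this example.

```python
import math


def to_vector(review_words, dictionary):
    """Return the 0-1 vector of a review: 1 where the dictionary word occurs."""
    present = set(review_words)
    return [1 if word in present else 0 for word in dictionary]


def train_naive_bayes(vectors, labels):
    """Estimate priors and per-word probabilities with Laplace (add-one) smoothing."""
    n_features = len(vectors[0])
    model = {}
    for c in (0, 1):  # 0 = negative, 1 = positive (assumed encoding)
        rows = [v for v, y in zip(vectors, labels) if y == c]
        # Smoothed prior (an assumption of this sketch): (class size + 1) / (N + 2)
        prior = (len(rows) + 1) / (len(vectors) + 2)
        # P(word present | class) = (count + 1) / (class size + 2)
        probs = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                 for i in range(n_features)]
        model[c] = (prior, probs)
    return model


def predict(model, vector):
    """Return the class with the highest log posterior for a 0-1 vector."""
    scores = {}
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for x, p in zip(vector, probs):
            score += math.log(p if x else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)
```

Because of the smoothing, no estimated probability is ever exactly 0 or 1, so the logarithms are always defined even for words never seen in one of the classes.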

The algorithms run through two scripts: execute.py and classify.py. execute.py builds txt files containing the binary vectors and takes 4 command line arguments. The first, x, is the percentage of the data used to train the algorithm; the second, z, is the method used to approximate the logs (exact logs require a lot of computational power), which speeds up data gathering; and the last is the number of keywords that the dictionary will contain. classify.py takes the same arguments, except that the last one can be a dummy parameter. In classify.py you can do 3 things:

1.) Read a text file containing a review and see whether it is positive or negative

2.) Type the review yourself

3.) Read an existing binary vector file

The if statement inside the for loops in execute.py determines which type of test is run. We want to gather many different metrics, such as the accuracy on the training data when the algorithms are also trained on the training data, which is why it is there.

How we approximate the logs

1.) We have two categories p, q (negative and positive reviews). This is the entropy formula:

H = -p log2(p) - q log2(q)

2.) Using change of base:

We end up with this:

Because we are computing entropies, the values in the above equation are always between 0 and 1. We know:

For values between 0 and 1:

3.) We can conclude:

4.) The entropy becomes:

5.) We can change the fractions to:
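As a reference point for the derivation above, the exact two-class entropy can be computed with natural logarithms and a change of base. This sketch is not the approximation used in the project; it only shows the exact value that any approximation of the logs would be compared against. The function name `binary_entropy` and the choice of natural logarithms as the target base are assumptions.

```python
import math


def binary_entropy(p):
    """Exact entropy in bits for class proportions p and q = 1 - p."""
    q = 1.0 - p
    total = 0.0
    for x in (p, q):
        if x > 0.0:  # 0 * log(0) is treated as 0
            total -= x * math.log(x)
    return total / math.log(2)  # change of base: log2(x) = ln(x) / ln(2)
```

The result is 0 when all reviews belong to one category and reaches its maximum of 1 bit when the two categories are equally frequent.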

Below are different metrics (precision, accuracy, etc.) for each algorithm.

We also run different types of tests, such as accuracy on the training data and accuracy on the testing data.
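For reference, the metrics named above can be computed from true and predicted labels as in the following sketch. It assumes labels are 1 (positive) and 0 (negative); the function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Calling it once with predictions on the training data and once with predictions on the testing data gives the two kinds of test mentioned above.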

The peaks below are due to errors in the data:

Below is a comparison of accuracy between different implementations from different teams

The other team members were: Georgios E. Syros, Anastasios Toumazatos, Evgenios Gkritsis.

The commit history is a bit messy due to the large number of files we needed to upload.

FotiosBistas/ID3-NaiveBayes
