FotiosBistas/ID3-NaiveBayes

Aiexercise2

This is a third-year project for the Artificial Intelligence course at Athens University of Economics and Business (AUEB). The task is to implement the ID3 and Naive Bayes algorithms and to train and test them on the following dataset: https://ai.stanford.edu/~amaas/data/sentiment/.

You can find info about the dataset here: https://keras.io/api/datasets/imdb/

The first step is to find the most commonly used words across the whole dataset of reviews and then filter them using entropy. After filtering, we build a dictionary containing the most negatively or positively charged words and use it to determine whether a review is negative or positive (we don't care about neutral reviews in the dataset).
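The filtering step can be sketched roughly as follows. This is a minimal illustration, assuming that "filtering with entropy" means ranking candidate words by information gain (the reduction in entropy when reviews are split on the presence of a word). The names `entropy`, `information_gain` and `select_keywords`, and the toy reviews, are hypothetical and not taken from the repository.

```python
import math
from collections import Counter

def entropy(pos, neg):
    """Entropy (in bits) of a set with `pos` positive and `neg` negative reviews."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(reviews, labels, word):
    """Entropy reduction obtained by splitting the reviews on presence of `word`."""
    with_word = [y for r, y in zip(reviews, labels) if word in r]
    without = [y for r, y in zip(reviews, labels) if word not in r]
    base = entropy(sum(labels), len(labels) - sum(labels))
    remainder = 0.0
    for subset in (with_word, without):
        if subset:
            weight = len(subset) / len(labels)
            remainder += weight * entropy(sum(subset), len(subset) - sum(subset))
    return base - remainder

def select_keywords(reviews, labels, n_common, n_keywords):
    """Take the n_common most frequent words, keep the n_keywords with the highest gain."""
    doc_freq = Counter(w for r in reviews for w in r)  # each review is a set of words
    candidates = [w for w, _ in doc_freq.most_common(n_common)]
    ranked = sorted(candidates,
                    key=lambda w: information_gain(reviews, labels, w),
                    reverse=True)
    return ranked[:n_keywords]

# Toy data (assumption): 1 = positive review, 0 = negative review.
reviews = [{"great", "movie"}, {"bad", "movie"}, {"great", "fun"}, {"bad", "boring"}]
labels = [1, 0, 1, 0]
keywords = select_keywords(reviews, labels, n_common=5, n_keywords=2)
print(keywords)
```

In this toy example "movie" appears equally often in both classes, so its gain is zero and it is dropped, while "great" and "bad" separate the classes perfectly and are kept.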

The algorithms read a 0-1 vector for each review: 0 means that the specific dictionary word was not present in the review and 1 means that it was. In the Naive Bayes algorithm we always use Laplace estimators.
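As a rough sketch of how the 0-1 vectors and the Laplace estimators fit together, the example below trains a small Naive Bayes classifier on binary vectors. It assumes add-one smoothing on both the class priors and the per-word probabilities. The names `to_vector`, `train_naive_bayes` and `predict`, and the toy data, are hypothetical and are not the repository's actual code.

```python
import math

def to_vector(review_words, dictionary):
    """1 if the dictionary word occurs in the review, else 0."""
    return [1 if w in review_words else 0 for w in dictionary]

def train_naive_bayes(vectors, labels):
    """Estimate P(class) and P(word present | class) with Laplace (add-one) smoothing."""
    n_features = len(vectors[0])
    model = {}
    for c in (0, 1):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        # Add 1 to each count and 2 to the denominator (each feature is 0 or 1).
        present = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                   for j in range(n_features)]
        prior = (len(rows) + 1) / (len(vectors) + 2)
        model[c] = (prior, present)
    return model

def predict(model, vector):
    """Return the class (0 = negative, 1 = positive) with the larger log-posterior."""
    scores = {}
    for c, (prior, present) in model.items():
        score = math.log(prior)
        for x, p in zip(vector, present):
            score += math.log(p if x else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data (assumption): a two-word dictionary and four labelled reviews.
dictionary = ["great", "bad"]
train_reviews = [{"great", "movie"}, {"bad", "movie"}, {"great"}, {"bad", "boring"}]
train_labels = [1, 0, 1, 0]
train_vectors = [to_vector(r, dictionary) for r in train_reviews]
model = train_naive_bayes(train_vectors, train_labels)
prediction = predict(model, to_vector({"great", "acting"}, dictionary))
print(prediction)
```

Because of the smoothing, no probability is ever exactly 0 or 1, so a word that never appeared in one class during training cannot force the log-posterior to minus infinity.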

The algorithms are run through two scripts: execute.py and classify.py. execute.py builds txt files containing the binary vectors and expects 4 command line arguments. The first, x, is the percentage of the data used to train the algorithm; the second, z, is the method used to approximate logarithms (which are computationally expensive) in order to speed up data gathering; and the last is the number of keywords the dictionary will contain. classify.py uses the three above, except that the last one can be a dummy parameter. In classify.py you can do 3 things:

1.) Read a text file containing a review and see whether it is positive or negative

2.) Type the review yourself

3.) Read an existing binary vector file

The if statement inside the for loops in execute.py determines which type of test is run. We want to gather many different metrics, such as the accuracy on the training data when the algorithms are trained on the training data, which is why it is structured this way.

How we approximate the logs

1.) We have two categories p, q (negative and positive reviews). This is the entropy formula:

image

2.) Using change of base:

image

We end up with this:

image

Because we are computing entropies, the values in the above equation are always between 0 and 1. We know:

image

For values between 0 and 1:

image

3.) We can conclude:

image

4.) The entropy becomes:

image

5.) We can change the fractions to:

image
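The starting point of this derivation can be checked numerically. The sketch below assumes the standard two-class entropy with q = 1 - p and only illustrates steps 1 and 2 (the base-2 formula and its change of base to natural logarithms), using exact logarithms from Python's `math` module. The function names are hypothetical, and the approximation from the later steps is not reproduced here.

```python
import math

def entropy_log2(p):
    """Two-class entropy written directly with base-2 logarithms."""
    q = 1.0 - p
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)

def entropy_change_of_base(p):
    """Same entropy using natural logarithms: log2(x) = ln(x) / ln(2)."""
    q = 1.0 - p
    return -sum(x * math.log(x) for x in (p, q) if x > 0) / math.log(2)

for p in (0.1, 0.25, 0.5, 0.9):
    # Both forms agree up to floating point rounding.
    print(p, entropy_log2(p), entropy_change_of_base(p))
```

Both functions skip terms whose probability is 0, following the usual convention that 0 * log(0) counts as 0.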

Below are different metrics, such as precision and accuracy, for each algorithm.

We also run different types of tests, such as accuracy on training data and accuracy on testing data.
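For reference, metrics of this kind can be computed from predicted and true labels roughly as shown below. This is a minimal sketch: the function name `evaluate` and the toy labels are hypothetical, and recall and F1 are included on the assumption that they are among the metrics covered by "etc.".

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (assumption), not results from the project.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
scores = evaluate(y_true, y_pred)
print(scores)
```

Running the same function once on predictions for the training data and once on predictions for the testing data gives the two kinds of accuracy mentioned above.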

image

The peaks below are due to errors in the data: image

image

image

image

image

Below is a comparison of accuracy between different implementations from different teams

image

The commit history is a bit messy due to the large number of files we needed to upload.