First, we use the pandas library to form our dataframes:
```python
train_df = pd.read_csv(BOOKS_TRAIN_PATH)
test_df = pd.read_csv(BOOKS_TEST_PATH)
```
Then, we preprocess the dataframes using pandas' built-in functions along with hazm's `normalize` and `tokenize` functions to convert our text into an array of words:
```python
def normalize_text(text):
    text = norm.normalize(text)
    return tok.tokenize(text)

def preprocess_df(dataframe):
    dataframe.apply(normalize_row, axis=1)
```
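The `norm`, `tok`, and `normalize_row` names used above are not defined in the snippet. A minimal sketch of how they could be set up, assuming hazm's `Normalizer` and `WordTokenizer` and the same row-mutating pattern used by `filter_row` and `lemmatize_row` later on:

```python
from hazm import Normalizer, WordTokenizer

norm = Normalizer()
tok = WordTokenizer()

# Assumed helper: apply normalize_text to both text columns of a row.
def normalize_row(row):
    row.title = normalize_text(row.title)
    row.description = normalize_text(row.description)
    return row
```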
We form our BoW, which is a 6 x num_of_words dict where `bow[c][w]` is the number of times word `w` appeared in books with category `c`:
```python
def create_bow(dataframe):  # wrapper name assumed; the body is the original loop
    bow = dict()
    for c in CATEGORIES:
        bow[c] = dict()
    for _, book in dataframe.iterrows():
        for word in book.title:
            if word not in bow[CATEGORIES[0]]:
                for c in CATEGORIES:
                    bow[c][word] = 0
            bow[book.categories][word] += 1
        for word in book.description:
            if word not in bow[CATEGORIES[0]]:
                for c in CATEGORIES:
                    bow[c][word] = 0
            bow[book.categories][word] += 1
    return bow
```
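Hypothetical usage on the preprocessed training set (the `create_bow` name and this call sequence are assumptions):

```python
preprocess_df(train_df)
bow = create_bow(train_df)
print(len(bow[CATEGORIES[0]]))  # vocabulary size; the write-up later mentions ~26k words
```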
I defined three functions here:

- `prob_word_if_cat(bow, word, category, dot_product)`: probability of seeing `word` given `category`. If `word` is not present in `bow`, or has a zero count for that `category`, the additive smoothing rule is applied (with `alpha = 1`).
- `prob_cat_if_book(bow, book, category, dot_product)`: probability of `book` belonging to `category`, calculated using Bayes' theorem.
- `predict_cat(test_df, bow)`: runs a loop over all books in `test_df`, calculates the probability of each book being in each category, and chooses the category with the maximum probability as the answer.
Note: `dot_product` is (as you might have guessed) the dot product of `bow` and a `1 x n` matrix full of ones, i.e. the total number of word occurrences counted for each category.
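A minimal sketch of how that per-category total could be computed from the dict-of-dicts `bow` (building it this way is an assumption):

```python
# dot_product[c] == bow[c] · [1, 1, ..., 1] == total word count for category c
dot_product = {c: sum(bow[c].values()) for c in CATEGORIES}
```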
Another note: since `P(C)` is 1/6 for every category (the number of books in each category is equal), we can drop it from our summation. Also, we can return the summation of log-probabilities directly in function #2, because `e^x` and `x` are both increasing functions, so comparing the sums of logs picks the same category as comparing the products.
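The signatures below come from the list above; the bodies are only a sketch of the log-sum comparison described in these notes, not the project's exact implementation:

```python
import math

def prob_cat_if_book(bow, book, category, dot_product):
    # Naive Bayes in log space: sum log P(word | category) over the book's words.
    # The uniform prior P(category) = 1/6 is dropped, as noted above.
    log_prob = 0.0
    for word in list(book.title) + list(book.description):
        log_prob += math.log(prob_word_if_cat(bow, word, category, dot_product))
    return log_prob

def predict_cat(test_df, bow):
    dot_product = {c: sum(bow[c].values()) for c in CATEGORIES}  # as sketched above
    predictions = []
    for _, book in test_df.iterrows():
        # choose the category with the highest log-probability sum
        predictions.append(max(CATEGORIES, key=lambda c: prob_cat_if_book(bow, book, c, dot_product)))
    return predictions
```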
First of all, the program is slow (I mean, really, really slow). It takes around 20 seconds to run (and this is close to the most accurate configuration, with all optimizations on):
```
❯ python3 src/main.py
Reading CSV: 0.1304283059998852
Preprocessing: 11.857562787000006
Creating BoW: 3.8803586969997923
Prediction: 0.44412514200030273
Accuracy: 82.66666666666667%
```
- We can remove stop words (conjunctions, numeric words, and hazm's `stopwords_list`) from our BoW:

```python
import re

def filter_row(row):
    row.title = list(filter(is_important, row.title))
    row.description = list(filter(is_important, row.description))
    return row

def is_important(word):
    if re.search(r'\d', word):
        return False
    if word in CONJUCTIONS:
        return False
    if word in STOP_WORDS:
        return False
    return True
```
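`STOP_WORDS` and `CONJUCTIONS` are not shown; a plausible setup, assuming `STOP_WORDS` wraps hazm's `stopwords_list()` as described and `CONJUCTIONS` is a small hand-made set (the example words are illustrative):

```python
from hazm import stopwords_list

STOP_WORDS = set(stopwords_list())       # hazm's built-in Persian stop-word list
CONJUCTIONS = {"و", "یا", "اما", "ولی"}  # "and", "or", "but", "but" (illustrative subset)
```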
- We can also lemmatize our words using hazm's `Lemmatizer`:

```python
def lemmatize_row(row):
    row.title = list(map(clean_word, row.title))
    row.description = list(map(clean_word, row.description))
    return row

def clean_word(word):
    word = lem.lemmatize(word).split("#")[-1]
    return word

def preprocess_df(dataframe):
    dataframe.apply(normalize_row, axis=1)
    dataframe.apply(filter_row, axis=1)
    dataframe.apply(lemmatize_row, axis=1)
```
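`lem` is presumably an instance of hazm's `Lemmatizer`; for verbs it returns a lemma in `past_stem#present_stem` form, which is why `clean_word` keeps only the part after the last `#`. A small assumed setup and usage:

```python
from hazm import Lemmatizer

lem = Lemmatizer()

# For a verb form the lemma looks like "رفت#رو", so clean_word would keep "رو";
# for nouns the lemma has no "#" and passes through the split unchanged.
print(clean_word("می‌روم"))   # expected "رو" (assuming hazm lemmatizes it to "رفت#رو")
print(clean_word("کتاب‌ها"))  # expected "کتاب"
```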
Now we have an array of the root of each Persian word, so words derived from the same root but with different surface forms become identical and are easier to process.
- Let's make a guess: words in the title are more important than words in the description. So what if we give them weights?
```python
for word in book.title:
    if word not in bow[CATEGORIES[0]]:
        for c in CATEGORIES:
            bow[c][word] = 0
    bow[book.categories][word] += WEIGHT
for word in book.description:
    if word not in bow[CATEGORIES[0]]:
        for c in CATEGORIES:
            bow[c][word] = 0
    bow[book.categories][word] += 1
```
| WEIGHT | 1 | 5 | 10 | 100 |
|---|---|---|---|---|
| Accuracy | 81.7% | 82.6% | 82.6% | 79.1% |
Yay! It seems like `WEIGHT = 5` is a good one.
- Using additive smoothing:
```python
def prob_word_if_cat(bow, word, category, dot_product):
    if word in bow[CATEGORIES[0]]:
        n_w = bow[category][word]
        if n_w == 0:
            return ALPHA / (dot_product[category] + ALPHA * len(bow[CATEGORIES[0]]))
        return n_w / dot_product[category]
    else:
        return ALPHA / (dot_product[category] + ALPHA * len(bow[CATEGORIES[0]]))
```
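For reference, the fallback branches above return the additive-smoothing estimate for a zero (or unseen) count, where $N_c$ is `dot_product[category]` and $|V|$ is the vocabulary size `len(bow[CATEGORIES[0]])`:

$$P(w \mid c) = \frac{n_w + \alpha}{N_c + \alpha\,|V|} \quad\xrightarrow{\;n_w = 0\;}\quad \frac{\alpha}{N_c + \alpha\,|V|}$$

Words with a nonzero count keep the plain ratio $n_w / N_c$ in this implementation.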
Now let's see the result with different `ALPHA` values:
| ALPHA | 0.01 | 0.1 | 1 | 10 | 100 |
|---|---|---|---|---|---|
| Accuracy | 76.2% | 78.2% | 82.6% | 79.3% | 78.2% |
| (time, accuracy) | Removing Stop Words | Keeping Stop Words |
|---|---|---|
| Lemmatize | 16s, 82.6% | 20s, 71.5% |
| No Lemmatize | 15s, 78.8% | 16s, 73.3% |
| - | Removing Stop Words | Keeping Stop Words |
|---|---|---|
| Lemmatize | 16s, 4.2% | 18s, 7.5% |
| No Lemmatize | 15s, 0.8% | 16s, 1.3% |
First of all, our data is very limited: only ~26k words are present in the BoW, and that is all we have. When a new word appears in the test set, ignoring it wouldn't be logical. The additive smoothing technique lets us calculate a prediction in that situation, and it also smooths the probability distribution.
| Guessed → | مدیریت و کسب و کار (Business & Management) | رمان (Novel) | کلیات اسلام (Islam, General) | داستان کودک و نوجوانان (Children & YA Fiction) | جامعه‌شناسی (Sociology) | داستان کوتاه (Short Stories) | Accuracy |
|---|---|---|---|---|---|---|---|
| مدیریت و کسب و کار | 69 | 0 | 0 | 0 | 6 | 0 | 92% |
| رمان | 1 | 53 | 1 | 2 | 1 | 17 | 70% |
| کلیات اسلام | 0 | 0 | 62 | 3 | 9 | 1 | 82% |
| داستان کودک و نوجوانان | 1 | 4 | 2 | 64 | 0 | 4 | 85% |
| جامعه‌شناسی | 5 | 1 | 2 | 1 | 65 | 1 | 86% |
| داستان کوتاه | 1 | 8 | 1 | 6 | 1 | 58 | 77% |
| Total Accuracy | - | - | - | - | - | - | 82.6% |