Natural Language Processing via Deep Learning

This repository contains the course materials for "Introduction to Natural Language Processing" at Galatasaray University.

Slides

First week

Project - Classification of books by genres

In this NLP project we use neural networks to predict the genre of a book from its summary. We work with a book dataset downloaded from https://www.kaggle.com/datasets/athu1105/book-genre-prediction

DATA
number of lines	4657
number of columns	4

This project comes as several versions of the same notebook. The versioning is purely pedagogical: we improve the deep learning model step by step. In more detail, we have:

Notebook v1

This is the very first version, with almost no preprocessing. Word embedding: GloVe 50d.
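
For orientation, here is a minimal sketch of how pretrained GloVe vectors are commonly read into a word-to-vector dictionary. It assumes the usual GloVe text format (one token per line, followed by its space-separated coefficients). The function name `load_glove` and the sample lines are made up for illustration; the notebook may load the file differently.

```python
from typing import Dict, Iterable, List


def load_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines into a {word: vector} dictionary."""
    embeddings_index: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, coefs = parts[0], parts[1:]
        embeddings_index[word] = [float(x) for x in coefs]
    return embeddings_index


# Illustrative 3-dimensional sample (assumption); a 50d file has 50 values per line.
sample = [
    "book 0.1 0.2 0.3",
    "novel 0.2 0.1 0.4",
]
glove = load_glove(sample)
print(len(glove), len(glove["book"]))
```

With a real file, an open file handle can be passed in place of the list, since the function only needs an iterable of lines.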

Notebook v2

+ The notebook is more readable with the added comments
+ Section "3. Preprocessing the textual data"
+ Removal of stop-words
+ We try stemming
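
As a rough illustration of the stop-word removal step listed above, the sketch below lowercases a summary, splits it into alphabetic tokens, and drops common function words. The stop-word set and the function name `remove_stop_words` are assumptions made for this example; the notebook most likely relies on a library-provided list.

```python
import re
from typing import List

# Tiny illustrative stop-word list; an assumption, not the list used in the notebook.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to", "his"}


def remove_stop_words(text: str) -> List[str]:
    """Lowercase the text, split it into alphabetic tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(remove_stop_words("The detective is leaving the city in a hurry"))
# prints ['detective', 'leaving', 'city', 'hurry']
```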

Although we see a slight improvement in training accuracy, the result is still far from good. Notice that, because of stemming, many words are no longer represented in the transferred pretrained model, GloVe:

Converted 12964 words (7036 misses)

In the previous version we had 1822 misses. This suggests why applying stemming is not a good idea here.
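
A "Converted ... words (... misses)" summary like the one above is typically produced while filling an embedding matrix: each vocabulary word is looked up in the pretrained vectors, and every word that is not found counts as a miss. The following is only a sketch of that bookkeeping; `build_embedding_matrix`, `word_index`, and the toy vectors are hypothetical names and data, not taken from the notebook.

```python
from typing import Dict, List, Tuple


def build_embedding_matrix(
    word_index: Dict[str, int],
    embeddings_index: Dict[str, List[float]],
    dim: int,
) -> Tuple[List[List[float]], int, int]:
    """Return (matrix, hits, misses) for a vocabulary against pretrained vectors."""
    # Row 0 is reserved for padding, so the matrix needs max index + 1 rows.
    num_rows = max(word_index.values(), default=0) + 1
    matrix = [[0.0] * dim for _ in range(num_rows)]
    hits = misses = 0
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = list(vector)
            hits += 1
        else:
            misses += 1  # the row stays all zeros
    return matrix, hits, misses


# Toy vocabulary containing stemmed tokens, and toy 2-dimensional pretrained vectors.
word_index = {"book": 1, "leav": 2, "studi": 3}
pretrained = {"book": [0.1, 0.2], "leave": [0.3, 0.4], "study": [0.5, 0.6]}
_, hits, misses = build_embedding_matrix(word_index, pretrained, dim=2)
print("Converted %d words (%d misses)" % (hits, misses))
# prints Converted 1 words (2 misses)
```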

When you stem a word, say "leaving" or "studies", it is converted into "leav" or "studi". The stemmed token does not have to be a real word, which is why it may not be represented in a pretrained model. In conclusion, it is wiser to use lemmatization instead of stemming here.
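
The difference can be made concrete with a toy example. The suffix rules, the lemma dictionary, and the three-word vocabulary below are all invented for illustration; they only mimic the behaviour described above and are not a real stemmer, lemmatizer, or GloVe vocabulary.

```python
from typing import Dict

# Toy suffix rules that mimic the behaviour described above; not a real Porter stemmer.
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("s", ""))

# Toy lemma dictionary; a real lemmatizer relies on a full lexicon.
LEMMAS: Dict[str, str] = {"leaving": "leave", "studies": "study"}

# Stand-in for the vocabulary of a pretrained embedding model.
PRETRAINED_VOCAB = {"leave", "study", "book"}


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("leaving", "studies"):
    stem, lemma = toy_stem(word), toy_lemmatize(word)
    print(word, stem, stem in PRETRAINED_VOCAB, lemma, lemma in PRETRAINED_VOCAB)
# prints:
# leaving leav False leave True
# studies studi False study True
```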

Notebook v3

- GloVe 50d
+ GloVe 300d

Using the 300d model instead of the 50d one gives better accuracy on the training set. However, the model does not generalize well: its accuracy on the test data remains low.

Notebook v4

- stemming
+ lemmatization
+ 10 more epochs

With lemmatization, more words are represented in the pretrained model.

Converted 18392 words (1608 misses)

We have not used the callback method yet. Still, together with the changes above, training for more epochs improved the model noticeably.
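
Assuming "the callback method" refers to an early-stopping callback (for example Keras's `EarlyStopping`, which monitors a metric such as `val_loss` with a `patience` setting), the underlying logic can be sketched in plain Python. The function name and the loss values below are made up for illustration and do not come from the notebooks.

```python
from typing import List, Tuple


def early_stopping_epoch(val_losses: List[float], patience: int = 3) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once val_loss has not improved for `patience` consecutive
    epochs, or at the last epoch if that never happens.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Made-up validation losses, not taken from the notebooks.
history = [1.90, 1.70, 1.65, 1.68, 1.80, 2.10, 2.40]
print(early_stopping_epoch(history, patience=3))
# prints (3, 6)
```

Here the best validation loss occurs at epoch 3, and training would stop at epoch 6 after three epochs without improvement.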

version	Model performance	comments
v1	loss: 1.1478 - acc: 0.5899 - val_loss: 1.6672 - val_acc: 0.4765
v2	loss: 1.0414 - acc: 0.6228 - val_loss: 2.1074 - val_acc: 0.4483
v3	loss: 0.6292 - acc: 0.7876 - val_loss: 2.1436 - val_acc: 0.4430	at the 15th epoch
v4	loss: 0.3467 - acc: 0.8859 - val_loss: 2.7714 - val_acc: 0.5624	at the 25th epoch
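
To read the table above with generalization in mind, one option is to compute the gap between training and validation accuracy for each version. The sketch below parses the log lines copied from the table; the helper name `parse_metrics` is hypothetical, and the gap is just one simple indicator of overfitting.

```python
import re
from typing import Dict

# Final log lines copied from the table above.
LOGS = {
    "v1": "loss: 1.1478 - acc: 0.5899 - val_loss: 1.6672 - val_acc: 0.4765",
    "v2": "loss: 1.0414 - acc: 0.6228 - val_loss: 2.1074 - val_acc: 0.4483",
    "v3": "loss: 0.6292 - acc: 0.7876 - val_loss: 2.1436 - val_acc: 0.4430",
    "v4": "loss: 0.3467 - acc: 0.8859 - val_loss: 2.7714 - val_acc: 0.5624",
}


def parse_metrics(log_line: str) -> Dict[str, float]:
    """Turn 'name: value - name: value' into a {name: value} dictionary."""
    return {name: float(value) for name, value in re.findall(r"(\w+): ([\d.]+)", log_line)}


gaps = {}
for version, line in LOGS.items():
    metrics = parse_metrics(line)
    # Gap between training and validation accuracy, rounded to 4 decimals.
    gaps[version] = round(metrics["acc"] - metrics["val_acc"], 4)
    print(version, gaps[version])
```

The gap widens from about 0.11 in v1 to more than 0.3 in v3 and v4, which is consistent with the remark that the larger models fit the training set much better than they generalize.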

++++++++++++++++++++++++++++++++++++++++++++++++++++

This project was created for the course MATH410 Natural Language Processing with Deep Learning in the Master Program in Data Science at Galatasaray University. https://ects.gsu.edu.tr/en/program/index/193

This work is supported by SFEIR https://www.sfeir.com/fr/

Credits

I use course materials from

https://openclassrooms.com/

https://www.coursera.org

https://web.stanford.edu/class/cs224n/

╔═╗─╔╗──╔╗────────╔╗─╔╗────────────────────────╔═══╗
║║╚╗║║─╔╝╚╗───────║║─║║────────────────────────║╔═╗║
║╔╗╚╝╠═╩╗╔╬╗╔╦═╦══╣║─║║──╔══╦═╗╔══╦╗╔╦══╦══╦══╗║╚═╝╠═╦══╦══╦══╦══╦══╦╦═╗╔══╗
║║╚╗║║╔╗║║║║║║╔╣╔╗║║─║║─╔╣╔╗║╔╗╣╔╗║║║║╔╗║╔╗║║═╣║╔══╣╔╣╔╗║╔═╣║═╣══╣══╬╣╔╗╣╔╗║
║║─║║║╔╗║╚╣╚╝║║║╔╗║╚╗║╚═╝║╔╗║║║║╚╝║╚╝║╔╗║╚╝║║═╣║║──║║║╚╝║╚═╣║═╬══╠══║║║║║╚╝║
╚╝─╚═╩╝╚╩═╩══╩╝╚╝╚╩═╝╚═══╩╝╚╩╝╚╩═╗╠══╩╝╚╩═╗╠══╝╚╝──╚╝╚══╩══╩══╩══╩══╩╩╝╚╩═╗║
───────────────────────────────╔═╝║─────╔═╝║────────────────────────────╔═╝║
───────────────────────────────╚══╝─────╚══╝────────────────────────────╚══╝

yasarigno/NLP_DeepLearning_Course
