This repository contains implementations and analyses for binary and multiclass classification tasks using logistic regression, support vector machines (SVMs), and natural language processing (NLP). The project includes data preprocessing, model training, hyperparameter tuning, evaluation, and visualization of results.
## Table of Contents

- Overview
- Tasks and Features
- Data Preparation
- Binary Classification
- Multiclass Classification
- Results and Outputs
- Dependencies and Installation
- Usage
- Examples
- Future Work
- License
## Overview

This project demonstrates various machine learning techniques applied to text data, focusing on:
- Logistic regression with Elastic Net regularization.
- Support vector machines with multiple kernel types.
- Natural language processing tasks, including preprocessing, tokenization, and feature vectorization.
The dataset consists of sentences written by famous Russian authors, and the goal is to classify each sentence into binary or multiclass categories by author.
## Tasks and Features

- **Gradient Descent for Logistic Regression**:
  - Implemented with Elastic Net regularization (a minimal sketch follows this list).
  - Visualized decision boundaries and loss trends.
- **Support Vector Machines**:
  - Explored the impact of different kernels (linear, polynomial, RBF) and their hyperparameters.
- **Natural Language Processing**:
  - Preprocessing: tokenization, stopword removal, lemmatization, stemming.
  - Feature extraction: Bag of Words (BoW) and TF-IDF vectorization.
- **Binary Classification**:
  - Compared logistic regression and SVM models on a two-class problem.
- **Multiclass Classification**:
  - Extended logistic regression to multiclass classification using the One-vs-One strategy.
- **Visualization**:
  - Confusion matrices, ROC curves, and decision boundaries.
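A minimal, self-contained sketch of gradient descent for logistic regression with an Elastic Net penalty. It is illustrative only: the learning rate, penalty strengths, and toy data are assumptions, not the repository's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_elastic_net(X, y, lr=0.1, l1=0.01, l2=0.01, n_iter=1000):
    """Minimize log loss + l1*||w||_1 + (l2/2)*||w||_2^2 by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        # Log-loss gradient plus the Elastic Net terms: l2*w for the ridge
        # part and l1*sign(w) as a subgradient of the lasso part.
        grad_w = X.T @ (p - y) / n + l2 * w + l1 * np.sign(w)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Quick check on a linearly separable toy problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = fit_logreg_elastic_net(X, y)
print("train accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())
```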
## Data Preparation

- **Dataset Creation**:
  - Selected six authors: Pushkin, Dostoevsky, Tolstoy, Chekhov, Gogol, and Turgenev.
  - Extracted sentences from their works.
  - Dropped sentences shorter than 15 characters.
  - Created a balanced dataset with a specified sample size per author.
- **Preprocessing**:
  - Tokenized sentences into words.
  - Removed stopwords, punctuation, and numbers.
  - Applied stemming or lemmatization for normalization.
- **Vectorization** (a sketch of the preprocessing and vectorization pipeline follows this list):
  - Bag of Words (BoW): encodes raw word frequencies.
  - TF-IDF: weights word frequencies by inverse document frequency, so words that appear in many sentences contribute less.
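A minimal sketch of this pipeline, assuming NLTK's Russian stopword list and Snowball stemmer together with scikit-learn's `TfidfVectorizer`; the repository's exact tokenizer and normalizer may differ.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

RU_STOPWORDS = set(stopwords.words("russian"))
STEMMER = SnowballStemmer("russian")

def preprocess(sentence: str) -> str:
    # Lowercase and keep only Cyrillic words, which drops punctuation and numbers.
    tokens = re.findall(r"[а-яё]+", sentence.lower())
    # Remove stopwords and reduce each remaining word to its stem.
    return " ".join(STEMMER.stem(t) for t in tokens if t not in RU_STOPWORDS)

corpus = [preprocess(s) for s in [
    "Владимир отпер комоды и ящики, занялся разбором бумаг.",
]]
print(corpus[0])  # e.g. "владимир отпер комод ящик заня разбор бумаг"

# TF-IDF turns the cleaned sentences into a sparse document-term matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (n_sentences, vocabulary_size)
```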
## Binary Classification

Goal: classify sentences written by two authors (e.g., Pushkin and Dostoevsky).

- Preprocess and vectorize the data.
- Split the dataset into training (70%) and testing (30%) sets.
- Train two models (a training sketch follows this list):
  - Logistic Regression
  - SVM with a linear kernel
- Use `GridSearchCV` to tune hyperparameters based on the F1-score.
- Evaluate:
  - Metrics: accuracy, precision, recall, F1-score.
  - Visualizations: confusion matrices, ROC curves.
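A minimal sketch of this pipeline, assuming a TF-IDF matrix `X` and 0/1 labels `y` for the two authors; the hyperparameter grids are illustrative, not the repository's.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# 70/30 split, stratified so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

models = {
    "logreg": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
        scoring="f1",  # tune on the F1-score, as described above
    ),
    "linear_svm": GridSearchCV(
        SVC(kernel="linear", probability=True),  # probabilities for ROC curves
        {"C": [0.01, 0.1, 1, 10]},
        scoring="f1",
    ),
}

for name, search in models.items():
    search.fit(X_train, y_train)
    print(name, search.best_params_)
    print(classification_report(y_test, search.predict(X_test)))
```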
## Multiclass Classification

Goal: classify sentences into one of six classes (authors).

- Preprocess and vectorize the data.
- Split the dataset into training (70%) and testing (30%) sets.
- Train a multiclass logistic regression model using the One-vs-One strategy (a sketch follows this list).
- Use `GridSearchCV` for hyperparameter optimization.
- Evaluate:
  - Metrics: weighted accuracy, precision, recall, F1-score.
  - Visualizations: confusion matrices.
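A minimal sketch of One-vs-One logistic regression, again assuming a TF-IDF matrix `X` and integer author labels `y`; the grid values are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsOneClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# One-vs-One fits one binary classifier per pair of authors: 6 * 5 / 2 = 15.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))

# Parameters of the wrapped estimator are addressed with the `estimator__` prefix.
search = GridSearchCV(
    ovo, {"estimator__C": [0.01, 0.1, 1, 10]}, scoring="f1_weighted"
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print("weighted F1:", f1_score(y_test, y_pred, average="weighted"))
print(confusion_matrix(y_test, y_pred))
```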
## Results and Outputs

### Binary Classification

- Models: Logistic Regression and SVM.
- Metrics:
  - AUC for both models: ~0.90.
  - SVM showed a slight edge in precision and recall.
- Visualization:
  - ROC curves annotated with the decision threshold at a 30% false positive rate (a sketch of how to locate that threshold follows).
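A sketch of how such a threshold can be read off the ROC curve, assuming a fitted binary classifier `clf` that exposes `predict_proba` (e.g., one of the tuned models above):

```python
import numpy as np
from sklearn.metrics import roc_curve

scores = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities
fpr, tpr, thresholds = roc_curve(y_test, scores)

# roc_curve returns points with non-decreasing FPR, so take the last point
# whose false positive rate does not exceed 30%.
idx = np.searchsorted(fpr, 0.30, side="right") - 1
print(f"threshold={thresholds[idx]:.3f}  FPR={fpr[idx]:.3f}  TPR={tpr[idx]:.3f}")
```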
### Multiclass Classification

- Model: Logistic Regression with One-vs-One.
- Metrics:
  - Accuracy: ~57.6%.
  - Highest per-class accuracy for Gogol; lowest for Turgenev.
- Confusion matrix:
  - Gogol's sentences were the easiest to classify, while Pushkin's and Turgenev's were often misclassified.
## Examples

- Original sentence:
  - «Владимир отпер комоды и ящики, занялся разбором бумаг.» ("Vladimir unlocked the chests of drawers and boxes and set about sorting through the papers.")
- Processed sentence:
  - `владимир отпер комод ящик заня разбор бумаг`
## Future Work

- Experiment with deep learning models for text classification.
- Incorporate embeddings such as Word2Vec or BERT for feature extraction.
- Expand the dataset to include more authors and larger balanced samples.