This repository contains implementations and analyses for binary and multiclass classification tasks using logistic regression, support vector machines (SVMs), and natural language processing (NLP). The project includes data preprocessing, model training, hyperparameter tuning, evaluation, and visualization of results.
## Table of Contents

- Overview
- Tasks and Features
- Data Preparation
- Binary Classification
- Multiclass Classification
- Results and Outputs
- Dependencies and Installation
- Usage
- Examples
- Future Work
- License
## Overview

This project demonstrates various machine learning techniques applied to text data, focusing on:
- Logistic regression with Elastic Net regularization.
- Support vector machines with multiple kernel types.
- Natural language processing tasks, including preprocessing, tokenization, and feature vectorization.
The dataset consists of sentences written by famous Russian authors, and the goal is to classify each sentence into binary or multiclass categories by author.
## Tasks and Features

- **Gradient Descent for Logistic Regression**:
  - Implemented with Elastic Net regularization (a minimal sketch follows this list).
  - Visualized decision boundaries and loss trends.
- **Support Vector Machines**:
  - Explored the impact of different kernels (linear, polynomial, RBF) and their hyperparameters.
- **Natural Language Processing**:
  - Preprocessing: tokenization, stopword removal, lemmatization, stemming.
  - Feature extraction: Bag of Words (BoW) and TF-IDF vectorization.
- **Binary Classification**:
  - Compared logistic regression and SVM models on a two-class problem.
- **Multiclass Classification**:
  - Extended logistic regression to multiclass classification using the One-vs-One strategy.
- **Visualization**:
  - Confusion matrices, ROC curves, and decision boundaries.
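A minimal, self-contained sketch of gradient descent for logistic regression with an Elastic Net penalty. It is illustrative only: the learning rate, penalty strengths, and toy data are assumptions, not the repository's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_elastic_net(X, y, lr=0.1, l1=0.01, l2=0.01, n_iter=1000):
    """Minimize log loss + l1*||w||_1 + (l2/2)*||w||_2^2 by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        # Log-loss gradient plus the Elastic Net terms: l2*w for the ridge
        # part and l1*sign(w) as a subgradient of the lasso part.
        grad_w = X.T @ (p - y) / n + l2 * w + l1 * np.sign(w)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Quick check on a linearly separable toy problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = fit_logreg_elastic_net(X, y)
print("train accuracy:", ((sigmoid(X @ w + b) > 0.5) == y).mean())
```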
## Data Preparation

- **Dataset Creation**:
  - Selected six authors: Pushkin, Dostoevsky, Tolstoy, Chekhov, Gogol, and Turgenev.
  - Extracted sentences from their works.
  - Dropped sentences shorter than 15 characters.
  - Created a balanced dataset with a specified sample size per author.
- **Preprocessing**:
  - Tokenized sentences into words.
  - Removed stopwords, punctuation, and numbers.
  - Applied stemming or lemmatization for normalization.
- **Vectorization** (a sketch of the preprocessing and vectorization pipeline follows this list):
  - Bag of Words (BoW): encodes raw word frequencies.
  - TF-IDF: weights word frequencies by inverse document frequency, so words that appear in many sentences contribute less.
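A minimal sketch of this pipeline, assuming NLTK's Russian stopword list and Snowball stemmer together with scikit-learn's `TfidfVectorizer`; the repository's exact tokenizer and normalizer may differ.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

RU_STOPWORDS = set(stopwords.words("russian"))
STEMMER = SnowballStemmer("russian")

def preprocess(sentence: str) -> str:
    # Lowercase and keep only Cyrillic words, which drops punctuation and numbers.
    tokens = re.findall(r"[а-яё]+", sentence.lower())
    # Remove stopwords and reduce each remaining word to its stem.
    return " ".join(STEMMER.stem(t) for t in tokens if t not in RU_STOPWORDS)

corpus = [preprocess(s) for s in [
    "Владимир отпер комоды и ящики, занялся разбором бумаг.",
]]
print(corpus[0])  # e.g. "владимир отпер комод ящик заня разбор бумаг"

# TF-IDF turns the cleaned sentences into a sparse document-term matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (n_sentences, vocabulary_size)
```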
## Binary Classification

Goal: classify sentences written by two authors (e.g., Pushkin and Dostoevsky).

- Preprocess and vectorize the data.
- Split the dataset into training (70%) and testing (30%) sets.
- Train two models (a training sketch follows this list):
  - Logistic Regression
  - SVM with a linear kernel
- Use `GridSearchCV` to tune hyperparameters based on the F1-score.
- Evaluate:
  - Metrics: accuracy, precision, recall, F1-score.
  - Visualizations: confusion matrices, ROC curves.
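A minimal sketch of this pipeline, assuming a TF-IDF matrix `X` and 0/1 labels `y` for the two authors; the hyperparameter grids are illustrative, not the repository's.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# 70/30 split, stratified so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

models = {
    "logreg": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
        scoring="f1",  # tune on the F1-score, as described above
    ),
    "linear_svm": GridSearchCV(
        SVC(kernel="linear", probability=True),  # probabilities for ROC curves
        {"C": [0.01, 0.1, 1, 10]},
        scoring="f1",
    ),
}

for name, search in models.items():
    search.fit(X_train, y_train)
    print(name, search.best_params_)
    print(classification_report(y_test, search.predict(X_test)))
```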
## Multiclass Classification

Goal: classify sentences into one of six classes (authors).

- Preprocess and vectorize the data.
- Split the dataset into training (70%) and testing (30%) sets.
- Train a multiclass logistic regression model using the One-vs-One strategy (a sketch follows this list).
- Use `GridSearchCV` for hyperparameter optimization.
- Evaluate:
  - Metrics: weighted accuracy, precision, recall, F1-score.
  - Visualizations: confusion matrices.
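A minimal sketch of One-vs-One logistic regression, again assuming a TF-IDF matrix `X` and integer author labels `y`; the grid values are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsOneClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# One-vs-One fits one binary classifier per pair of authors: 6 * 5 / 2 = 15.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))

# Parameters of the wrapped estimator are addressed with the `estimator__` prefix.
search = GridSearchCV(
    ovo, {"estimator__C": [0.01, 0.1, 1, 10]}, scoring="f1_weighted"
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print("weighted F1:", f1_score(y_test, y_pred, average="weighted"))
print(confusion_matrix(y_test, y_pred))
```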
## Results and Outputs

### Binary Classification

- Models: Logistic Regression and SVM.
- Metrics:
  - AUC for both models: ~0.90.
  - SVM showed a slight edge in precision and recall.
- Visualization:
  - ROC curves annotated with the decision threshold at a 30% false positive rate (a sketch of how to locate that threshold follows).
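A sketch of how such a threshold can be read off the ROC curve, assuming a fitted binary classifier `clf` that exposes `predict_proba` (e.g., one of the tuned models above):

```python
import numpy as np
from sklearn.metrics import roc_curve

scores = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities
fpr, tpr, thresholds = roc_curve(y_test, scores)

# roc_curve returns points with non-decreasing FPR, so take the last point
# whose false positive rate does not exceed 30%.
idx = np.searchsorted(fpr, 0.30, side="right") - 1
print(f"threshold={thresholds[idx]:.3f}  FPR={fpr[idx]:.3f}  TPR={tpr[idx]:.3f}")
```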
### Multiclass Classification

- Model: Logistic Regression with One-vs-One.
- Metrics:
  - Accuracy: ~57.6%.
  - Highest per-class accuracy for Gogol; lowest for Turgenev.
- Confusion matrix:
  - Gogol's sentences were the easiest to classify, while Pushkin's and Turgenev's were often misclassified.
## Examples

- Original sentence:
  - «Владимир отпер комоды и ящики, занялся разбором бумаг.» ("Vladimir unlocked the chests of drawers and boxes and set about sorting through the papers.")
- Processed sentence:
  - `владимир отпер комод ящик заня разбор бумаг`
## Future Work

- Experiment with deep learning models for text classification.
- Incorporate embeddings such as Word2Vec or BERT for feature extraction.
- Expand the dataset to include more authors and larger balanced samples.