This project focuses on classifying news articles into various predefined categories using Natural Language Processing (NLP) and Machine Learning (ML) techniques. The main objective is to create a model that accurately categorizes articles into topics like Politics, Economics, Sports, Technology, Social, Cultural, and Miscellaneous.
- Overview
- Dataset
- Model Architecture
- Training Process
- Evaluation
- Results
- Future Improvements
- User Interface
This project leverages NLP techniques to preprocess and classify news articles. The model is designed to efficiently handle large volumes of textual data, assigning a single category to each article. The key steps are text preprocessing, feature extraction, and model training using various ML algorithms.
The dataset includes thousands of news articles labeled under seven different categories:
Politics, Economics, Sports, Technology, Social, Cultural, and Miscellaneous.
These categories were selected to cover a wide range of topics of general interest.
- Text Preprocessing: Tokenization, stopword removal, and punctuation cleaning were performed using Python's NLP libraries.
- Feature Extraction: The TF-IDF method was used to convert text into numerical features. We considered different n-gram ranges for feature extraction and experimented with the number of features to optimize model performance.
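The preprocessing and feature-extraction steps above can be sketched with scikit-learn's `TfidfVectorizer`, which handles lowercasing, tokenization, and stopword removal in one pass. This is a minimal illustration, not the project's actual code: the toy corpus is hypothetical, and the project may use a different NLP library (e.g. NLTK or spaCy) for preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real news articles (hypothetical examples).
docs = [
    "The government passed a new budget bill today.",
    "The striker scored twice in the championship final.",
    "A new smartphone chip promises faster AI inference.",
]

# TF-IDF with unigrams+bigrams and a capped vocabulary, mirroring the
# n-gram-range and feature-count experiments described above. Lowercasing,
# tokenization, and stopword removal are handled by the vectorizer itself.
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",   # built-in stop list; the project may use another
    ngram_range=(1, 2),     # unigrams and bigrams
    max_features=5000,      # cap on vocabulary size
)
X = vectorizer.fit_transform(docs)
# X is a sparse matrix with one TF-IDF row per article.
```

Tuning `ngram_range` and `max_features` trades vocabulary richness against dimensionality, which is exactly the experimentation described above.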
The classification model is built using a neural network with the following architecture:
- Input Layer: Accepts the TF-IDF feature vectors.
- Dense Layer 1: Contains 32 units with ReLU activation.
- Dense Layer 2: Contains 32 units with ReLU activation.
- Output Layer: For single-label classification, the output layer has 7 units (one for each category) with a softmax activation function.
- Loss Function: Categorical Crossentropy.
- Optimizer: Adam optimizer was used to minimize the loss function.
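For reference, the architecture above can be mirrored with scikit-learn's `MLPClassifier` (the original project may instead use Keras/TensorFlow): two 32-unit ReLU hidden layers, with softmax output and cross-entropy loss applied automatically for multiclass targets, optimized with Adam.

```python
from sklearn.neural_network import MLPClassifier

# Sketch of the network described above, assuming a scikit-learn stand-in:
# - two hidden Dense layers of 32 units with ReLU activation,
# - softmax output + categorical cross-entropy (automatic for multiclass),
# - Adam optimizer.
model = MLPClassifier(
    hidden_layer_sizes=(32, 32),  # Dense Layer 1 and Dense Layer 2
    activation="relu",
    solver="adam",
    random_state=0,
)
```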
The model was trained on a split dataset:
- Training Data: 80% of the dataset.
- Validation Data: 20% of the dataset.
- Batch Size: 32
- Epochs: 10
The training process involved backpropagation to minimize loss and improve accuracy.
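The split and training setup above can be sketched as follows. The feature matrix here is random synthetic data standing in for the real TF-IDF vectors, and `max_iter` plays the role of the epoch limit in scikit-learn's API; this is an illustrative sketch, not the project's training script.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the TF-IDF matrix: 100 "articles" x 50 features,
# with labels drawn from the 7 categories (the real dataset is far larger).
rng = np.random.default_rng(0)
X = rng.random((100, 50))
y = rng.integers(0, 7, size=100)

# 80/20 train/validation split, as described above.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# batch_size=32 and max_iter=10 mirror the batch size and epoch count above.
model = MLPClassifier(
    hidden_layer_sizes=(32, 32),
    batch_size=32,
    max_iter=10,
    random_state=0,
)
model.fit(X_train, y_train)  # backpropagation, driven by the Adam optimizer
```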
Evaluation metrics such as accuracy, precision, recall, and F1-score were used to assess the performance of the model. In addition, a confusion matrix was used to analyze misclassifications.
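These metrics can be computed with `sklearn.metrics`. The labels below are hypothetical predictions over the seven categories (encoded 0 through 6), used only to show how the numbers are derived:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical true/predicted labels standing in for real model output.
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 0, 1, 2])
y_pred = np.array([0, 1, 2, 3, 4, 5, 6, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred)  # 8 of 10 correct -> 0.8
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
# Rows are true labels, columns are predicted labels; off-diagonal cells
# are the misclassifications the analysis above looks for.
cm = confusion_matrix(y_true, y_pred)
```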
- Training Accuracy: 98%
- Test Accuracy: 83%
- Training Loss: 0.1
- Test Loss: 0.8

The gap between training and test performance suggests some overfitting, which the improvements below aim to address.
- Hyperparameter Tuning: Experiment with different learning rates, batch sizes, and optimizer algorithms.
- Data Augmentation: Increase dataset size by scraping more news articles to improve generalization.
- Advanced NLP Techniques: Implement models like BERT or GPT for improved classification accuracy.
A web-based user interface was created using React on the front end and Django (FastAPI) on the back end. This interface allows users to:
- Upload new articles for classification.
- View classification results instantly on the dashboard.
- Analyze model performance through real-time visualizations and feedback.
This project is licensed under the MIT License, and I’d be thrilled if you use and improve my work!