News Category Classification

This project focuses on classifying news articles into various predefined categories using Natural Language Processing (NLP) and Machine Learning (ML) techniques. The main objective is to create a model that accurately categorizes articles into topics like Politics, Economics, Sports, Technology, Social, Cultural, and Miscellaneous.

Table of Contents

  • Overview
  • Dataset
  • Preprocessing
  • Model Architecture
  • Loss and Optimizer
  • Training Process
  • Evaluation
  • Results
  • Future Improvements
  • User Interface
  • License

Overview

This project leverages NLP techniques to preprocess and classify news articles. The model is designed to handle large volumes of text efficiently, assigning a single category to each article. The key steps are text preprocessing, feature extraction, and model training with several ML algorithms, summarized in the road map below.

(Figure: project road map)

Dataset

The dataset includes thousands of news articles labeled under seven different categories:

Politics, Economics, Sports, Technology, Social, Cultural, and Miscellaneous.

These categories were selected to cover a wide range of topics of general interest.

Preprocessing

  • Text Preprocessing: Tokenization, stopword removal, and punctuation cleaning were performed using Python's NLP libraries.
  • Feature Extraction: The TF-IDF method was used to convert text into numerical features. We considered different n-gram ranges for feature extraction and experimented with the number of features to optimize model performance.
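A minimal sketch of this stage is shown below, using scikit-learn's TfidfVectorizer. The stopword list, n-gram range, feature count, and sample articles are illustrative assumptions, not the project's exact settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder articles; in the project these come from the labeled dataset.
articles = [
    "The government announced a new economic policy today.",
    "The home team won the championship final in extra time.",
]

# TfidfVectorizer lowercases, tokenizes, and strips punctuation by default;
# stop_words drops common words, and ngram_range / max_features are the
# knobs mentioned above for tuning the feature set.
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    max_features=5000,
)
X = vectorizer.fit_transform(articles)
print(X.shape)  # (n_articles, n_features)
```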

Model Architecture

The classification model is built using a neural network with the following architecture:

  1. Input Layer: Accepts the TF-IDF feature vectors.
  2. Dense Layer 1: Contains 32 units with ReLU activation.
  3. Dense Layer 2: Contains 32 units with ReLU activation.
  4. Output Layer: For single-label (multi-class) classification, the output layer has 7 units (one per category) with a softmax activation function.
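As a hedged sketch, assuming a TensorFlow/Keras implementation (the framework and the feature-vector size are assumptions; the layer sizes and seven-category output follow the description above):

```python
from tensorflow.keras import layers, models

num_features = 5000  # length of the TF-IDF vectors (illustrative)
num_classes = 7      # Politics, Economics, Sports, Technology, Social, Cultural, Miscellaneous

model = models.Sequential([
    layers.Input(shape=(num_features,)),              # TF-IDF feature vector
    layers.Dense(32, activation="relu"),              # Dense layer 1
    layers.Dense(32, activation="relu"),              # Dense layer 2
    layers.Dense(num_classes, activation="softmax"),  # one unit per category
])
model.summary()
```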

(Figure: model architecture)

Loss and Optimizer

  • Loss Function: Categorical Crossentropy.
  • Optimizer: Adam optimizer was used to minimize the loss function.
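Continuing the same assumed Keras sketch, the compile step would look roughly like this (use sparse_categorical_crossentropy instead if the labels are integer-encoded rather than one-hot):

```python
model.compile(
    optimizer="adam",                 # Adam optimizer
    loss="categorical_crossentropy",  # expects one-hot encoded labels
    metrics=["accuracy"],
)
```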

Training Process

The model was trained on a split dataset:

  • Training Data: 80% of the dataset.
  • Validation Data: 20% of the dataset.
  • Batch Size: 32
  • Epochs: 10

The training process involved backpropagation to minimize loss and improve accuracy.
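A sketch of the training call under the same assumptions, using the split, batch size, and epoch count listed above; the placeholder arrays stand in for the real TF-IDF features and one-hot labels:

```python
import numpy as np

# Placeholder data with the shapes the model expects; in the project these are
# the TF-IDF features and one-hot encoded category labels.
rng = np.random.default_rng(0)
X_train = rng.random((1000, num_features), dtype=np.float32)
y_train = np.eye(num_classes, dtype=np.float32)[rng.integers(0, num_classes, size=1000)]

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,  # 80% train / 20% validation
    batch_size=32,
    epochs=10,
)
```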

(Figures: training and validation accuracy and loss curves)

Evaluation

Evaluation metrics such as accuracy, precision, recall, and F1-score were used to assess the model's performance. In addition, a confusion matrix was used to analyze misclassifications.
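These metrics can be computed with scikit-learn once the model produces class predictions. The sketch below assumes the Keras model from the sections above and placeholder test data:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder test data with the same shapes as the training sketch.
rng = np.random.default_rng(1)
X_test = rng.random((200, num_features), dtype=np.float32)
y_test = rng.integers(0, num_classes, size=200)

# Predicted class = index of the largest softmax output.
y_pred = np.argmax(model.predict(X_test), axis=1)

print(classification_report(y_test, y_pred))  # precision, recall, F1 per class plus accuracy
print(confusion_matrix(y_test, y_pred))       # rows: true labels, columns: predicted labels
```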

(Figure: confusion matrix)

Results

  • Training Accuracy: 98%
  • Test Accuracy: 83%
  • Training Loss: 0.1
  • Test Loss: 0.8

Future Improvements

  1. Hyperparameter Tuning: Experiment with different learning rates, batch sizes, and optimizer algorithms.
  2. Data Augmentation: Increase dataset size by scraping more news articles to improve generalization.
  3. Advanced NLP Techniques: Implement models like BERT or GPT for improved classification accuracy.
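As an illustration of item 3, a pretrained transformer could replace the TF-IDF + dense network. The snippet below is only a sketch using Hugging Face transformers; the checkpoint name and setup are assumptions, not part of this project:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "bert-base-uncased" is only an example checkpoint; a multilingual or
# domain-specific model may suit the dataset better.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=7
)

# Fine-tuning on the labeled articles is still required; this only shows the
# input/output shapes of the pretrained classifier head.
inputs = tokenizer(
    "The home team won the championship final.",
    return_tensors="pt",
    truncation=True,
)
logits = bert_model(**inputs).logits
print(logits.shape)  # (1, 7) -- one score per category
```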

User Interface

A web-based user interface was built with React on the front end and a Python back end (Django/FastAPI). This interface allows users to:

  • Upload new articles for classification.
  • View classification results instantly on the dashboard.
  • Analyze model performance through real-time visualizations and feedback.
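As a rough sketch of how the back end could expose classification to the React front end (FastAPI shown here; the route, request shape, and placeholder prediction are hypothetical, not the project's actual API):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

CATEGORIES = [
    "Politics", "Economics", "Sports", "Technology",
    "Social", "Cultural", "Miscellaneous",
]

class Article(BaseModel):
    text: str

@app.post("/classify")
def classify_article(article: Article):
    # In the real service this would run the saved TF-IDF vectorizer and the
    # trained model on article.text; a fixed index stands in for that here.
    predicted_index = 0  # placeholder prediction
    return {"category": CATEGORIES[predicted_index]}
```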


License

This project is released under the MIT License, and I’d be thrilled if you use and improve my work!
