classify news into five categories: business, entertainment, politics, sport, and tech.
Introduction:
The task of the project was to classify news articles into five categories: business, entertainment, politics, sport, and tech.
The BBC News dataset was used for this task.
Preprocessing:
The dataset was loaded using the datasets library, and the text was cleaned by removing punctuation and stopwords.
The text was then tokenized using the Hugging Face tokenizer, and the labels were one-hot encoded. The tokenized articles were split into training and validation sets, and the model was fine-tuned on the training set.
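The cleaning, label encoding, and splitting steps can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the stopword list, the label order, the 20% validation fraction, and the helper names (clean_text, one_hot, train_val_split) are all assumptions, and tokenization with the Hugging Face tokenizer is left out.

```python
import random
import string

# Assumed label order; the report does not specify one.
LABELS = ["business", "entertainment", "politics", "sport", "tech"]

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "has"}


def clean_text(text):
    """Remove punctuation, then drop stopwords (compared case-insensitively)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    kept = [w for w in no_punct.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)


def one_hot(label):
    """Encode a category name as a 5-element 0/1 vector."""
    return [1 if label == name else 0 for name in LABELS]


def train_val_split(examples, val_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


print(clean_text("The new iPhone is set to be released next month."))
print(one_hot("sport"))
```

Fixing the random seed keeps the split reproducible, so the validation scores can be compared across runs.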
Architecture and Fine-tuning:
The pre-trained BERT model was used as the architecture for the text classification task.
The model was fine-tuned on the tokenized training articles using the Adam optimizer with a learning rate of 2e-5 and a batch size of 32. Training ran for 5 epochs, and a checkpoint was saved after each epoch.
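A fine-tuning loop with these settings might look like the sketch below. It assumes PyTorch and the Hugging Face transformers library, the "bert-base-uncased" checkpoint (the report names only "BERT"), and a training dataset whose items are dicts of tensors with input_ids, attention_mask, and labels. The labels here are integer class ids rather than the one-hot vectors described above, because that is what the sequence classification head expects for its default cross-entropy loss. FineTuneConfig, fine_tune, and the checkpoint directory names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    checkpoint: str = "bert-base-uncased"  # assumption: exact BERT variant not stated
    num_labels: int = 5
    learning_rate: float = 2e-5
    batch_size: int = 32
    epochs: int = 5


def fine_tune(train_dataset, config=FineTuneConfig()):
    """Fine-tune BERT and save one checkpoint per epoch.

    train_dataset: a torch Dataset yielding dicts of tensors with the keys
    input_ids, attention_mask, and labels (integer class ids).
    """
    # Imported here so the configuration can be inspected without torch installed.
    import torch
    from torch.utils.data import DataLoader
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained(
        config.checkpoint, num_labels=config.num_labels
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
    loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)

    model.train()
    for epoch in range(1, config.epochs + 1):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch).loss  # loss is returned because labels are passed
            loss.backward()
            optimizer.step()
        # Hypothetical directory name for the per-epoch checkpoint.
        model.save_pretrained(f"checkpoint-epoch-{epoch}")
    return model
```

Saving a checkpoint per epoch makes it possible to go back to an earlier epoch if validation scores start to drop.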
Evaluation:
The trained model was evaluated on the validation set using accuracy, precision, recall, and F1-score metrics.
The model achieved an accuracy of 97.7%, precision of 97.9%, recall of 97.7%, and an F1-score of 97.7%.
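The report does not state how precision, recall, and F1 were averaged over the five classes. The sketch below assumes support-weighted averaging, under which recall always equals accuracy; that would be consistent with the matching 97.7% figures above, although other averaging schemes cannot be ruled out. The function name weighted_metrics and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / n  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example, not data from the report.
y_true = ["tech", "tech", "sport", "sport", "business"]
y_pred = ["tech", "sport", "sport", "sport", "business"]
print(weighted_metrics(y_true, y_pred))
```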
Discussion:
The model achieved high accuracy on the validation set, which indicates that it fits this particular dataset well.
However, the model may not perform as well on other datasets or on real-world data. Possible improvements include increasing the size of the training set, using a different pre-trained model architecture, or tuning the hyperparameters.
Sample Predictions:
Here are a few sample predictions made by the trained model:
Text: "The new iPhone is set to be released next month."
Predicted label: tech
Text: "The government has proposed a new tax policy."
Predicted label: politics
Text: "The latest movie from Steven Spielberg has received mixed reviews."
Predicted label: entertainment
Text: "The Manchester United soccer team won the game yesterday."
Predicted label: sport
Text: "The company has announced record profits for the year."
Predicted label: business
In each of these examples, the model assigns the sentence to the expected category.
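For reference, a predicted label is typically read off the classifier's five output scores (logits) by taking the highest-scoring class. The sketch below shows that decoding step in isolation; the label order and the example logits are made up for illustration, and predict_label is a hypothetical helper rather than part of the project.

```python
import math

# Assumed label order; it must match the order used during training.
LABELS = ["business", "entertainment", "politics", "sport", "tech"]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits in which the last class ("tech") scores highest.
label, confidence = predict_label([0.1, 0.2, 0.3, 0.4, 3.0])
print(label, round(confidence, 3))
```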