The dataset consists of 2225 documents from the BBC News website, corresponding to stories in five topical areas from 2004-2005.
Class Labels: 5 (business, entertainment, politics, sport, tech)
- dataset/data_files: Data folders, each containing several news txt files.
- dataset/dataset.csv: CSV file with two columns, "news" and "type". The "news" column holds the news article text and the "type" column holds its category (business, entertainment, politics, sport, or tech).
- model/get_data.py: Gathers all txt files into one CSV file containing two columns ("news", "type"). After successful execution, it creates dataset.csv in the dataset folder.
- model/model.py: Preprocessing, tf-idf feature extraction, model building, and evaluation.
- model/test.ipynb: Jupyter notebook.
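
The gathering step described for model/get_data.py can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name `build_dataset` is hypothetical, and it assumes that each subfolder of the data directory is named after its category and that the files are readable as UTF-8 (undecodable bytes are replaced).

```python
import csv
from pathlib import Path


def build_dataset(data_dir, out_csv):
    """Collect every .txt file under data_dir/<category>/ into one CSV.

    Assumption: the subfolder name is the category label.
    Returns the number of rows written.
    """
    rows = []
    for category_dir in sorted(Path(data_dir).iterdir()):
        if not category_dir.is_dir():
            continue
        for txt_file in sorted(category_dir.glob("*.txt")):
            # errors="replace" is a defensive assumption for stray bytes
            text = txt_file.read_text(encoding="utf-8", errors="replace")
            rows.append({"news": text.strip(), "type": category_dir.name})

    # Write the two columns named in this README: "news" and "type"
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["news", "type"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Under these assumptions, calling `build_dataset("dataset/data_files", "dataset/dataset.csv")` from the repository root would produce the CSV described above.
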
The feature-extracted dataset was divided into a train set and a test set. The train set contains 1780 examples and the test set contains 445 examples, an 80/20 split of the 2225 documents.
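
As a rough sketch of how such a split produces these sizes, the snippet below shuffles the examples and holds out 20% of them. The helper name `split_dataset`, the seed, and the use of a plain random shuffle are assumptions for illustration; this README does not say whether the actual split was random or stratified.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of examples and hold out test_fraction of them as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumed) for repeatability
    n_test = round(len(shuffled) * test_fraction)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]


train, test = split_dataset(range(2225))
print(len(train), len(test))  # 1780 445
```

scikit-learn's `train_test_split` provides the same operation, with optional stratification by class label.
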
The table below shows the results on the test set.
Metric | Value |
---|---|
Kappa | 0.9461 |
Accuracy | 0.9573 |
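
For reference, both metrics can be computed from the true and predicted labels. Accuracy is the fraction of correct predictions, and Cohen's kappa corrects that figure for chance agreement: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected from the label frequencies alone. The sketch below is a plain-Python illustration with made-up toy labels, not the evaluation code from model/model.py; the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected for chance.

    Undefined (division by zero) when the expected agreement is exactly 1.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy labels for illustration only (not the results reported above)
y_true = ["sport", "sport", "tech", "tech"]
y_pred = ["sport", "tech", "tech", "tech"]
print(accuracy(y_true, y_pred))     # 0.75
print(cohen_kappa(y_true, y_pred))  # 0.5
```

An accuracy of 0.9573 on 445 test examples is consistent with 426 correct predictions (426 / 445 ≈ 0.9573).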