News Groups Classifier

An example of refactoring a procedural approach to feature extraction for a text classification problem into an object-oriented (OOP) approach to the same problem. More specifically, the code in this repo demonstrates an OOP approach to classifying posts made by users to newsgroups. The code for the procedural approach is provided by the Scikit-Learn library. The newsgroups data is provided by the (now classic) 20 newsgroups dataset.

Setup

This code was built and tested with Python v3.7.0 and pipenv v2018.10.9. The instructions for running this code assume pipenv is the tool used for spinning up virtual environments, but pip works great, too, so just adapt the instructions as necessary.

Dependencies

NumPy: v1.5.2 or higher
Scikit-Learn: v0.20.0 or higher
SciPy: v1.1.0 or higher
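
To compare an environment against the minimum versions listed above, a short check like the following may help. This is only a sketch and not part of the repo: the helper name installed_version is hypothetical, and it assumes each package exposes a __version__ attribute (NumPy, Scikit-Learn, and SciPy all do).

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


# Import names for the three dependencies (Scikit-Learn imports as "sklearn").
versions = {name: installed_version(name) for name in ("numpy", "sklearn", "scipy")}

for name, version in versions.items():
    print(name, version or "not installed")
```

The printed versions can then be compared by eye with the minimums in the list.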

Run

The quickest way to get running is to git clone this repo, cd into its root directory, and execute the following commands at the command line.

Install dependencies

$ pipenv install

Activate virtual environment

$ pipenv shell

Run script

$ python train-classifier.py

Explore

To explore the output from the classifier more easily, run the script in Python's interactive mode with the following command:

$ python -i train-classifier.py

Fit Classifier on Other Data

The following code demonstrates how easy it is to train on other data.

# import NewsGroupsClassifier
from NewsGroupsClassifiers import NewsGroupsClassifier

# instantiate with defaults; a pipeline object and a parameter dict
# can be passed in to customize the fit
clf = NewsGroupsClassifier()

# train_examples and train_labels are the text training data and its labels
clf.fit(train_examples, train_labels)

# view summary report
clf.summary
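
For readers unsure what train_examples and train_labels might look like, here is a minimal sketch. It assumes, without confirmation from this repo, that fit accepts a list of raw text strings and a parallel list of labels, as Scikit-Learn text pipelines do. A plain Scikit-Learn Pipeline stands in for NewsGroupsClassifier, and the tiny dataset is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Illustrative data only: one raw text string per example, one label per string.
train_examples = [
    "The goalie made a great save in overtime",
    "The team traded its best defenseman",
    "The shuttle launch was delayed by weather",
    "The probe sent back images of the moon",
]
train_labels = ["hockey", "hockey", "space", "space"]

# Stand-in for NewsGroupsClassifier (an assumption, not the repo's class):
# TF-IDF feature extraction followed by a naive Bayes classifier.
stand_in = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

stand_in.fit(train_examples, train_labels)
predictions = stand_in.predict(train_examples)
print(list(predictions))
```

If NewsGroupsClassifier follows the same convention, the two lists can be passed to clf.fit unchanged.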

Future Improvements

Encapsulate training data in a data object and pass to classifier
Better error and exception handling (log exceptions and stack traces to file on disk)
Abstract feature extraction pipeline components into own object (use FeatureUnion to combine extraction and classifier pipeline components)
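
As a rough illustration of the first item, the training data could be wrapped in a small data object. This is a hypothetical sketch, not code from the repo: the class name TrainingData and its fields are assumptions, and it uses dataclasses, which are available from Python 3.7.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingData:
    """Hypothetical container pairing text examples with their labels."""

    examples: List[str]
    labels: List[str]

    def __post_init__(self):
        # Fail early if the two lists cannot be paired one-to-one.
        if len(self.examples) != len(self.labels):
            raise ValueError(
                "examples and labels must have the same length: "
                f"{len(self.examples)} != {len(self.labels)}"
            )

    def __len__(self):
        return len(self.examples)


data = TrainingData(
    examples=["first post text", "second post text"],
    labels=["group.a", "group.b"],
)
print(len(data))
```

With the current interface, such an object could be unpacked as clf.fit(data.examples, data.labels); accepting the object directly would require changing the classifier.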

Authors

Keith Dowd <keith.dowd at gmail dot com>

Acknowledgements

Thank you kindly for allowing me the opportunity to participate in this code challenge! 😄
