Example of refactoring a procedural approach to feature extraction for a text classification problem into an object-oriented (OOP) approach to the same problem. More specifically, the code in this repo demonstrates an OOP approach to classifying posts made by users to newsgroups. The code for the procedural approach is provided by the Scikit-Learn library. The newsgroups data is provided by the (now classic) 20 newsgroups dataset.
This code was built and tested with Python v3.7.0 and pipenv v2018.10.9. The instructions for running this code assume pipenv is the tool used for spinning up virtual environments, but pip works great, too, so just adapt the instructions as necessary.
- NumPy: v1.5.2 or higher
- Scikit-Learn: v0.20.0 or higher
- SciPy: v1.1.0 or higher
The quickest way to get running is to git clone this repo, cd into its root directory, and execute the following at the command line.
Install dependencies
$ pipenv install
Activate virtual environment
$ pipenv shell
Run script
$ python train-classifier.py
To explore the output from the classifier more easily, run the script in Python's interactive mode with the following command:
$ python -i train-classifier.py
The following code demonstrates how easy it is to train on other data.
# import NewsGroupsClassifier
from NewsGroupsClassifiers import NewsGroupsClassifier
# instantiate with default settings (a pipeline object and parameter dict can be passed in to customize fit)
clf = NewsGroupsClassifier()
# train_examples, train_labels are text training data
clf.fit(train_examples, train_labels)
# view summary report
clf.summary
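As a minimal sketch of preparing "other data" for the call above, the helper below builds the two parallel lists from a folder of text files. The function name `load_labeled_texts` and the `root/<label>/<file>` directory layout are assumptions made for illustration; they are not part of this repo.

```python
from pathlib import Path


def load_labeled_texts(root):
    """Return (examples, labels) from a root/<label>/<file> directory layout.

    The layout is an assumption made for this sketch: each subdirectory
    name is used as the label for every file inside it.
    """
    examples, labels = [], []
    # Sort directories and files so the output order is reproducible
    for label_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for text_file in sorted(f for f in label_dir.iterdir() if f.is_file()):
            # errors="replace" keeps going if a post is not valid UTF-8
            examples.append(
                text_file.read_text(encoding="utf-8", errors="replace")
            )
            labels.append(label_dir.name)
    return examples, labels
```

With a layout like that in place, `train_examples, train_labels = load_labeled_texts("my_corpus")` would produce inputs for `clf.fit`. The labels here are strings; if the classifier expects integer ids instead, map them before fitting.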
- Encapsulate training data in a data object and pass to classifier
- Better error and exception handling (log exceptions and stack traces to file on disk)
- Abstract feature extraction pipeline components into own object (use FeatureUnion to combine extraction and classifier pipeline components)
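The first two items above could be approached roughly as follows. This is only a sketch using the standard library: `TrainingData`, `make_file_logger`, and `fit_logged` are hypothetical names, and the only thing assumed about the classifier is the `fit(examples, labels)` call shown earlier.

```python
import logging
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingData:
    """Hypothetical container bundling examples with their labels."""
    examples: List[str]
    labels: List[str]

    def __post_init__(self):
        # Catch misaligned inputs before they reach the classifier
        if len(self.examples) != len(self.labels):
            raise ValueError(
                f"{len(self.examples)} examples but {len(self.labels)} labels"
            )


def make_file_logger(path, name="train-classifier"):
    """Return a logger that appends records to the file at `path`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on reuse
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def fit_logged(clf, data, logger):
    """Call clf.fit; log any exception with its traceback, then re-raise."""
    try:
        return clf.fit(data.examples, data.labels)
    except Exception:
        # logger.exception logs at ERROR level and appends the stack trace
        logger.exception("Training failed")
        raise
```

A call such as `fit_logged(NewsGroupsClassifier(), TrainingData(train_examples, train_labels), make_file_logger("train.log"))` would then leave a traceback in `train.log` if training fails. Here the wrapper unpacks the data object itself; having the classifier accept the object directly, as the item describes, would require changing the classifier.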
- Keith Dowd <keith.dowd at gmail dot com>
- Thank you kindly for allowing me the opportunity to participate in this code challenge! 😄