We're almost ready to do some machine learning! First, we need to turn our sentences into the kind of feature vectors that the algorithm we plan to work with expects. Jumping ahead a bit, the scikit-learn implementation of the algorithm we will use for unsupervised learning requires the text to be in bag-of-words form: the unique words in the text, each paired with the number of times it occurs.
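As a quick illustration of the bag-of-words idea, here is a minimal sketch using only the Python standard library. The token list is a made-up toy input, and Counter is just a stand-in for what a vectorizer does internally for a single document.

```python
from collections import Counter

# A tokenized sentence as a list of strings (toy input for illustration).
tokens = ["the", "jury", "said", "the", "jury", "was", "fair"]

# A bag of words discards word order and keeps only each unique word
# together with the number of times it occurs.
bag = Counter(tokens)

print(bag["jury"])   # count for a single word
print(sorted(bag))   # the unique words
```

Note that the order of the words is lost: only the words and their counts remain.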
Let's take the data we just saved out and load it back into a dataframe so that we can do some analysis with it!
import pandas as pd
df = pd.read_csv("df_news_romance.csv")
df.head()
| | label | sentence | NN | JJ |
|---|---|---|---|---|
| 0 | news | ['The', 'Fulton', 'County', 'Grand', 'Jury', '... | 11 | 2 |
| 1 | news | ['The', 'jury', 'further', 'said', 'in', 'term... | 13 | 2 |
| 2 | news | ['The', 'September-October', 'term', 'jury', '... | 16 | 2 |
| 3 | news | ['``', 'Only', 'a', 'relative', 'handful', 'of... | 9 | 3 |
| 4 | news | ['The', 'jury', 'said', 'it', 'did', 'find', '... | 5 | 3 |
Next, we print the first five rows of the sentence column of the dataframe:
df['sentence'].head()
0 ['The', 'Fulton', 'County', 'Grand', 'Jury', '...
1 ['The', 'jury', 'further', 'said', 'in', 'term...
2 ['The', 'September-October', 'term', 'jury', '...
3 ['``', 'Only', 'a', 'relative', 'handful', 'of...
4 ['The', 'jury', 'said', 'it', 'did', 'find', '...
Name: sentence, dtype: object
We preprocess our data using scikit-learn's text feature extraction tools. In particular, we use CountVectorizer, which counts how many times each token occurs in each document. We can strip out stop words (words so common that they add little to the analysis, such as "the" and "a") using the stop_words keyword argument. A keyword argument is a function argument passed by name; such arguments usually have default values, which makes them optional.
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(df['sentence'])
CountVectorizer processes the text such that tf is a sparse matrix containing the count of each word in each document (here, each sentence is one document). A matrix is a table of numbers, and a sparse matrix is a table where most of those numbers are 0. tf is mostly 0 because most words appear in only a handful of the many documents that make up our sample corpus.
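A toy corpus makes the shape of such a matrix easier to see. The sketch below is illustrative only: the three sentences and the names toy_docs, toy_vectorizer, and toy_tf are made up for this example and are not part of our dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents (not taken from the Brown corpus).
toy_docs = [
    "The jury praised the election",
    "The county jury and the city jury",
    "She loved the quiet garden",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_tf = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept word to its column index in the matrix.
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))

# One row per document, one column per kept word.
n_docs, n_words = toy_tf.shape
print(n_docs, n_words)

# nnz is the number of stored (nonzero) entries; all other cells are zeros
# that the sparse format does not need to store.
density = toy_tf.nnz / (n_docs * n_words)
print(f"{density:.0%} of the cells are nonzero")

# Count of "jury" in the second document (row index 1).
jury_count = toy_tf[1, vocab["jury"]]
print(jury_count)
```

Each document uses only a few of the words in the vocabulary, so most cells are zero even here. With thousands of sentences and thousands of distinct words, the share of zeros becomes much larger, which is why a sparse representation is used.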
One document in the Brown corpus is the following sentence:
Mrs. Robert O. Spurdle is chairman of the committee , which includes Mrs. James A. Moody , Mrs. Frank C. Wilkinson , Mrs. Ethel Coles , Mrs. Harold G. Lacy , Mrs. Albert W. Terry , Mrs. Henry M. Chance , 2d , Mrs. Robert O. Spurdle , Jr. , Mrs. Harcourt N. Trimble , Jr. , Mrs. John A. Moller , Mrs. Robert Zeising , Mrs. William G. Kilhour , Mrs. Hughes Cauffman , Mrs. John L. Baringer and Mrs. Clyde Newman .
When CountVectorizer processes this sentence, it lowercases the text and removes the stop words, the punctuation, and single-character tokens such as the middle initials. This yields the words and their counts, which are listed and also visualized in a word cloud below. The creation of this visualization is discussed in an appendix.
{'2d': 1, 'albert': 1, 'baringer': 1, 'cauffman': 1, 'chairman': 1, 'chance': 1, 'clyde': 1, 'coles': 1,
'committee': 1, 'ethel': 1, 'frank': 1, 'harcourt': 1, 'harold': 1, 'henry': 1, 'hughes': 1, 'includes': 1,
'james': 1, 'john': 2, 'jr': 2, 'kilhour': 1, 'lacy': 1, 'moller': 1, 'moody': 1, 'mrs': 15, 'newman': 1,
'robert': 3, 'spurdle': 2, 'terry': 1, 'trimble': 1, 'wilkinson': 1, 'william': 1, 'zeising': 1}
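To see where these counts come from, the sketch below reproduces them with the standard library alone. It assumes CountVectorizer's default settings: lowercasing, and the default token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The small set toy_stop_words is a hand-written stand-in for scikit-learn's much larger English stop word list, covering just the stop words that occur in this sentence.

```python
import re
from collections import Counter

sentence = (
    "Mrs. Robert O. Spurdle is chairman of the committee , which includes "
    "Mrs. James A. Moody , Mrs. Frank C. Wilkinson , Mrs. Ethel Coles , "
    "Mrs. Harold G. Lacy , Mrs. Albert W. Terry , Mrs. Henry M. Chance , 2d , "
    "Mrs. Robert O. Spurdle , Jr. , Mrs. Harcourt N. Trimble , Jr. , "
    "Mrs. John A. Moller , Mrs. Robert Zeising , Mrs. William G. Kilhour , "
    "Mrs. Hughes Cauffman , Mrs. John L. Baringer and Mrs. Clyde Newman ."
)

# Assumption: a tiny stand-in for the full English stop word list.
toy_stop_words = {"is", "of", "the", "which", "and"}

# Lowercase, then keep runs of two or more word characters. This drops
# punctuation and single-letter tokens such as the initials "O." and "A.".
tokens = re.findall(r"(?u)\b\w\w+\b", sentence.lower())

# Count every token that is not a stop word.
counts = Counter(t for t in tokens if t not in toy_stop_words)

print(dict(sorted(counts.items())))
```

Words that occur only once, such as "chairman", are still present in the result, while the initials are gone because they are a single character long.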