Skip to content
cjdd3b edited this page Jan 22, 2013 · 59 revisions

In the last section, we explained how we preprocessed the data -- breaking down the problem space, extracting features, and setting things up to find similar donors using a graph-based approach. This section will talk about the next step: using a Random Forest classifier to define the edges in our graph so we can cluster them later.

A quick review of our approach

Now is a good time to review that our approach to finding similar donors basically follows a three-step process:

  • Break our universe of campaign contributions into small groups that are likely to contain donations from the same people (in our example, we split them based on last names).
  • Compare all pairs of donations in each of those groups against each other and link them together if they appear to come from the same person.
  • Linked contributions within a single group should then represent the same donor. So our last step is to find clusters of linked contributions and assign them a unique donor ID.

This section focuses on the second step in that workflow: Looking at each of the contribution pairs within the groups we created in the last section and determining whether they are indeed from the same person. In visual terms, pretend each of the groups we created in the last step looks like this:

Single group

What we're doing here is linking together individual donations if they look like they're from the same person, like this:

Group with links

And the way we'll do that is by using simple machine learning.

Choosing a classifier

Relevant code: tests/classifier_diagnostics.py

For this project, our machine learning muscle is coming from scikit-learn -- a great Python library that saves us from having to implement any significant machine learning algorithms ourselves. It comes with all sorts of supervised learning algorithms, and Step 1 is figuring out which one is best for the task.

Now that we've prepped our data, the classification task is pretty straightforward. Given a pair of contributions, we want our classifier to make a simple binary choice: Are they from the same person or not? They make this determination based on the feature vectors we created earlier, which contain 15 possible dimensions on which two contributors might match: they have similar names; they come from the same ZIP code; their occupations and/or employers are similar; etc.

To start, we're going to enlist the help of five common algorithms to see which one works best. Specifically, we'll be trying decision trees, logistic regression, Gaussian Naive Bayes, random forests and support vector machines. And because our dataset has way more negative examples than positive ones, we'll be using precision, recall, and their combined measure, F1, as our accuracy metric.

Our job in this test is to see how often our classifier makes the same judgments as CRP's process -- something we can do because we're working with pre-classified CRP data, so we know the "right answer" for any given pair classification. We're going to use a technique called k-fold cross validation to see how our classifiers perform across different subsets of our data. Think of it as testing each classifier k times (in our case, 10), using different mixtures of test and training data, and then averaging the results. Here's how the results break down across the entire 100,000-record example set:

Classifier Precision Recall F1 Notes
Decision Tree 0.95 0.95 0.95
Logistic Regression 0.96 0.95 0.95 Using L2 regularization
Gaussian Naive Bayes 0.87 0.98 0.92
Support Vector Machine 0.94 0.96 0.95 Only tested on a 20,000-record set because of scaling issues
Random Forest 0.96 0.96 0.96 Using just 10 estimators for speed purposes. In this case, it turns out that more estimators have little effect.

Most of the classifiers did pretty well, but the winner is the Random Forest -- essentially an ensemble of decision trees that also happens to be the new hotness in a lot of machine learning circles these days. The nitty gritty about how it works is beyond the scope of this writeup, but you can read more about it on the web.

Going forward, we'll be using the Random Forest classifier to find our matching pairs.

Understanding the feature space

Now that we've chosen our classifier, we could, if we wanted to, go back and refine our process a little bit. In testing, we used all of the 15 features we developed to train and test our decision tree -- and it did pretty well. But some features are more useful than others.

Like most of scikit-learn's classifiers, our decision tree has a built-in ability to compute the importance of input features, which lets us make some judgments about which ones to leave in and which ones to take out. Here's how the features look in our existing model:

Feature importances

Whoa. It's pretty clear features 9 and 11 are supplying the bulk of the predictive power to this model, while the rest are just adding some value around the edges. In this case, because of the way features are fed into the vectors, the feature numbers correspond to the order in which the feature functions appear in learn/features.py. Number 9 is matching same_first_name and 11 is same_gender. Meanwhile, four features -- same_zip_region, same_city, same_zip_sectionalcenter and same_zip_cityarea -- probably because most of the geographic information they contain is implied by the zip_sim function, which is also a strong contributor to the model.

Zero-valued features can usually be removed, and doing so can be a good way to optimize your code -- particularly if those features are expensive to calculate. But they aren't doing any harm in this case, so we'll leave them in for simplicity's sake.

A quick word about probabilistic matching

Most of the classifiers built into scikit-learn, including the decision tree we're using, have the ability to return their classification results with some sort of probability attached. In other words, instead of saying two contributions are from the same donor, they can say the classifier is 90 percent confident that's the case. This is a hugely powerful tool for us because it allows us to handle different batches of results in different ways, depending on the classifier's certainty.

In the test batch of data we've been working with, our classifier comes back with the exact same answer as CRP about 99 percent of the time when its certainty is at least 0.9 -- almost a perfect match. And as you'll see later, in many of the instances where the classifier differs from CRP, the classifier is actually right. Below the 0.9 threshold, the number of results that don't match jumps to above 18 percent. In other words, we might feel comfortable taking the classifier at its word when it's almost positive, but if it's having some doubts we might want to review those answers by hand.

Just to continue the point a little further, results in which the classifier is at least 90 percent confident compose about 98.3 percent of our dataset. The other 1.7 percent -- or about 6,500 records -- is small enough that manual review would be totally doable. One disadvantage of the decision tree approach is that the probabilities it returns aren't very granular, so it's harder to segment things in a way that's useful. But still, a little probability is better than total certainty.

Reviewing the classifier's mistakes

So now the classifier has run through our nearly 400,000 potential matches and made a determination about each one: whether it matches CRP's results or it doesn't, and with what certainty. Just like human beings, however, it isn't perfect. So the next step is to look at where the classifier and CRP's results differ, so we can gain some stronger intuition about where it's going wrong.

Remember that the classifier's judgment matches CRP's with an F1 score of 97 percent, which is pretty damn good. That means there are only about 12,000 cases where the two judgments don't line up. Looking at a random sample of XXX records from that subset, these are the types of errors that come up most often:

Clone this wiki locally