Skip to content
cjdd3b edited this page Jan 22, 2013 · 59 revisions

In the last section, we explained how we preprocessed the data -- breaking down the problem space, extracting features, and setting things up to find similar donors using a graph-based approach. This section will talk about the next step: using a Random Forest classifier to define the edges in our graph so we can cluster them later.

A quick review of our approach

Now is a good time to review that our approach to finding similar donors basically follows a three-step process:

  • Break our universe of campaign contributions into small groups that are likely to contain donations from the same people (in our example, we split them based on last names).
  • Compare all pairs of donations in each of those groups against each other and link them together if they appear to come from the same person.
  • Linked contributions within a single group should then represent the same donor. So our last step is to find clusters of linked contributions and assign them a unique donor ID.

This section focuses on the second step in that workflow: Looking at each of the contribution pairs within the groups we created in the last section and determining whether they are indeed from the same person. In visual terms, pretend each of the groups we created in the last step looks like this:

Single group

What we're doing here is linking together individual donations if they look like they're from the same person, like this:

Group with links

And the way we'll do that is by using simple machine learning.

Choosing a classifier

Relevant code: tests/classifier_diagnostics.py

For this project, our machine learning muscle is coming from scikit-learn -- a great Python library that saves us from having to implement any significant machine learning algorithms ourselves. It comes with all sorts of supervised learning algorithms, and Step 1 is figuring out which one is best for the task.

Now that we've prepped our data, the classification task is pretty straightforward. Given a pair of contributions, we want our classifier to make a simple binary choice: Are they from the same person or not? They make this determination based on the feature vectors we created earlier, which contain 15 possible dimensions on which two contributors might match: they have similar names; they come from the same ZIP code; their occupations and/or employers are similar; etc.

To start, we're going to enlist the help of five common algorithms to see which one works best. Specifically, we'll be trying decision trees, logistic regression, Gaussian Naive Bayes, Random Forests and support vector machines. And because our dataset has way more negative examples than positive ones, we'll be using precision, recall, and their combined measure, F1, as our accuracy metric.

Our job in this test is to see how often our classifier makes the same judgments as CRP's process -- something we can do because we're working with pre-classified CRP data, so we know the "right answer" for any given pair classification. We're going to use a technique called k-fold cross validation to see how our classifiers perform across different subsets of our data. Think of it as testing each classifier k times (in our case, 10), using different mixtures of test and training data, and then averaging the results. Here's how the results break down across the entire 100,000-record example set:

Classifier Precision Recall F1 Notes
Decision Tree 0.95 0.95 0.95
Logistic Regression 0.96 0.95 0.95 Using L2 regularization
Gaussian Naive Bayes 0.87 0.98 0.92
Support Vector Machine 0.94 0.96 0.95 Only tested on a 20,000-record set because of scaling issues
Random Forest 0.96 0.96 0.96 Using just 10 estimators for speed purposes. In this case, it turns out that more estimators have little effect.

Most of the classifiers did pretty well, but the winner is the Random Forest -- essentially an ensemble of decision trees that also happens to be the new hotness in a lot of machine learning circles these days. The nitty gritty about how it works is beyond the scope of this writeup, but you can read more about it on the web.

Going forward, we'll be using the Random Forest classifier to find our matching pairs.

Understanding the feature space

Relevant code: tests/classifier_diagnostics.py

Now that we've chosen our classifier, it helps to have some insight on how it interprets our data. A good place to start is understanding which features are the most useful for a given classification task. Like most of scikit-learn's classifiers, our Random Forest has a built-in ability to compute the importance of input features. Let's take a look:

Feature importances

It's pretty clear at first glance that a handful of features are providing the bulk of the predictive power to this model, while the others are just adding some value around the edges. In this case, the most important ones appear to be the contributor_name_similarity, same_gender, zip_sim and same_state functions. One of the features, last_name, provides literally no predictive power because our preprocessing already groups our data by last name, meaning there is literally no last name variance with in a group for the classifier to explain.

On top of being interesting, this knowledge can help guide some of our decision-making going forward. For example, zero-valued or minimally important features can usually be removed -- a step that's particularly useful if a given feature takes a lot of time to compute. Knowing these insights about your classifier can also help guide you toward new useful features to create. For instance, all of the ZIP code features in this set (which are apparently pretty useful) use string similarity as a proxy for distance. The ZIP codes 77019 and 77021 are both pretty close to each other, and that numbering scheme holds up pretty well across ZIP codes everywhere. But for the situations where people file contributions from multiple addresses, such as home and work, another useful feature might be a literal distance measurement between ZIP codes, which could clear up some edge cases.

A quick word about probabilistic matching

Most of the classifiers built into scikit-learn, including the Random Forest we're using, have the ability to return their classification results with some sort of probability attached. In other words, instead of saying two contributions are from the same donor, they can say the classifier is 90 percent confident that's the case. This is a hugely powerful tool for us because it allows us to handle different batches of results in different ways, depending on the classifier's certainty.

In the test batch of data we've been working with, our classifier comes back with the exact same answer as CRP about 99 percent of the time when its certainty is at least 0.9 -- almost a perfect match. And as you'll see later, in many of the instances where the classifier differs from CRP, the classifier is actually right. Below the 0.9 threshold, the number of results that don't match jumps to above 18 percent. In other words, we might feel comfortable taking the classifier at its word when it's almost positive, but if it's having some doubts we might want to review those answers by hand.

Just to continue the point a little further, results in which the classifier is at least 90 percent confident compose about 98.3 percent of our dataset. The other 1.7 percent -- or about 6,500 records -- is small enough that manual review would be totally doable. One disadvantage of the decision tree approach is that the probabilities it returns aren't very granular, so it's harder to segment things in a way that's useful. But still, a little probability is better than total certainty.

Reviewing the classifier's mistakes

So now the classifier has run through our nearly 400,000 potential matches and made a determination about each one: whether it matches CRP's results or it doesn't, and with what certainty. Just like human beings, however, it isn't perfect. So the next step is to look at where the classifier and CRP's results differ, so we can gain some stronger intuition about where it's going wrong.

Remember that the classifier's judgment matches CRP's with an F1 score of 97 percent, which is pretty damn good. That means there are only about 12,000 cases where the two judgments don't line up. Looking at a random sample of XXX records from that subset, these are the types of errors that come up most often:

Clone this wiki locally