
In the last section, we explained how to preprocess the data -- breaking down the problem space, extracting features, and setting things up to find similar donors using a graph-based approach. This section covers the next step: using a Random Forest classifier to define the edges in our graph so we can cluster the linked contributions later.

A quick review of our approach

Now that preprocessing is done, our approach to finding similar donors follows a three-step process (sketched in code after the list):

  • Break our universe of campaign contributions into small groups that are likely to contain donations from the same people (in our example, we split them based on last names).
  • Compare all pairs of donations in each of those groups against each other and link them together if they appear to come from the same person.
  • Linked contributions within a single group should then represent the same donor. So our last step is to find clusters of linked contributions and assign them a unique donor ID.
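
Concretely, the overall shape of that workflow might look something like the sketch below. Every helper here (group_by_last_name, same_donor, cluster_links) is a hypothetical stand-in for the pieces covered in the preprocessing section, this section and the next one, so treat this as an outline of the control flow rather than working code:

```python
from itertools import combinations

def find_donors(contributions):
    """Hypothetical sketch of the full workflow; the helpers are stand-ins."""
    donor_ids = {}
    # Step 1: split contributions into last-name groups (previous section).
    for group in group_by_last_name(contributions):      # hypothetical helper
        links = []
        # Step 2: compare every pair in the group and keep the likely matches
        # (this section) -- same_donor() would wrap the trained classifier.
        for a, b in combinations(group, 2):
            if same_donor(a, b):                          # hypothetical helper
                links.append((a, b))
        # Step 3: cluster linked contributions and assign donor IDs (next section).
        for donor_id, cluster in cluster_links(group, links):  # hypothetical helper
            for contribution in cluster:
                donor_ids[contribution['id']] = donor_id
    return donor_ids
```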

This section focuses on the second step in that workflow: Looking at each of the contribution pairs within the groups we created in the last section and determining whether they are indeed from the same person. In visual terms, pretend each of the groups we created in the last step looks like this:

[Image: Single group]

What we're doing in this step is linking together individual donations if they look like they're from the same person, like this:

[Image: Group with links]

And the way we'll do that is by using simple machine learning.

Choosing a classifier

Relevant code: tests/classifier_diagnostics.py

For this project, our machine learning muscle comes from scikit-learn -- a great Python library that saves us from having to implement any significant machine learning algorithms ourselves. It comes with all sorts of supervised learning algorithms, and the first step is figuring out which one is best suited to the task.

Now that we've prepped our data, the classification task is pretty straightforward. Given a pair of contributions, we want our classifier to make a simple binary choice: Are they from the same person or not? It will make this determination based on the feature vectors we created earlier, which contain 15 possible dimensions on which two contributors might match: they have similar names; they come from the same ZIP code; their occupations and/or employers are similar; etc.

To start, we're going to enlist the help of five common algorithms to see which one works best. Specifically, we'll be trying decision trees, logistic regression, Gaussian Naive Bayes, Random Forests and support vector machines. And because our dataset has way more negative examples than positive ones, plain accuracy would be misleading, so we'll be evaluating each classifier on precision, recall, and their combined measure, F1.
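
As a rough sketch of what that comparison might look like with a current version of scikit-learn (the real test harness lives in tests/classifier_diagnostics.py), assuming a feature matrix X of pair vectors and a label vector y from the preprocessing step, and using the 10-fold cross-validation described just below:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# X is the matrix of pair feature vectors and y the 1/0 "same donor" labels;
# both are assumed to come from the preprocessing step described earlier.
classifiers = {
    'Decision Tree': DecisionTreeClassifier(),
    'Logistic Regression': LogisticRegression(penalty='l2'),
    'Gaussian Naive Bayes': GaussianNB(),
    'Support Vector Machine': SVC(),
    'Random Forest': RandomForestClassifier(n_estimators=10),
}

for name, clf in classifiers.items():
    # 10-fold cross-validation, scoring each fold on precision, recall and F1.
    scores = cross_validate(clf, X, y, cv=10,
                            scoring=['precision', 'recall', 'f1'])
    print('%-25s P=%.2f R=%.2f F1=%.2f' % (
        name,
        scores['test_precision'].mean(),
        scores['test_recall'].mean(),
        scores['test_f1'].mean()))
```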

Our job in this test is to see how often our classifier makes the same judgments as CRP's process -- something we can do because we're working with pre-classified CRP data, so we know the "right answer" for any given pair classification. We're going to use a technique called k-fold cross validation to see how our classifiers perform across different subsets of our data. Think of it as testing each classifier k times (in our case, 10), using different mixtures of test and training data, and then averaging the results. Here's how the results break down across the entire 100,000-record example set:

| Classifier | Precision | Recall | F1 | Notes |
|---|---|---|---|---|
| Decision Tree | 0.95 | 0.95 | 0.95 | |
| Logistic Regression | 0.96 | 0.95 | 0.95 | Using L2 regularization |
| Gaussian Naive Bayes | 0.87 | 0.98 | 0.92 | |
| Support Vector Machine | 0.94 | 0.96 | 0.95 | Only tested on a 20,000-record set because of scaling issues |
| Random Forest | 0.96 | 0.96 | 0.96 | Using just 10 estimators for speed; in this case, more estimators turned out to have little effect |

Most of the classifiers did pretty well, but the winner is the Random Forest -- essentially an ensemble of decision trees that also happens to be the new hotness in a lot of machine learning circles these days. The nitty gritty about how it works is beyond the scope of this writeup, but you can read more about it on the web.

Going forward, we'll be using the Random Forest classifier to find our matching pairs.

Understanding the feature space

Relevant code: tests/classifier_diagnostics.py

Now that we've chosen our classifier, it helps to have some insight on how it interprets our data. A good place to start is understanding which features are the most useful for a given classification task. Like most of scikit-learn's classifiers, our Random Forest has a built-in ability to compute the importance of input features. Let's take a look:

[Image: Feature importances]

It's pretty clear at first glance that a handful of features provide the bulk of the model's predictive power, while the others just add value around the edges. In this case, the most important ones appear to be the contributor_name_similarity, same_gender, zip_sim and same_state features. One of the features, last_name, provides no predictive power at all -- our preprocessing already groups the data by last name, so there is literally no last-name variance within a group for the classifier to explain.
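
For reference, pulling those numbers out is a one-liner once the Random Forest is fitted. A minimal sketch, assuming X, y and a feature_names list (none of which are defined here) come from the preprocessing step:

```python
from sklearn.ensemble import RandomForestClassifier

# X, y and feature_names (the labels for our 15 features) are assumed to
# come from the preprocessing step described earlier.
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X, y)

# feature_importances_ is normalized, so the values sum to 1.0.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print('%-30s %.3f' % (name, importance))
```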

On top of being interesting, this knowledge can help guide some of our decision-making going forward. For example, zero-valued or minimally important features can usually be removed -- a step that's particularly useful if a given feature takes a lot of time to compute. These insights can also point you toward useful new features to create. For instance, all of the ZIP code features in this set (which are apparently pretty useful) use string similarity as a proxy for distance. The ZIP codes 77019 and 77021 are close to each other, and that numbering scheme holds up reasonably well across ZIP codes everywhere. But for situations where people file contributions from multiple addresses, such as home and work, another useful feature might be a literal distance measurement between ZIP codes, which could clear up some edge cases.
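
If we wanted to experiment with that, a literal distance feature might look something like the sketch below. The centroid coordinates here are approximate and hard-coded purely for illustration -- a real version would load them from a ZIP-code-to-coordinates dataset:

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical, hard-coded centroids (approximate lat/lon) for illustration only.
ZIP_CENTROIDS = {
    '77019': (29.75, -95.40),
    '77021': (29.69, -95.36),
}

def zip_distance_km(zip_a, zip_b):
    """Great-circle (haversine) distance between two ZIP code centroids."""
    lat1, lon1 = (radians(x) for x in ZIP_CENTROIDS[zip_a])
    lat2, lon2 = (radians(x) for x in ZIP_CENTROIDS[zip_b])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = Earth's mean radius

print(round(zip_distance_km('77019', '77021'), 1), 'km')
```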

A quick word about probabilistic matching

Most of the classifiers built into scikit-learn, including the Random Forest we're using, have the ability to return their classification results with a probability attached. In other words, instead of saying two contributions are definitely from the same donor, the classifier can say it is 90, or 80, or 70 percent sure that's the case. This is a hugely powerful tool for us because it allows us to handle different batches of results in different ways, depending on the classifier's certainty.
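
In scikit-learn, those probabilities come from the classifier's predict_proba method. A minimal sketch, assuming the fitted classifier clf from earlier and a pair_features matrix (a placeholder name) holding the feature vectors for the pairs we want to classify:

```python
# clf is the fitted Random Forest; pair_features holds the feature vectors
# for the contribution pairs we want to classify (both assumed from earlier).
probabilities = clf.predict_proba(pair_features)

# predict_proba returns one column per class, ordered by clf.classes_;
# with 0/1 labels, the second column is the probability of "same donor."
match_prob = probabilities[:, 1]

is_match = match_prob >= 0.5                  # the classifier's yes/no call
needs_review = is_match & (match_prob < 0.9)  # low-confidence matches we might hand-check
print('Matches:', int(is_match.sum()), '-- flagged for review:', int(needs_review.sum()))
```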

Indeed, the accuracy of our classifier drops off substantially as it becomes less certain. You can see that pattern in the following table:

| Confidence | Total observations (match records) | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.9 or higher | 361,708 | 0.98 | 0.99 | 0.98 |
| 0.8 - 0.9 | 2,813 | 0.84 | 0.86 | 0.85 |
| 0.7 - 0.8 | 1,407 | 0.81 | 0.73 | 0.77 |
| Lower than 0.7 | 1,934 | 0.70 | 0.49 | 0.58 |

The classifier is virtually certain about the overwhelming majority of our pairs (0.9 and above), and its F1 in that band, 0.98, is a couple of points higher than its performance on the dataset at large. Below that, though, performance falls off a cliff, with F1 bottoming out around 0.58 when the confidence score drops below 0.7.

The good news is that the number of records in which our classifier is less than 90 percent confident represents only about 1 percent of our dataset: about 6,100 pairs in total. The number is small enough that we could easily farm those records out for manual review, which would likely bring our overall accuracy up a point or two.

Turning matches into donors

So now the classifier has run through our nearly 400,000 potential matches and made a determination about each one: whether the pair represents a match or not, and with what certainty. In other words, we now have a graph that looks like this for every single last-name group in our dataset:

[Image: Group with links]

The next step, which we'll look at in the next section, is to find clusters of linked nodes and label them as a single donor.
