
Applications and next steps

cjdd3b edited this page Jan 23, 2013 · 25 revisions

At the end of the day, our process did a pretty decent job coming up with results that matched CRP's -- sometimes even finding things that the CRP data got wrong. But the question still remains: So what? What good does it do to work with campaign data at the donor level as opposed to the contribution level? Is it really worth all the trouble?

Of course it is! Here are a few examples of applications that might derive from this process at the local, state and federal levels.

Potential applications

Recall that part of the motivation behind this project was to generalize the standardization of donor names across campaign finance datasets. The main fields in most campaign finance datasets -- local, state or federal -- look pretty much the same: donor name, recipient name, some location information, and often some info about occupation and employer. CRP does a great job cleaning up this data on the national level, and the National Institute for Money in State Politics does something similar for the states, but neither of them is going to be able to standardize your local city council's campaign finance records on demand. So there's one application right there.

But that still leaves a bigger question: What's the point of standardizing this stuff at all? Some of the most instructive and inspirational ideas I've heard along these lines came with a data mining contest we ran late last year. I was with the Center for Investigative Reporting at the time, and we teamed up with IRE to co-host the contest with Kaggle. The point was for data scientists and other non-journalist experts to look at a set of federal campaign finance data and see what kinds of cool analyses might be performed that reporters might not think of. You can see most of the entries here, but here are a few that stood out:

  • The winning entry, by Australia's own Nathaniel Ramm, proposed a tool called a behavior stability index for tracking whether and when a particular donor's giving patterns change over time.

  • Another entry proposed linking donors with Wikipedia pages to enrich donor information with useful metadata. The entry also proposed looking at networks and communities of donors, which could reveal interesting patterns.

  • Another proposed using statistical techniques to detect donor coordination. Although the method was proposed to find illegal coordination between candidates and Super PACs, it could also be adapted to reveal donors who tend to work together, which could lead to new and interesting stories.

Common among all of those ideas is the seemingly obvious notion that political influence is accrued by individuals and institutions -- not contributions. Unless we have a clear picture of what people and groups are doing, how their behavior changes, and what they get in return, the tracking of money's relationship to politics loses a lot of its power.

Done right, being able to standardize donors properly opens up opportunities to enable new analyses, build new visualizations and surface trends that would be impossible with simple contribution-level data alone.
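To make the donor-level idea concrete, here is a minimal sketch of rolling contribution rows up into one profile per donor. It assumes each row already carries a standardized donor ID; the records, the field layout and the `donor_profiles` helper are all made up for illustration and are not part of this project's code.

```python
from collections import defaultdict

# Made-up illustrative records: (donor_id, recipient, amount, year).
# The donor_id is assumed to come from a prior standardization step.
contributions = [
    ("d1", "Candidate A", 500, 2010),
    ("d1", "Candidate A", 250, 2012),
    ("d1", "Candidate B", 1000, 2012),
    ("d2", "Candidate B", 200, 2012),
]


def new_profile():
    """Empty per-donor summary."""
    return {"total": 0, "recipients": set(), "by_year": defaultdict(int)}


def donor_profiles(rows):
    """Roll contribution rows up into one summary per donor."""
    profiles = defaultdict(new_profile)
    for donor_id, recipient, amount, year in rows:
        profile = profiles[donor_id]
        profile["total"] += amount            # lifetime giving
        profile["recipients"].add(recipient)  # who the donor supports
        profile["by_year"][year] += amount    # giving pattern over time
    return profiles


profiles = donor_profiles(contributions)
```

With only contribution-level rows, the three `d1` records look like unrelated gifts. Grouped by donor, they show one person's total giving, the set of recipients and how the amounts shift from year to year, which is the kind of input the contest ideas above would need.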

Next steps

I think this was a pretty successful first attempt at automated donor standardization, but obviously no workflow is perfect -- particularly when it's making judgments based on data that can be flawed or incomplete. The next step for me is to clean up a few things around the edges to see if I can bump up the system's performance another point or so.

At a basic level, there are some easy preprocessing steps that should eliminate some of the more basic mistakes our classifier is making: dealing with zero-padding on ZIP codes, for example, and improving aspects of the name parser we're using, such as its handling of nicknames and certain suffix placements. Knowing now that the CRP data we've been using to train our model contains a few errors of its own, a thorough review of the training data could also be useful.
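As a rough sketch of what those preprocessing steps might look like, the snippet below restores leading zeros on ZIP codes and normalizes nicknames and suffix placement in names. It is not the name parser used in this project: the `NICKNAMES` and `SUFFIXES` tables, both function names and the assumption that names arrive in "LAST, FIRST" order are all hypothetical.

```python
import re

# Hypothetical lookup tables; real ones would be much larger.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_zip(raw):
    """Return a 5-digit ZIP string, restoring leading zeros.

    Assumes raw is a string or an int (not a float).
    """
    digits = re.sub(r"\D", "", str(raw))
    if not digits:
        return ""
    if len(digits) > 5:
        # Assume a ZIP+4 value that may have lost leading zeros.
        return digits.zfill(9)[:5]
    return digits.zfill(5)


def normalize_name(raw):
    """Return (first, last, suffix), assuming 'LAST, FIRST' order.

    The suffix may appear after either name; middle names are ignored.
    """
    tokens = [t.strip(".").lower() for t in raw.replace(",", " ").split()]
    suffix = ""
    remaining = []
    for token in tokens:
        if token in SUFFIXES and not suffix:
            suffix = token
        else:
            remaining.append(token)
    if not remaining:
        return ("", "", suffix)
    last = remaining[0]
    first = remaining[1] if len(remaining) > 1 else ""
    # Map known nicknames to a canonical first name.
    first = NICKNAMES.get(first, first)
    return (first, last, suffix)
```

Under these assumptions, `normalize_name("SMITH JR, BILL")` and `normalize_name("Smith, Bill Jr.")` produce the same tuple, so the two spellings would no longer look like different donors to the classifier.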

On the machine learning front, an analysis of bias vs. variance in the model might also be helpful, although Random Forests are in some ways designed to prevent overfitting and underfitting. A closer look at optimal feature combinations might also be interesting.

And finally, the big next step is to generalize this method to work with any campaign finance data.

That's all for now! Further work I do to generalize this method will be made available here on GitHub. In the meantime, if you have any thoughts or questions, I'm at chase.davis@gmail.com.
