How to test the correctness of the data? #132

Open · adborden opened this issue Jul 26, 2018 · 7 comments

adborden commented Jul 26, 2018

Starting a conversation about our testing strategy: how do we know the data is correct? We want to know when the data is inaccurate. How might we create assurances that our calculations are correct?

In some cases there is an error in the filings themselves, but that is the data we have. Is there anything we can do in that case?

@adborden

Tiered approach

There are several steps in our ETL, from downloading to data cleaning to import, to helper views, to calculations for different entities. We could implement assertions at each level (tier) of the ETL process to catch errors early.

I'm not sure we have a sense of which level our errors are coming from. I suspect it's the last step, the calculations, which is the bulk of the process. I wonder if we can split it into smaller chunks that are simpler and easier to make assertions against.
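
To make the tiered idea concrete, here's a rough sketch of per-tier assertions (illustrative only; the table and column names below are made up, not our actual schema):

```python
# Illustrative sketch only: table and column names are hypothetical, not our schema.
# The idea is that each ETL tier runs a few cheap assertions before handing off
# to the next tier, so a bad value fails at the tier where it first appears.
import sqlite3


def assert_tier(conn, description, query, predicate):
    """Run a single-value query and fail loudly if the predicate doesn't hold."""
    value = conn.execute(query).fetchone()[0]
    if not predicate(value):
        raise AssertionError(f"{description}: got {value!r}")


def check_import_tier(conn):
    # After import: every transaction should have a filer and an amount.
    assert_tier(conn, "transactions missing filer_id",
                "SELECT COUNT(*) FROM transactions WHERE filer_id IS NULL",
                lambda n: n == 0)
    assert_tier(conn, "transactions missing amount",
                "SELECT COUNT(*) FROM transactions WHERE amount IS NULL",
                lambda n: n == 0)


def check_calculation_tier(conn):
    # After calculations: candidate totals should never be negative.
    assert_tier(conn, "negative candidate totals",
                "SELECT COUNT(*) FROM candidate_totals WHERE total_contributions < 0",
                lambda n: n == 0)
```

Failing at the tier where the bad value first shows up would tell us a lot more than a wrong number at the very end.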

@adborden

Sentinels

We keep a set of expected data/calculations and compare them to the actual data/calculations. This is similar to what we were doing manually: diffing the build output and looking for unexpected changes. That was sometimes hard because when the data updates, the numbers change, so it's difficult to tell whether a mismatch comes from new data or from a change in the calculation.

One problem with sentinel checks is that they don't tell you what the error is, only that there is one.
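
A minimal sketch of what I mean (the entity names and numbers here are made up, just to show the shape of the check):

```python
# Illustrative sketch only: the sentinel keys and values below are made up.
# Compare a handful of hand-verified "sentinel" figures against the current build
# and report every mismatch, instead of eyeballing a diff of the whole output.

EXPECTED_SENTINELS = {
    # (entity, field): value verified by hand against a known data snapshot
    ("Example Candidate", "total_contributions"): 12345.67,
    ("Example Candidate", "total_expenditures"): 9876.54,
}


def check_sentinels(actual, tolerance=0.005):
    """`actual` holds the same keys, computed from the current build output."""
    mismatches = []
    for key, expected in EXPECTED_SENTINELS.items():
        got = actual.get(key)
        if got is None or abs(got - expected) > tolerance:
            mismatches.append((key, expected, got))
    return mismatches


if __name__ == "__main__":
    # In practice `actual` would be loaded from the build output (CSV/JSON/DB).
    actual = {("Example Candidate", "total_contributions"): 12345.67,
              ("Example Candidate", "total_expenditures"): 9999.99}
    for key, expected, got in check_sentinels(actual):
        print(f"sentinel mismatch for {key}: expected {expected}, got {got}")
```

It still wouldn't say why a number changed, but at least it points at which number changed.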

sfdoran commented Jul 30, 2018

Thanks @adborden for starting this conversation.

One place we could create an automated comparison: sum the totals across all filing periods and compare that against the year-to-date total on the most recent filing plus any subsequent 24-hour filings.

One source of differences may be that OpenDisclosure sums the total contributions and total expenditures for each filing period. If a committee edits prior transactions but doesn't amend the earlier filings, those totals won't be updated in the data for the earlier periods, yet the edits will affect the overall total reported on subsequent filings.
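
Roughly, the check I have in mind would look like this (illustrative sketch only; the inputs are plain numbers rather than our real schema):

```python
# Illustrative sketch only: inputs are plain numbers, not the real schema.
# Cross-check: the per-period totals, summed, should equal the year-to-date total
# on the most recent regular filing plus any subsequent 24-hour filings.

def reconciles(period_totals, latest_ytd_total, late_filing_amounts, tolerance=0.01):
    """
    period_totals:       contribution totals for each filing period
    latest_ytd_total:    year-to-date total reported on the most recent filing
    late_filing_amounts: amounts from 24-hour filings made after that filing
    """
    summed = sum(period_totals)
    expected = latest_ytd_total + sum(late_filing_amounts)
    return abs(summed - expected) <= tolerance


# Example: three periods summing to 1500, latest YTD of 1400, one late filing of 100.
assert reconciles([500.0, 400.0, 600.0], 1400.0, [100.0])
# A committee that edited earlier transactions without amending would fail the check.
assert not reconciles([500.0, 400.0, 600.0], 1450.0, [100.0])
```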

As for filer error, we do have a disclaimer in our FAQ. But some problems aren't genuine filer errors; rather, they're difficulties caused by the way the data is reported. I think a very generic statement is better, with a link to our note stating that we don't clean the data.

tdooner commented Jul 31, 2018

Agreed, there are a lot of different things we can/should be testing here.

The first tests, which I merged yesterday in #127, seek to ensure that our calculations match our mental models of what is happening. We do this by creating trivial, static test cases that are much easier to reason about. This is purely to verify that our code does what we think it does.
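
For illustration (this isn't the actual code from #127, just the shape of the idea), a trivial static test case looks roughly like:

```python
# Illustrative sketch only, not the code from #127. The idea: feed a tiny,
# hand-built set of transactions through a calculation and assert a result
# we can verify in our heads.

def total_contributions(transactions, committee_id):
    """Toy stand-in for the real calculation: sum amounts for one committee."""
    return sum(t["amount"] for t in transactions if t["committee_id"] == committee_id)


def test_total_contributions_counts_only_the_committee_in_question():
    transactions = [
        {"committee_id": "A", "amount": 100.0},
        {"committee_id": "A", "amount": 50.0},
        {"committee_id": "B", "amount": 999.0},  # should be ignored
    ]
    assert total_contributions(transactions, "A") == 150.0
```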

I'd like to propose that we create a distinction between "tests" and filer QA scripts that can detect filer error.

For the "tests" that we write, the goal should be that for any individual change to OpenDisclosure, all of the tests are passing, which indicates that there are no unexpected changes in behavior.

For filer QA, we can do whatever we want. If there are particular things we want to check and report on somehow, sure, let's do that. But it's not in our power to fix QA issues, so I'd love to separate them conceptually from "tests".

Right now, IMHO, our priority should be to add more test cases, especially as we hit buggy times of the filing season (e.g. when 496/497 data needs to be de-duplicated). Once we have a good set of tests that make sure the numbers we're calculating for ballot measures and candidates are right, that should free Suzanne from having to manually check all the numbers all the time. That'd be nice, wouldn't it? 😎

adborden commented Jul 31, 2018 via email

tdooner commented Aug 1, 2018

QA wishlist:

  • List of Ballot Measure Committees (and which ballot measures they contribute to) that aren't on the spreadsheet
  • Which Filer IDs are re-used multiple times - e.g. 1364564 is both a recipient committee ("Lift Up Oakland for better wages") and a Ballot Measure committee ("Committee to Protect Oakland Renters - Yes on Measure JJ"). A rough sketch of this check is below.
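
For that second item, something along these lines (illustrative only; in practice the pairs would come from our committee spreadsheet or database):

```python
# Illustrative sketch only: in practice the pairs would come from our committee data.
# Flag filer IDs that appear under more than one committee name.
from collections import defaultdict


def reused_filer_ids(committees):
    """`committees` is an iterable of (filer_id, committee_name) pairs."""
    names_by_id = defaultdict(set)
    for filer_id, name in committees:
        names_by_id[filer_id].add(name)
    return {fid: names for fid, names in names_by_id.items() if len(names) > 1}


# The example from above: one ID shared by two differently named committees.
print(reused_filer_ids([
    ("1364564", "Lift Up Oakland for better wages"),
    ("1364564", "Committee to Protect Oakland Renters - Yes on Measure JJ"),
]))
```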

sfdoran commented Sep 5, 2018

@tdooner I can get those lists together for you. Just committees active in the 2016 election and forward, right?
