How to test the correctness of the data? #132

Open · adborden opened this issue Jul 26, 2018 · 7 comments

adborden commented Jul 26, 2018

Starting a conversation about our testing strategy: how do we know the data is correct? We want to know when the data is inaccurate. How might we create assurances that our calculations are correct?

In some cases there is an error in the filings themselves, but that is the data we have. Is there anything we can do in that case?

@adborden

Tiered approach

There are several steps in our ETL, from downloading to data cleaning to import, to helper views, to calculations for different entities. We could implement assertions at each level (tier) of the ETL process to catch errors early.

I'm not sure we have a sense of which level our errors are coming from. I suspect it's the last step, the calculations, which is the bulk of the process. I wonder if we can split it into smaller chunks that are simpler and easier to make assertions against.
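
To make the tiered idea concrete, here's a rough sketch of per-tier assertions (illustrative only; the table and column names below are made up, not our actual schema):

```python
# Illustrative sketch only: table and column names are hypothetical, not our schema.
# The idea is that each ETL tier runs a few cheap assertions before handing off
# to the next tier, so a bad value fails at the tier where it first appears.
import sqlite3


def assert_tier(conn, description, query, predicate):
    """Run a single-value query and fail loudly if the predicate doesn't hold."""
    value = conn.execute(query).fetchone()[0]
    if not predicate(value):
        raise AssertionError(f"{description}: got {value!r}")


def check_import_tier(conn):
    # After import: every transaction should have a filer and an amount.
    assert_tier(conn, "transactions missing filer_id",
                "SELECT COUNT(*) FROM transactions WHERE filer_id IS NULL",
                lambda n: n == 0)
    assert_tier(conn, "transactions missing amount",
                "SELECT COUNT(*) FROM transactions WHERE amount IS NULL",
                lambda n: n == 0)


def check_calculation_tier(conn):
    # After calculations: candidate totals should never be negative.
    assert_tier(conn, "negative candidate totals",
                "SELECT COUNT(*) FROM candidate_totals WHERE total_contributions < 0",
                lambda n: n == 0)
```

Failing at the tier where the bad value first shows up would tell us a lot more than a wrong number at the very end.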

@adborden

Sentinels

We keep a set of expected data/calculations and compare them to the actual data/calculations. This is similar to what we were doing manually: diffing the build output and looking for unexpected changes. That was sometimes hard because when the data updates, the numbers change, so it's difficult to tell whether a mismatch comes from new data or from a change in the calculation.

One problem with sentinel checks is that they don't tell you what the error is, only that there is one.
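
A minimal sketch of what I mean (the entity names and numbers here are made up, just to show the shape of the check):

```python
# Illustrative sketch only: the sentinel keys and values below are made up.
# Compare a handful of hand-verified "sentinel" figures against the current build
# and report every mismatch, instead of eyeballing a diff of the whole output.

EXPECTED_SENTINELS = {
    # (entity, field): value verified by hand against a known data snapshot
    ("Example Candidate", "total_contributions"): 12345.67,
    ("Example Candidate", "total_expenditures"): 9876.54,
}


def check_sentinels(actual, tolerance=0.005):
    """`actual` holds the same keys, computed from the current build output."""
    mismatches = []
    for key, expected in EXPECTED_SENTINELS.items():
        got = actual.get(key)
        if got is None or abs(got - expected) > tolerance:
            mismatches.append((key, expected, got))
    return mismatches


if __name__ == "__main__":
    # In practice `actual` would be loaded from the build output (CSV/JSON/DB).
    actual = {("Example Candidate", "total_contributions"): 12345.67,
              ("Example Candidate", "total_expenditures"): 9999.99}
    for key, expected, got in check_sentinels(actual):
        print(f"sentinel mismatch for {key}: expected {expected}, got {got}")
```

It still wouldn't say why a number changed, but at least it points at which number changed.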

sfdoran commented Jul 30, 2018

Thanks @adborden for starting this conversation.

One place we could create an automated comparison: sum the totals across all filing periods and compare that against the year-to-date total on the most recent filing plus any subsequent 24-hour filings.

One source of differences may be that OpenDisclosure sums the total contributions and total expenditures for each filing period. If a committee edits prior transactions but doesn't amend the earlier filings, those totals won't be updated in the data for the earlier periods, yet the edits will affect the overall total reported on subsequent filings.
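
Roughly, the check I have in mind would look like this (illustrative sketch only; the inputs are plain numbers rather than our real schema):

```python
# Illustrative sketch only: inputs are plain numbers, not the real schema.
# Cross-check: the per-period totals, summed, should equal the year-to-date total
# on the most recent regular filing plus any subsequent 24-hour filings.

def reconciles(period_totals, latest_ytd_total, late_filing_amounts, tolerance=0.01):
    """
    period_totals:       contribution totals for each filing period
    latest_ytd_total:    year-to-date total reported on the most recent filing
    late_filing_amounts: amounts from 24-hour filings made after that filing
    """
    summed = sum(period_totals)
    expected = latest_ytd_total + sum(late_filing_amounts)
    return abs(summed - expected) <= tolerance


# Example: three periods summing to 1500, latest YTD of 1400, one late filing of 100.
assert reconciles([500.0, 400.0, 600.0], 1400.0, [100.0])
# A committee that edited earlier transactions without amending would fail the check.
assert not reconciles([500.0, 400.0, 600.0], 1450.0, [100.0])
```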

As for filer error, we do have a disclaimer in our FAQ. But some problems aren't genuine filer errors; rather, they're difficulties caused by the way the data is reported. I think a very generic statement is better, with a link to our note stating that we don't clean the data.

tdooner commented Jul 31, 2018

Agreed, there are a lot of different things we can/should be testing here.

The first tests, which I merged yesterday in #127, seek to ensure that our calculations match our mental models of what is happening. We do this by creating trivial, static test cases that are much easier to reason about. This is purely to verify that our code does what we think it does.
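
For illustration (this isn't the actual code from #127, just the shape of the idea), a trivial static test case looks roughly like:

```python
# Illustrative sketch only, not the code from #127. The idea: feed a tiny,
# hand-built set of transactions through a calculation and assert a result
# we can verify in our heads.

def total_contributions(transactions, committee_id):
    """Toy stand-in for the real calculation: sum amounts for one committee."""
    return sum(t["amount"] for t in transactions if t["committee_id"] == committee_id)


def test_total_contributions_counts_only_the_committee_in_question():
    transactions = [
        {"committee_id": "A", "amount": 100.0},
        {"committee_id": "A", "amount": 50.0},
        {"committee_id": "B", "amount": 999.0},  # should be ignored
    ]
    assert total_contributions(transactions, "A") == 150.0
```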

I'd like to propose that we create a distinction between "tests" and filer QA scripts that can detect filer error.

For the "tests" that we write, the goal should be that for any individual change to OpenDisclosure, all of the tests are passing, which indicates that there are no unexpected changes in behavior.

For filer QA, we can do whatever we want. If there are particular things we want to check and report on somehow, sure, let's do that. But it's not in our power to fix QA issues, so I'd love to separate them conceptually from "tests".

Right now, IMHO, our priority should be to add more test cases, especially as we hit buggy times of the filing season (e.g. when 496/497 data needs to be de-duplicated). Once we have a good set of tests that make sure the numbers we're calculating for ballot measures and candidates are right, that should free Suzanne from having to manually check all the numbers all the time. That'd be nice, wouldn't it? 😎

adborden commented Jul 31, 2018 via email

tdooner commented Aug 1, 2018

QA wishlist:

  • List of Ballot Measure Committees (and which ballot measures they contribute to) that aren't on the spreadsheet
  • Which Filer IDs are re-used multiple times - e.g. 1364564 is both a recipient committee ("Lift Up Oakland for better wages") and a Ballot Measure committee ("Committee to Protect Oakland Renters - Yes on Measure JJ"). A rough sketch of this check is below.
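
For that second item, something along these lines (illustrative only; in practice the pairs would come from our committee spreadsheet or database):

```python
# Illustrative sketch only: in practice the pairs would come from our committee data.
# Flag filer IDs that appear under more than one committee name.
from collections import defaultdict


def reused_filer_ids(committees):
    """`committees` is an iterable of (filer_id, committee_name) pairs."""
    names_by_id = defaultdict(set)
    for filer_id, name in committees:
        names_by_id[filer_id].add(name)
    return {fid: names for fid, names in names_by_id.items() if len(names) > 1}


# The example from above: one ID shared by two differently named committees.
print(reused_filer_ids([
    ("1364564", "Lift Up Oakland for better wages"),
    ("1364564", "Committee to Protect Oakland Renters - Yes on Measure JJ"),
]))
```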

sfdoran commented Sep 5, 2018

@tdooner I can get those lists together for you. Just committees active in the 2016 election and forward, right?
