Data Quality Tolerance? #95

Open
FStephenQuaratiello opened this issue Jan 24, 2022 · 3 comments
Comments

@FStephenQuaratiello

Hi,

I've been noticing a slight (~1%) discrepancy between the number of records imported to BigQuery with this tool, and the number of requests reported by the Cloudflare GraphQL API for a given time period. For example, the GraphQL API reports 46,532 requests in a given hour, but in BigQuery, there are only 45,736 records with an EdgeStartTimestamp in that hour. A small difference, to be sure, but a noticeable one.

Is this within expectations? And is there a better way to measure the health/quality of data imported by this tool?
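
For reference, the relative gap implied by the two counts quoted above can be computed directly. This is only a minimal sketch; `relative_gap` is a hypothetical helper name, not something the import tool provides.

```python
def relative_gap(reference: int, observed: int) -> float:
    """Return the fraction of reference records that are missing from observed."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return (reference - observed) / reference


# Counts quoted in this issue: GraphQL API vs. BigQuery for the same hour.
gap = relative_gap(46_532, 45_736)
print(f"{gap:.2%}")  # prints 1.71%
```

For the example hour, the gap works out to 796 records.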

@shagamemnon
Contributor

Hey @FStephenQuaratiello, would you mind providing the GraphQL query and the BigQuery SQL query that you ran, so we can investigate further?

@FStephenQuaratiello
Author

FStephenQuaratiello commented Jan 26, 2022

Sure thing:

BigQuery query:
'''
SELECT COUNT(*), EXTRACT(HOUR FROM EdgeStartTimestamp) AS hour
FROM [TABLE]
WHERE EdgeStartTimestamp > TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND EdgeStartTimestamp < TIMESTAMP(CURRENT_DATE())
  AND ARRAY_TO_STRING(ARRAY_REVERSE([
        ARRAY_REVERSE(SPLIT(ClientRequestHost, "."))[ORDINAL(1)],
        ARRAY_REVERSE(SPLIT(ClientRequestHost, "."))[ORDINAL(2)]
      ]), ".") = '%s'
GROUP BY hour
ORDER BY hour
'''

GraphQL query:
"""
query {
  viewer {
    zones(filter: {zoneTag: "%s"}) {
      httpRequests1hGroups(limit: 24, filter: {date: "%s"}) {
        sum { requests }
        dimensions { datetime }
      }
    }
  }
}
"""
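
The host filter in the BigQuery query above is dense, so here is a rough Python equivalent of what the `SPLIT`/`ARRAY_REVERSE` expression appears to compute: the last two dot-separated labels of `ClientRequestHost`. The function name `last_two_labels` is hypothetical and only for illustration.

```python
def last_two_labels(host: str) -> str:
    """Keep the last two dot-separated labels of a hostname.

    Mirrors the SPLIT / ARRAY_REVERSE / ARRAY_TO_STRING expression in the
    BigQuery query from this thread.
    """
    labels = host.split(".")
    if len(labels) < 2:
        # In BigQuery, [ORDINAL(2)] on a one-element array raises an
        # out-of-bounds error, so single-label hosts are rejected here too.
        raise ValueError(f"host has fewer than two labels: {host!r}")
    return ".".join(labels[-2:])


print(last_two_labels("www.example.com"))   # prints example.com
print(last_two_labels("a.b.example.com"))   # prints example.com
```

One caveat with this approach: a host under a multi-label suffix, such as `www.example.co.uk`, reduces to `co.uk`, so it would not match a zone name of `example.co.uk`.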

@victor-perov

Hey @FStephenQuaratiello!
Thanks for sharing your queries.
I'm not particularly familiar with this tool, but I can help with the GraphQL part.

httpRequests1hGroups contains hourly aggregates of eyeball requests (requests from end-user clients). So if you want to compare it with another source, make sure you're counting only eyeball requests there as well.
On top of that, if your query covers "today", I would expect the last hour to be incomplete, because the aggregation is buffered.
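
A comparison along those lines could be sketched as follows. This is an illustration under assumptions, not part of the tool: `compare_hourly` and the one-hour `settle` window are hypothetical, and both inputs are assumed to be dicts mapping the UTC start of each hour to a request count (with the BigQuery side already restricted to eyeball requests).

```python
from datetime import datetime, timedelta, timezone


def compare_hourly(graphql_counts, bigquery_counts, now, settle=timedelta(hours=1)):
    """Compare per-hour counts, skipping hours that may still be buffering.

    An hour is compared only if it ended at least `settle` before `now`.
    The length of `settle` is an assumption; tune it to your observations.
    Returns a list of (hour_start, expected, actual, relative_gap) tuples.
    """
    rows = []
    for hour_start in sorted(graphql_counts):
        hour_end = hour_start + timedelta(hours=1)
        if hour_end + settle > now:
            continue  # too recent: the aggregate may be incomplete
        expected = graphql_counts[hour_start]
        actual = bigquery_counts.get(hour_start, 0)
        gap = (expected - actual) / expected if expected else 0.0
        rows.append((hour_start, expected, actual, gap))
    return rows


# Illustrative input: the counts quoted in this issue, placed at an
# arbitrary hour; the timestamps themselves are made up for the example.
hour = datetime(2022, 1, 25, 9, 0, tzinfo=timezone.utc)
now = datetime(2022, 1, 26, 12, 30, tzinfo=timezone.utc)
for row in compare_hourly({hour: 46_532}, {hour: 45_736}, now):
    print(row[0].isoformat(), row[1], row[2], f"{row[3]:.2%}")
```

Keying both sides by the full hour timestamp, rather than by hour of day alone, also avoids mixing hours from different dates.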
