Backfills & deprecating legacy tables #10
response_bodies
requests backfill & deprecating response_bodies
As we can't join the requests and summary_requests tables, I've recalculated the summary metrics using the UDF.
@tunetheweb thanks to a few hints in the comments, I think I've resolved all the missing metrics for the backfill. I've also added the new dataset to the crawl pipeline as the new default. All the reports are dependents of
Closes HTTPArchive/data-pipeline#26
Closes HTTPArchive/data-pipeline#138
Deprecation plan
all.requests
2011-06 - 2015-12
definitions/output/all/backfill_summary_requests.js
2016-01 - 2022-02
definitions/output/all/backfill_requests.js
Sources used:
requests
1st day of month: 2016-01-01 till 2022-06-01
15th day of month: 2016-01-15 till 2018-12-15
response_bodies
1st day of month: 2016-01-01 till 2020-11-01(mobile)
15th day of month: 2016-01-15 till 2018-12-15
summary_requests
1st day of month: 2011-06-01 till 2015-12-01
15th day of month: 2011-06-15 till 2015-12-15
all.pages
definitions/output/all/backfill_pages.js
definitions/output/all/backfill_summary_pages.js
Sources used:
summary_pages
1st day of month: 2011-06-01 till 2015-12-01
15th day of month: 2011-06-15 till 2015-12-15
pages
1st day of month: 2016-01-01 till 2022-06-01
15th day of month: 2016-01-15 till 2018-12-15
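For reference, the crawl dates covered by the ranges above can be enumerated with a small helper. This is only a sketch: `crawlDates` and `pagesDates` are hypothetical names and are not taken from the backfill definitions in this PR.

```javascript
// Hypothetical helper (not part of this PR): list crawl dates as YYYY-MM-DD
// for one day of the month, from startMonth to endMonth inclusive ('YYYY-MM').
function crawlDates(startMonth, endMonth, day) {
  const [startYear, startMon] = startMonth.split('-').map(Number);
  const [endYear, endMon] = endMonth.split('-').map(Number);
  const dd = String(day).padStart(2, '0');
  const dates = [];
  let year = startYear;
  let month = startMon;
  while (year < endYear || (year === endYear && month <= endMon)) {
    dates.push(`${year}-${String(month).padStart(2, '0')}-${dd}`);
    month += 1;
    if (month > 12) {
      month = 1;
      year += 1;
    }
  }
  return dates;
}

// Ranges copied from the "Sources used" list for the pages source.
// ISO dates sort correctly as plain strings.
const pagesDates = [
  ...crawlDates('2016-01', '2022-06', 1),
  ...crawlDates('2016-01', '2018-12', 15),
].sort();

console.log(pagesDates.length, pagesDates[0], pagesDates[pagesDates.length - 1]);
```

The same helper would cover the summary_pages and summary_requests ranges by passing '2011-06' and '2015-12' for both days of the month.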
Summary schema evolution:
Known issues
Missing partitions/clusters
pages: 2016-03-15(mobile)
summary_pages: 2013-07-15, 2013-12-01, 2014-06-15(mobile), 2015-06-01(mobile), 2015-09-01(mobile)
summary_requests: 2013-12-01, 2013-07-15, 2014-06-15(mobile), 2015-06-01(mobile), 2015-09-01(mobile)
response_bodies: from 2020-11-01(desktop) till 2022-06-01(desktop)
Wherever the individual tables are missing the queries will fail and I'll need to re-run manually without a JOIN.
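One way to avoid failing runs could be to check the known gaps before building a query and to drop the missing table from the JOIN up front. The sketch below is an illustration only: `KNOWN_GAPS`, `isMissing` and `joinableTables` are hypothetical names, and treating an entry without a client as "missing for every client" is my assumption. It covers only the gaps listed above, not the overall source date ranges.

```javascript
// Hypothetical lookup of the missing partitions listed above (not part of this PR).
// Assumption: an entry with no client (client === null) is missing for all clients.
const single = (date, client = null) => ({ from: date, to: date, client });

const SUMMARY_GAPS = [
  single('2013-07-15'),
  single('2013-12-01'),
  single('2014-06-15', 'mobile'),
  single('2015-06-01', 'mobile'),
  single('2015-09-01', 'mobile'),
];

const KNOWN_GAPS = {
  pages: [single('2016-03-15', 'mobile')],
  summary_pages: SUMMARY_GAPS,
  summary_requests: SUMMARY_GAPS,
  response_bodies: [{ from: '2020-11-01', to: '2022-06-01', client: 'desktop' }],
};

// True when the table has a known gap for this crawl date and client.
// ISO dates compare correctly as plain strings.
function isMissing(table, date, client) {
  return (KNOWN_GAPS[table] || []).some(
    (gap) =>
      date >= gap.from &&
      date <= gap.to &&
      (gap.client === null || gap.client === client)
  );
}

// Keep only the tables that can take part in a JOIN for this date and client.
function joinableTables(tables, date, client) {
  return tables.filter((table) => !isMissing(table, date, client));
}

console.log(joinableTables(['requests', 'response_bodies'], '2020-11-01', 'desktop'));
```

A backfill definition could then choose the JOIN or no-JOIN variant of the query depending on whether the returned list still contains every table it needs.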
More issues related to historical data: