Backfills & deprecating legacy tables #10

Merged
merged 59 commits into from
Nov 1, 2024


max-ostapenko
Contributor

@max-ostapenko max-ostapenko commented Sep 19, 2024

Closes HTTPArchive/data-pipeline#26
Closes HTTPArchive/data-pipeline#138

Deprecation plan


all.requests

  • 2011-06 - 2015-12 definitions/output/all/backfill_summary_requests.js
  • 2016-01 - 2022-02 definitions/output/all/backfill_requests.js

Sources used:

  • requests
    1st day of month: 2016-01-01 till 2022-06-01
    15th day of month: 2016-01-15 till 2018-12-15

  • response_bodies
    1st day of month: 2016-01-01 till 2020-11-01(mobile)
    15th day of month: 2016-01-15 till 2018-12-15

  • summary_requests
    1st day of month: 2011-06-01 till 2015-12-01
    15th day of month: 2011-06-15 till 2015-12-15

all.pages

  • 2016-01 - 2022-02 definitions/output/all/backfill_pages.js
  • 2011-06 - 2015-12 definitions/output/all/backfill_summary_pages.js

Sources used:

  • summary_pages
    1st day of month: 2011-06-01 till 2015-12-01
    15th day of month: 2011-06-15 till 2015-12-15

  • pages
    1st day of month: 2016-01-01 till 2022-06-01
    15th day of month: 2016-01-15 till 2018-12-15
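
The source ranges above can be expanded into explicit crawl dates. The following is a minimal sketch, not code from this PR: `crawlDates` is a hypothetical helper, and it assumes every month inside a range has a crawl (the known gaps are listed separately under "Known issues").

```javascript
// Hypothetical helper (not part of the repository): expand the
// "1st day of month" and "15th day of month" ranges into explicit dates.
// Each range is a [start, end] pair of 'YYYY-MM-DD' strings.
function crawlDates({ firstOfMonth, midMonth }) {
  const dates = [];
  const addRange = ([start, end]) => {
    const day = start.slice(8); // '01' or '15'
    let [year, month] = start.split('-').map(Number);
    const [endYear, endMonth] = end.split('-').map(Number);
    while (year < endYear || (year === endYear && month <= endMonth)) {
      dates.push(`${year}-${String(month).padStart(2, '0')}-${day}`);
      month += 1;
      if (month > 12) {
        month = 1;
        year += 1;
      }
    }
  };
  if (firstOfMonth) addRange(firstOfMonth);
  if (midMonth) addRange(midMonth);
  // ISO dates sort correctly as plain strings.
  return dates.sort();
}

// Ranges copied from the summary_pages entry above.
const summaryPagesDates = crawlDates({
  firstOfMonth: ['2011-06-01', '2015-12-01'],
  midMonth: ['2011-06-15', '2015-12-15'],
});
```

A list like this could drive one backfill run per date, with the missing tables filtered out beforehand.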

Summary schema evolution:

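-- Group the legacy tables by schema to see when the column set changed.
SELECT
  -- Drop the "CREATE TABLE ..." line so identical column lists compare equal.
  REGEXP_REPLACE(ddl, 'CREATE TABLE .+\n', '') AS schema,
  MIN(table_name) AS min,
  MAX(table_name) AS max,
  ARRAY_AGG(DISTINCT REGEXP_EXTRACT(table_name, r'\d+_\d+_\d+') ORDER BY REGEXP_EXTRACT(table_name, r'\d+_\d+_\d+') ASC) AS crawl_dates,
  COUNT(1) AS table_count
FROM summary_requests.INFORMATION_SCHEMA.TABLES
GROUP BY schema
ORDER BY min ASC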
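
To illustrate what the query returns, here is the same grouping applied to a few in-memory rows. This is only a sketch: `groupBySchema` is a hypothetical function, and the table names and DDL strings are made up for illustration, not real `INFORMATION_SCHEMA.TABLES` output.

```javascript
// Made-up sample rows; real rows would come from INFORMATION_SCHEMA.TABLES.
const sampleTables = [
  { table_name: '2011_06_01_desktop', ddl: 'CREATE TABLE t1\n(url STRING)' },
  { table_name: '2011_06_15_desktop', ddl: 'CREATE TABLE t2\n(url STRING)' },
  { table_name: '2012_01_01_desktop', ddl: 'CREATE TABLE t3\n(url STRING, type STRING)' },
  { table_name: '2012_01_01_mobile', ddl: 'CREATE TABLE t4\n(url STRING, type STRING)' },
];

// Hypothetical JavaScript equivalent of the GROUP BY in the query.
function groupBySchema(rows) {
  const groups = new Map();
  for (const { table_name, ddl } of rows) {
    // Drop the "CREATE TABLE <name>" line so only the column list is compared.
    const schema = ddl.replace(/CREATE TABLE .+\n/g, '');
    const match = table_name.match(/\d+_\d+_\d+/);
    const group = groups.get(schema) || { schema, names: [], dates: new Set() };
    group.names.push(table_name);
    if (match) group.dates.add(match[0]);
    groups.set(schema, group);
  }
  return [...groups.values()]
    .map(({ schema, names, dates }) => {
      const sorted = [...names].sort();
      return {
        schema,
        min: sorted[0],
        max: sorted[sorted.length - 1],
        dates: [...dates].sort(),
        count: names.length,
      };
    })
    // Same ordering as ORDER BY min ASC.
    .sort((a, b) => (a.min < b.min ? -1 : a.min > b.min ? 1 : 0));
}

const schemaGroups = groupBySchema(sampleTables);
```

Each resulting group corresponds to one schema version, with the first and last table that used it.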

Known issues

Missing partitions/clusters

  • pages: 2016-03-15(mobile)
  • summary_pages: 2013-07-15, 2013-12-01, 2014-06-15(mobile), 2015-06-01(mobile), 2015-09-01(mobile)
  • summary_requests: 2013-12-01, 2013-07-15, 2014-06-15(mobile), 2015-06-01(mobile), 2015-09-01(mobile)
  • response_bodies: from 2020-11-01(desktop) till 2022-06-01(desktop)

Wherever individual tables are missing, the queries will fail and I'll need to re-run them manually without a JOIN.
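
One way to find the affected runs ahead of time is to check each crawl date and client against the list of missing tables. The sketch below is an illustration under assumptions, not the PR's implementation: `isMissing` and `planRun` are hypothetical helpers, and a missing date without a client label is assumed to affect both desktop and mobile.

```javascript
// Missing tables copied from the list above. An entry without a client
// suffix is ASSUMED to be missing for both desktop and mobile.
const missingTables = {
  summary_pages: ['2013-07-15', '2013-12-01', '2014-06-15:mobile', '2015-06-01:mobile', '2015-09-01:mobile'],
  summary_requests: ['2013-12-01', '2013-07-15', '2014-06-15:mobile', '2015-06-01:mobile', '2015-09-01:mobile'],
};

// Hypothetical check: is the table for this source, date and client missing?
function isMissing(source, date, client) {
  const entries = missingTables[source] || [];
  return entries.includes(date) || entries.includes(`${date}:${client}`);
}

// Hypothetical planner: a run can use the JOIN only if every source it
// joins exists for that date and client; otherwise it needs a manual re-run.
function planRun(date, client, sources) {
  const absent = sources.filter((source) => isMissing(source, date, client));
  return { date, client, canJoin: absent.length === 0, absent };
}

const plan = planRun('2014-06-15', 'mobile', ['summary_pages', 'summary_requests']);
```

Runs where `canJoin` is false are the ones that would need the manual, JOIN-free query.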

More issues related to historical data:

@max-ostapenko max-ostapenko changed the title Deprecating response_bodies requests backfill & deprecating response_bodies Sep 19, 2024
@max-ostapenko max-ostapenko marked this pull request as ready for review September 27, 2024 20:24
@max-ostapenko max-ostapenko marked this pull request as draft September 27, 2024 20:51
@max-ostapenko max-ostapenko mentioned this pull request Sep 27, 2024
4 tasks
@max-ostapenko max-ostapenko changed the title requests backfill & deprecating response_bodies Backfills & deprecating legacy tables Sep 27, 2024
@max-ostapenko max-ostapenko marked this pull request as ready for review October 20, 2024 14:29
@max-ostapenko
Contributor Author

As we can't join the requests and summary_requests tables, I've recalculated the summary metrics using the UDF.

@max-ostapenko
Contributor Author

max-ostapenko commented Oct 22, 2024

@tunetheweb thanks to a few hints in the comments, I think I've resolved all the missing metrics for the backfill.

I've also added the new dataset to the crawl pipeline as the new default. All the reports now depend on crawl. The legacy tables and the all dataset will be updated in parallel.

all.parsed_css will be a snapshot of crawl.parsed_css to avoid duplication.

[Image: Screenshot 2024-10-22 at 17 38 32]

@max-ostapenko max-ostapenko merged commit 38d9f01 into main Nov 1, 2024
15 checks passed
@max-ostapenko max-ostapenko deleted the fiscal-owl branch November 1, 2024 09:13
3 participants