Commit: Datasets and schema updates (#15)
* pruned summary

* custom metrics split

* cleaned up payload

* summary pruned

* structs

* queries update

* update schemas

* update schema

* examples updated

* types cleanup

* crawl and page ids removed

* page ids removed from metadata

* underscore

* doc updates

* image updates

* 1 week crawl queue

* query formatting

* title fix

* query result updates

* routines updates

* formatting

* move image
max-ostapenko authored Nov 20, 2024
1 parent 49ddadf commit a61dc0b
Showing 37 changed files with 25,141 additions and 24,821 deletions.
6 changes: 3 additions & 3 deletions astro.config.mjs
@@ -26,11 +26,11 @@ export default defineConfig({
}
],
social: {
github: 'https://github.com/rviscomi/har.fyi',
github: 'https://github.com/HTTPArchive/har.fyi',
twitter: 'https://twitter.com/HTTPArchive',
},
editLink: {
baseUrl: 'https://github.com/rviscomi/har.fyi/edit/main/'
baseUrl: 'https://github.com/HTTPArchive/har.fyi/edit/main/'
},
sidebar: [
{
@@ -40,7 +40,7 @@ export default defineConfig({
{ label: 'Minimizing query costs', link: '/guides/minimizing-costs/' },
{ label: 'Guided tour', link: '/guides/guided-tour/' },
{ label: 'Release cycle', link: '/guides/release-cycle/' },
{ label: 'Migrate queries to `all` dataset', link: '/guides/migrating-to-all-dataset/' },
{ label: 'Migrate queries to `crawl` dataset', link: '/guides/migrating-to-crawl-dataset/' },
],
},
{
Binary file added src/content/docs/guides/bigquery-pages.png
Binary file modified src/content/docs/guides/bigquery-query-in-a-new-tab.png
Binary file removed src/content/docs/guides/bigquery-summary_pages.png
File renamed without changes.


8 changes: 4 additions & 4 deletions src/content/docs/guides/guided-tour.mdx
@@ -17,13 +17,13 @@ If you are new to BigQuery, then the [Getting Started guide](../getting-started/
Migration Guides:

- If you are looking to adapt older HTTP Archive queries, written in [Legacy SQL](https://cloud.google.com/bigquery/docs/reference/legacy-sql), then you may find this [migration guide](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql) helpful.*
- If you've been working with the deprecated dataset `pages` or `requests`, there is a guide on [migrating your queries to the `all` dataset](/guides/migrating-to-all-dataset/).
- If you've been working with the deprecated dataset `pages` or `requests`, there is a guide on [migrating your queries to the `crawl` dataset](/guides/migrating-to-crawl-dataset/).

This guide is split into multiple sections, each one focusing on different tables in the HTTP Archive. Each section builds on top of the previous one:

1. [Exploring the `httparchive.all.pages` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
2. [Exploring the `httparchive.all.requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)
1. [Exploring the `httparchive.crawl.pages` tables](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb)
2. [Exploring the `httparchive.crawl.requests` tables](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb)
3. [JOINing `pages` and `requests` tables](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb)

:::caution
HTTP Archive uses clustered tables. BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) accuracy of estimations for bytes to be processed when querying clustered tables. For your information the actual bytes processed amount is provided in a comment for each query.
@@ -1,17 +1,17 @@
---
title: Migrate queries to `all` dataset
title: Migrate queries to `crawl` dataset
description: Assisting with query migration to the new dataset
---

import { Tabs, TabItem } from '@astrojs/starlight/components';

New tables have been introduced in the HTTP Archive dataset, which are more efficient and easier to use. The `all` dataset contains all the data from the previous `pages`, `requests`, and other datasets. This guide will help you migrate your queries to the new dataset.
New tables have been introduced in the HTTP Archive dataset, which are more efficient and easier to use. The `crawl` dataset contains all the data from the previous `pages`, `requests`, and other datasets. This guide will help you migrate your queries to the new dataset.

## Migrating to `all.pages`
## Migrating to `crawl.pages`

### Page data schemas comparison

previously | `all.pages`
previously | `crawl.pages`
---|---
date in a table name | [`date`](/reference/tables/pages/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/pages/#client)
@@ -41,8 +41,9 @@ SELECT
type,
id
FROM `httparchive.blink_features.features`
WHERE yyyymmdd = DATE('2024-05-01')
AND client = 'desktop'
WHERE
yyyymmdd = DATE('2024-05-01') AND
client = 'desktop'
```
</TabItem>
<TabItem label="After">
@@ -52,11 +53,12 @@
features.feature,
features.type,
features.id
FROM `httparchive.all.pages`,
FROM `httparchive.crawl.pages`,
UNNEST (features) AS features
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -77,11 +79,12 @@ FROM `httparchive.lighthouse.2024_06_01_desktop`
/* This query will process 17 TB when run. */
SELECT
page,
JSON_QUERY(lighthouse, '$.audits.largest-contentful-paint.numericValue') AS LCP,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
lighthouse.audits.`largest-contentful-paint`.numericValue AS LCP,
FROM `httparchive.crawl.pages`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -107,10 +110,11 @@ SELECT
client,
wptid,
-- JSON with the results of the custom metrics,
JSON_QUERY(custom_metrics, '$.privacy') AS custom_metrics,
FROM `httparchive.all.pages`
WHERE date = '2022-06-01'
AND is_root_page
custom_metrics.privacy AS custom_metrics,
FROM `httparchive.crawl.pages`
WHERE
date = '2022-06-01' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -125,31 +129,26 @@ SELECT
COUNT(0) pages,
ROUND(AVG(reqTotal),2) avg_requests,
FROM `httparchive.summary_pages.2024_06_01_desktop`
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
GROUP BY numDomains
HAVING pages > 1000
ORDER BY numDomains ASC
```
</TabItem>
<TabItem label="After">
```sql
/* This query will process 110 GB when run. */
SELECT
CAST(JSON_VALUE(summary, '$.numDomains') AS INT64) AS numDomains,
INT64(summary.numDomains) AS numDomains,
COUNT(0) pages,
ROUND(AVG(CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64)),2) as avg_requests,
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
GROUP BY
numDomains
HAVING
pages > 1000
ORDER BY
numDomains ASC
ROUND(AVG(INT64(summary.reqTotal)),2) as avg_requests,
FROM `httparchive.crawl.pages`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
GROUP BY numDomains
HAVING pages > 1000
ORDER BY numDomains ASC
```
</TabItem>
</Tabs>
@@ -175,21 +174,22 @@ SELECT
technologies.categories,
technologies.technology,
technologies.info
FROM `httparchive.all.pages`,
FROM `httparchive.crawl.pages`,
UNNEST (technologies) AS technologies
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```

</TabItem>
</Tabs>

## Migrating to `all.requests`
## Migrating to `crawl.requests`

### Request data schemas comparison

previously | `all.requests`
previously | `crawl.requests`
---|---
date in a table name | [`date`](/reference/tables/requests/#date)
client as `_TABLE_SUFFIX` | [`client`](/reference/tables/requests/#client)
@@ -218,22 +218,24 @@ SELECT
JSON_VALUE(request_headers, '$.value') AS header_value,
FROM `httparchive.almanac.requests`,
UNNEST(JSON_QUERY_ARRAY(request_headers)) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND firstHtml
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
firstHtml
```
</TabItem>
<TabItem label="After">
```sql
SELECT
LOWER(request_headers.name) AS header_name,
request_headers.value AS header_value,
FROM `httparchive.all.requests`,
FROM `httparchive.crawl.requests`,
UNNEST(request_headers) AS request_headers
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_main_document
AND is_root_page
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_main_document AND
is_root_page
```
</TabItem>
</Tabs>
@@ -256,12 +258,13 @@ FROM `httparchive.requests.2024_06_01_desktop`
SELECT
page,
url,
JSON_VALUE(summary, '$.mimeType') AS mimeType,
CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64) AS respBodySize,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
STRING(summary.mimeType) AS mimeType,
INT64(summary.respBodySize) AS respBodySize,
FROM `httparchive.crawl.requests`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -286,10 +289,11 @@ SELECT
page,
url,
BYTE_LENGTH(response_body) AS bodySize
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
FROM `httparchive.crawl.requests`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
```
</TabItem>
</Tabs>
@@ -313,12 +317,13 @@ ORDER BY responseSize100KB ASC
```sql
/* This query will process 10 TB when run. */
SELECT
ROUND(CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64)/1024/100)*100 AS responseSize100KB,
ROUND(INT64(summary.respBodySize)/1024/100)*100 AS responseSize100KB,
COUNT(0) requests,
FROM `httparchive.all.requests`
WHERE date = '2024-06-01'
AND client = 'desktop'
AND is_root_page
FROM `httparchive.crawl.requests`
WHERE
date = '2024-06-01' AND
client = 'desktop' AND
is_root_page
GROUP BY responseSize100KB
HAVING responseSize100KB > 0
ORDER BY responseSize100KB ASC
28 changes: 14 additions & 14 deletions src/content/docs/guides/minimizing-costs.md
@@ -9,10 +9,10 @@ The HTTP Archive dataset is large and complex, and it's easy to write queries th

Table | Partitioned by | Clustered by
--- | --- | ---
`httparchive.all.pages` | `date` | `client`<br>`is_root_page`<br>`rank`
`httparchive.all.requests` | `date` | `client`<br>`is_root_page`<br>`is_main_document`<br>`type`
`httparchive.crawl.pages` | `date` | `client`<br>`is_root_page`<br>`rank`<br>`page`
`httparchive.crawl.requests` | `date` | `client`<br>`is_root_page`<br>`is_main_document`<br>`type`

For example, the `httparchive.all.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) by `date` and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) by the `client`, `is_root_page`, and `rank` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.
For example, the `httparchive.crawl.pages` table is [partitioned](https://cloud.google.com/bigquery/docs/partitioned-tables) by `date` and [clustered](https://cloud.google.com/bigquery/docs/clustered-tables) by the `client`, `is_root_page`, `rank`, and `page` columns, which means that queries that filter on these columns will be much faster and cheaper than queries that don't.
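As an illustrative sketch (the URL below is a made-up placeholder, and the exact bytes scanned will vary), a single-site lookup that filters on the partition column and all four clustered columns only needs to scan a small slice of the table:

```sql
SELECT
  page,
  rank
FROM `httparchive.crawl.pages`
WHERE
  date = '2024-06-01' AND
  client = 'desktop' AND
  is_root_page AND
  page = 'https://www.example.com/'  -- hypothetical page URL
```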

:::caution
BigQuery [doesn't guarantee](https://cloud.google.com/bigquery/docs/clustered-tables#clustered_table_pricing:~:text=BigQuery%20might%20not%20be%20able%20to%20accurately%20estimate%20the%20bytes%20to%20be%20processed) accuracy of estimations for 'Bytes processed' when querying clustered tables ([Issue Link](https://issuetracker.google.com/issues/176795805)). The actual data volume may be smaller than the amount provided in the estimate.
@@ -27,7 +27,7 @@ Filter by the top 1k websites. This is the smallest rank bucket and will result
SELECT
page
FROM
`httparchive.all.pages`
`httparchive.crawl.pages`
WHERE
date = '2023-05-01' AND
client = 'desktop' AND
@@ -44,9 +44,9 @@ For example, without `TABLESAMPLE`:

```sql
SELECT
JSON_VALUE(custom_metrics, '$.avg_dom_depth') AS dom_depth
custom_metrics.other.avg_dom_depth
FROM
`httparchive.all.pages`
`httparchive.crawl.pages`
WHERE
date = '2023-05-01' AND
client = 'desktop'
@@ -58,9 +58,9 @@ However, the same query with `TABLESAMPLE` at 0.01% is much cheaper:

```sql
SELECT
JSON_VALUE(custom_metrics, '$.avg_dom_depth') AS dom_depth
custom_metrics.other.avg_dom_depth
FROM
`httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
`httparchive.crawl.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
WHERE
date = '2023-05-01' AND
client = 'desktop'
@@ -77,9 +77,9 @@ For example, this query still processes 6.56 TB:

```sql
SELECT
JSON_VALUE(custom_metrics, '$.avg_dom_depth') AS dom_depth
custom_metrics.other.avg_dom_depth
FROM
`httparchive.all.pages`
`httparchive.crawl.pages`
WHERE
date = '2023-05-01' AND
client = 'desktop'
@@ -91,16 +91,16 @@

## Use the `sample_data` dataset

The `sample_data` dataset contains 1k and 10k subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.
The `sample_data` dataset contains 10k subsets of the full pages and requests tables. These tables are useful for testing queries before running them on the full dataset, without the risk of incurring a large query cost.

Table names correspond to their full-size counterparts of the form `[table]_[client]_10k` for the legacy tables or `[table]_1k` for the newer `all.pages` and `all.requests` tables. For example, to query the summary data for the subset of 10k pages, you would use the `httparchive.sample_data.summary_pages_desktop_10k` table.
Table names correspond to their full-size counterparts of the form `[table]_10k` for the `crawl.pages` and `crawl.requests` tables. For example, to query the summary data for the subset of 10k pages, you would use the `httparchive.sample_data.pages_10k` table.
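For instance (a sketch only — adjust the selected columns to mirror whatever query you plan to run at full scale), you can prototype a query shape against the sample table before pointing it at `httparchive.crawl.pages`:

```sql
-- Prototype against the 10k sample; swap in `httparchive.crawl.pages` once the query looks right
SELECT
  client,
  COUNT(0) AS pages
FROM `httparchive.sample_data.pages_10k`
WHERE is_root_page
GROUP BY client
```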

## Use table previews

BigQuery allows you to preview entire rows of a table without incurring a query cost. This is useful for getting a rough idea of the data in a table before running a more expensive query.

![Preview tab on BigQuery](../../../assets/bq-preview.webp)
![Preview tab on BigQuery](./bq-preview.webp)

To access the preview, click on a table name from the workspace explorer and select the **Preview** tab.

Note that generating the preview may be slow for tables with large payloads, like `response_bodies` or `pages`. Also note that the text values are truncated by default, so you will need to expand the field to get the full value.
Note that generating the preview may be slow for these tables as they include large payloads. Also note that the text values are truncated by default, so you will need to expand the field to get the full value.
@@ -5,7 +5,8 @@ description: Learn about the process of testing millions of web pages each month

The HTTP Archive dataset is updated each month with data from millions of web pages. This guide explores the end-to-end release cycle from sourcing URLs to publishing results to BigQuery.

_TODO: Add a diagram_
[](https://www.plantuml.com/plantuml/uml/RL5DRnCn4BtxLppr60aaEFQ0Aie9eHAQ5BXExEckXMCRUvme5tuxpdgtkrNNKYpvPTxi-xZBGadAqIb3GWVAZFlqz5l5Ybfj8tcf09qTfrVOraPsrZCeu-PBfRuWDujDBXIpav2eQuC3W8RKGHsSOoqs-8n7ZY59dicVRVUZSBeezH244KwS1cctAB4EiK7m-EWDzeMpeGl2CwHd78ENNgbHzBjFZRCB9Udc3K-Ftx8YBVP4mY_kK8-QJApHZYp9wWLp6bOBscZZ5jjoS3Rtk0-9yOiF-6c5NCQUTU-32zq5QPXLXjziR6Aw54fi-l2tS66OakWQ5_xX0yxCVuP15q94v8G3YUwlEKJgE2kCPuvYqKVrHgT1sRQ-zfm5ShqIv-8au-lk-yFR3LCfZVsQORq4PA7E-WwrH3TAO6zK_IqgcMpYTdJtR7tDYatpRNYzd9R7sKflFGYrym5VJszwp9hdJWm9GG8scruaKjAzFV5xVVtMPZFycrdcBUl5jlQQnPKA1yjtzIf7zny0)
![Release cycle diagram](./release_cycle_diagram.svg)

## Sourcing URLs

@@ -18,17 +19,13 @@ CrUX also includes origins without any distinct form factor data. HTTP Archive c
Previously, HTTP Archive would start testing each web page (the crawl) on the first of the month. Now, to be in closer alignment with the upstream CrUX dataset, HTTP Archive starts testing pages as soon as the CrUX dataset is available on the second Tuesday of each month. Crawl dates are always rounded down to the first of the month, regardless of which day they actually started. For example, the June 2023 crawl kicks off on the 13th of the month, but the dataset would be accessible on BigQuery under the date `2023-06-01`.
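As a sketch of what that rounding means for queries (assuming the `httparchive.crawl.pages` table and its `date` column described in the BigQuery guides), the crawl that started mid-June is still addressed by the first of the month:

```sql
-- The crawl that kicked off on 2023-06-13 is published under 2023-06-01
SELECT COUNT(0) AS pages_tested
FROM `httparchive.crawl.pages`
WHERE
  date = '2023-06-01' AND
  client = 'mobile'
```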

:::note
As of [May 2023](https://httparchive.org/reports/state-of-the-web?start=2023_04_01&end=2023_05_01&view=list#numUrls) there are 16.6 million mobile pages and 12.8 million desktop pages. It takes 1–2 weeks to test all of these pages, so the crawl is usually complete by the end of the month.
As of [May 2023](https://httparchive.org/reports/state-of-the-web?start=2023_04_01&end=2023_05_01&view=list#numUrls) there are 16.6 million mobile pages and 12.8 million desktop pages. It takes 1–2 weeks to test all of these pages, so the crawl is usually complete in the second half of the month.
:::

## Publishing the raw data

As each page's test results are completed, the raw data is saved to a public Google Cloud Storage bucket. Once the crawl is complete, the data is processed and published to BigQuery. The BigQuery dataset is available to the public for analysis.

There isn't currently a way to be notified when a new crawl is available to query.
As each page's test results are completed, the raw data is saved to a public Google Cloud Storage bucket. Once the crawl is complete, the data is processed and published to BigQuery. The `httparchive.crawl` dataset is available to the public for analysis.

## Generating reports

The reports on the HTTP Archive website are automatically generated as soon as the BigQuery data is available.

Auxilliary reports like the [Core Web Vitals Technology Report](https://cwvtech.report/) are generated manually soon after the data becomes available.
The reports on the [HTTP Archive website](https://httparchive.org/reports) and auxiliary ones like the [Core Web Vitals Technology Report](https://httparchive.org/reports/techreport/landing) are automatically generated as soon as the data is available in BigQuery.